Snakemake workflow: ChimereNanoporePipeline


A Snakemake workflow that takes fastq files as input and produces a mitochondrial genome assembly. The workflow integrates different bio-informatic tools such as Porechop ABI, Prowler and SACRA, together with a number of Python scripts that gather output and generate an HTML report to visualize the results of each step in the pipeline.

The main purpose is to detect chimeric reads in a dataset and split them into non-chimeric sequences. The corrected sequences are then used to generate a mitochondrial genome assembly.


WGA - Chimera reads

Multiple displacement amplification (MDA) is a whole-genome amplification method used to rapidly amplify DNA samples. It is very efficient at increasing small amounts of DNA with high genome coverage, owing to strand displacement and a low error rate. The process starts by annealing random hexamer primers to the ssDNA template, so synthesis starts at multiple sites on the template DNA. Chain elongation is mediated by polymerase phi 29, which has high proofreading and strong strand-displacement activity. When phi 29 polymerase encounters a downstream primer, its strong strand-displacement activity gradually displaces the downstream strand from its 5'-end. Chain elongation continues as multiple rounds of hexamer primers and polymerases are added to the newly generated ssDNA strands. The exponential growth of DNA produces a branched structure, generating clusters of DNA molecules.

While MDA builds this branched structure, the displaced ssDNA strands compete with the newly generated template. When a displaced strand re-attaches to the template, the newly generated strand falls off at its 3'-end. This extended 3' strand then binds to other secondary structures, creating chimeric reads. Chimeras are multiple stretches of DNA sequence joined together, also called split reads.

When the phi 29 polymerase carrying the newly generated strand attaches to another displaced ssDNA strand, chain elongation continues along the new template, creating an ssDNA molecule of two or more regions that do not belong together. These ssDNA molecules are inverted chimeras. It is also possible that the phi 29 polymerase binds the original template in a region with a similar, but not identical, base sequence; elongation then jumps from the displaced 3'-end of the ssDNA to the new annealing position of the phi 29 polymerase.

Installation

Start by cloning the repository to your local system. You can use SSH, but this requires a key pair on your local computer with the public key added to your GitHub account.

git clone git@github.com:GuillaumeLeBerreBIT/ChimereNanoporePipeline.git

You can also use the HTTPS link.

git clone https://github.com/GuillaumeLeBerreBIT/ChimereNanoporePipeline.git

Snakemake requires a conda installation on your system. The preferred conda distribution is mambaforge, since it ships the required Python commands and mamba, which resolves installations very quickly. Miniconda works as well and can be downloaded with:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

When the Miniconda download is complete, you can start the local installation. Where on your system to install it is completely up to you.

bash Miniconda3-latest-Linux-x86_64.sh

Snakemake will use conda to install the environments provided by the YAML files. The configuration of the .condarc file is very important for Snakemake!

conda config --add channels conda-forge
conda config --add channels bioconda

The resulting .condarc file looks like this.

auto_activate_base: false
channels:
 - bioconda
 - conda-forge
 - defaults
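
Each rule installs its software from an environment file in the workflow/envs folder, drawing packages from these channels. A minimal sketch of what such a file can look like (the file name and pinned version below are assumptions, not copied from the repository):

# e.g. workflow/envs/flye.yaml (hypothetical example)
channels:
 - conda-forge
 - bioconda
dependencies:
 - flye=2.9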

WATCH OUT! If channel_priority is set to strict, the conda installations will fail. You can delete this setting from the .condarc file.
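
For example, the setting can be removed with conda itself:

conda config --remove-key channel_priority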

Snakemake can be installed using conda. Mamba can be used as well if it is installed.

conda create -c conda-forge -c bioconda -n snakemake snakemake=7.25.0

Check that Snakemake is installed. The version should be 7.25 if it was installed with the previous command.

conda activate snakemake
snakemake --help
snakemake --version

Snakemake uses multiple scripts; one of them, SACRA.sh, calls several other scripts. The path to the scripts folder therefore needs to be added to your .bashrc file.

nano ~/.bashrc

Add the path to the scripts folder to the PATH variable. The path depends on where the repository was cloned/placed. This example assumes the folder is placed in the home directory.

export PATH=$PATH:~/ChimereNanoporePipeline/workflow/scripts/

Source the file or restart your terminal to apply the new PATH variable.

source ~/.bashrc
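
To verify that the scripts are now found on the PATH, you can for example run:

which SACRA.sh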

Repository structure

An overview of how the repository is set up. The results of the pipeline are collected in the results folder, which contains subfolders named after the identifier. The config file used by Snakemake is in the config folder and contains all parameters used in the Snakefile. The Snakefile itself sits in the workflow folder, which also holds the data generated by the different rules of the pipeline and contains the scripts and envs folders used by those rules.

├── Cmaenas_minion_data
├── config
├── resources
│ ├── DIAMOND-DB
│ ├── DIAMOND-Genes
│ ├── MITOS2-DB
│ ├── Styles
│ └── software
├── results
│ └── fastq_runid_211
├── tuning-params
│ ├── config
│ ├── input-file
│ └── scripts
└── workflow
  ├── AssemblyFasta
  ├── Diamond
  ├── FlyeResults
  ├── MITOS2Results
  ├── PorechopABI
  ├── ProwlerProcessed
  ├── SACRAResults
  ├── envs
  ├── scripts
  └── snakefile-templates

Usage

To run the pipeline, go to the workflow folder. From there you can use the following command (not recommended on its own; see the --jobs option below). When the pipeline is run for the first time, all programs it uses are installed by resolving the dependencies of the YAML files through conda, which can take a long time. NOTE! Activate the snakemake conda environment before running.

snakemake --use-conda
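
If you only want to build the conda environments without executing any rules yet (to get the long installation step out of the way), Snakemake provides a flag for that:

snakemake --use-conda --conda-create-envs-only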

Depending on the system you are using, you will have a limited amount of resources. You can limit the number of threads used with the --jobs option. Using --jobs 8 will run 8 threads in parallel; depending on your system you can lower or raise this value.

snakemake --use-conda --jobs 8

Depending on the number of cores available, you may need to lower this value. It is very important to set the right number of threads/jobs. If --jobs is set to a different value than shown here, you NEED to change the number of threads the Porechop ABI rule can use. If there are not enough threads available for the Porechop ABI rule, or too many Porechop rules run in parallel, the pipeline will crash.

Other programs are able to scale their processes down when the specified number of cores is not available.
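
For illustration, the number of threads a rule may use is declared with a threads: directive in the Snakefile, and this is the value that has to stay in line with --jobs. The rule below is a simplified, hypothetical sketch (rule name and paths are placeholders, not copied from the Snakefile):

rule porechop_abi_example:
    input:
        "PorechopABI/{sample}.fastq"
    output:
        "PorechopABI/{sample}_trimmed.fastq"
    # Keep this in line with the value passed to --jobs (see the note above)
    threads: 8
    shell:
        "porechop_abi -abi --no_split -i {input} -o {output} -t {threads}"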

The Snakefile uses config_snakemake.yaml, which contains all parameters used in the pipeline. Before starting Snakemake, specify the input folder with startfolder:, pointing to the folder that contains all fastq files to parse through the pipeline.

startfolder: "../Cmaenas_minion_data"

A unique identifier is used to create folder and file names, so the results of different runs are stored separately. WATCH OUT! Reusing an identifier while files from a previous run are still present may overwrite files or mix files from different experiments. The identifier can also serve as a species identifier, so that files of a previously analysed species can be added later to extend earlier results.

identifier: "fastq_runid_211"

Credits

Credits to the authors of the different tools that were used to create this pipeline.

Bio-informatics tool for adapter removal: Porechop ABI

Bio-informatics tool for quality trimming: ProwlerTrimmer

Bio-informatics tool for chimera removal: SACRA Paper

Bio-informatics tool for BLASTX: DIAMOND Paper

Bio-informatics tool for assembly: Flye

Bio-informatics tool for assembly: SPAdes

Bio-informatics tool for gene annotation: MITOS2

Code Snippets

shell:
    """
    Bandage image {input} {output} --names --lengths --depth --fontsize 4
    """
shell: 
    """
    python3 scripts/ConcatFiles.py {params.fileSacra} {output} 
    """
shell: 
    """
    for db in ATP6 ATP8 COX1 COX2 COX3 CYTB NAD1 NAD2 NAD3 NAD4 NAD4L NAD5 NAD6; 
    do
        diamond blastx -d resources/DIAMOND-DB/"$db" -q {input} -k {params.k} -f {params.f} \
        -p 4 -o "Diamond/{params.folder}/{params.folder}Diamond_$db.csv"
    done
    """
shell:
    """
    python3 scripts/DiamondToAssembly.py -i {params.idper} -l {params.length} -e {params.evalue} Diamond/{params.folder} {params.sacraF} {output}
    """
shell:
    """
    python3 scripts/Filtering_SACRA_sequences.py -b {params.bases} {input} {output}
    """
shell: 
    """
    python3 scripts/StatisticalReportGenerator.py {output} {params.poreStat} {params.poreFastq} {params.prowFold} {input.sacraF} {input.BandageAssem} {params.mitosFold} 
    """
shell:
    """
    unset _JAVA_OPTIONS
    runmitos.py -i {input} -c {params.gencode} -r {params.MitRefSeq} -o {params.MitosFolder} --refdir {params.RefFolder}
    """
shell: 
    """
    porechop_abi -abi --no_split -i {input} -o {output.reads} -t 8 > {output.statistics} 2>&1
    """
shell:
    """
    python3 scripts/TrimmerLarge.py -f {input.out_fastq} -i PorechopABI/{params.folder}/ -o ProwlerProcessed/{params.folder}/ \
    -m {params.trimmode} -c {params.clip} -g {params.fragments} -q {params.qscore} -w {params.windowsize} -l {params.minlen} \
    -d {params.datamax} -r '.fasta'
    """
shell: 
    """
    scripts/SACRA.sh -i {input} -p {output} -t 4 -c ../config/MyConfig.yaml
    rm {params.blasttab}* {params.bck} {params.des} {params.prj} {params.sds} {params.ssp} {params.suf} {params.tis}
    """
shell:
    """
    spades.py --only-assembler -s {input} -o {params.SpadesFolder} -t 8
    """
import os, argparse, re

#####################################################################
# COMMAND LINE INPUT
#####################################################################
parser = argparse.ArgumentParser(description='Concatenate files')                                                         
parser.add_argument('inputFolFil', type=str, 
                    help='Provide the folder containing all the fastq files to be concatenated (or a file from that folder).')
parser.add_argument('outputFile', type=str, 
                    help='Give a (path to and) name for the output file.')
#parser.add_argument('targetNum', type=int, 
#                    help='Give a target File.')
args = parser.parse_args()

#####################################################################
# FILE HANDLING
#####################################################################
# Check whether the path to the files is given OR a file from that folder
if os.path.isdir(args.inputFolFil):
    path_to_files = args.inputFolFil
else:
    # Splitting the file from the path
    splitted_path = os.path.split(args.inputFolFil)
    # Only use the path to move further
    path_to_files = splitted_path[0]

## Setting a flag to break/continue the loop
#flag = 0
## While the amount of total files is not present keep looping until all files are collected 
#while flag == 0:
#    # Changes when the folder of SACRA results has all files
#    if args.targetNum != len(os.listdir(path_to_files)):
#        flag = 0
#        #print(len(os.listdir(path_to_files)))
#    else:
#        # Loop breaks 
#        flag = 1

# Open a file to write everything in to
with open(args.outputFile, "w") as file_to_write:
    # Get every file from the directory
    for filename in os.listdir(path_to_files):
        # Extra failsafe to only get the resulting fasta after SACRA. After changing the Snakefile locations of the resulting files this is not necessary anymore and can be deleted.  
        if not re.search(".csv",filename) and\
        not re.search(".non_chimera.fasta",filename) and\
        not re.search(".split.fasta", filename):

            # Paste the path to the file together with the filename
            file_path = os.path.join(path_to_files, filename)
            # True == File
            if os.path.isfile(file_path):
                # Open the file 
                with open(file_path, 'r') as file_to_read:
                    # By reading in the file, can write the output each time to the outputfile. 
                    file_lines = file_to_read.read()
                    file_to_write.write(file_lines)
import os, argparse, csv, re
from Bio import SeqIO   # pip install biopython
import matplotlib.pyplot as plt
import numpy as np

#####################################################################
# COMMAND LINE INPUT
#####################################################################
parser = argparse.ArgumentParser(description='From DIAMOND to Assembly')
parser.add_argument('inputFolder', type=str, 
                    help='Give the (path to and) name of the folder containing the files after DIAMOND.')
parser.add_argument('sacraFiltered', type=str, 
                    help='Give the Fasta file containing the filtered reads on a certain threshold.')
parser.add_argument('outputAssembly', type=str, 
                    help='Give the (path to and) name for the fasta file with the output reads after filtering and DIAMOND matches with the DB.')
parser.add_argument('-i', '--identity', type=int, default = 70, required = False, 
                    help ='Give the percentage identity to use as the minimum threshold.')
parser.add_argument('-l', '--len', type=int, default = 20, required = False, 
                    help ='Give the minimum alignment length (in residues) to use as threshold.')
# Will later convert it to a float -- > YAML passes it as a string
parser.add_argument('-e', '--evalue', type=str, default = 10e-6, required = False, 
                    help ='Give the maximum e-value used to filter the sequences from DIAMOND.')
args = parser.parse_args()

#####################################################################
# FILE HANDLING
#####################################################################
# VARIABLES & LISTS (set before loop or will reset on each new file)
# Will create a list with all the headers validated by the filter parameters
filtered_list = []
# Need a list with all the crab families to match on 
crab_genus = ["Carcinus", "Charybdis", "Thalamita"]
# Testing for the Enoplea class >> Halalaimus
enoplea = ["Eucoleus","Longidorus","Trichinella", "Trichuris"]
# Testing for the Chromadorea class >> Acantholaimus & Desmoscolex
chromadorea = ["Gnathostoma", "Rhigonema", "Strongyloides", "Camallanus", "Meloidogyne"]
# Empty dictionary to save the amount of matches per gene
gene_matches = {}
# Empty dictionary for saving the headers
sequence_count = {} 

# The pipeline starts from the ChimereNanoporePipeline/ 
# Adding the folder containing the files after Diamond
# Do check if the user gave the slash with it at the end as well or not
if args.inputFolder[-1] == '/':    
    diamond_folder = args.inputFolder 
else:
    diamond_folder = args.inputFolder + "/"
# Using the unique identifier as name
splitted_folder = diamond_folder.split('/')
identifier = splitted_folder[1]

# Get each file from the list (13 files) and open each one of them to read its content. 
for file in os.listdir(diamond_folder):
    #Split the file name to get the gene name extracted >> Splitting will return a list
    splitted_gene = file.split("_")
    # Saving the gene name and then removing the .csv
    # Taking the last item from the list, since it can vary depending on name given.
    gene_name = splitted_gene[-1].replace(".csv","")
    # Give each gene a starting value of 0
    if gene_name not in gene_matches:
        gene_matches[gene_name] = 0
    else:
        continue

    # Link each file again with the full path to open it
    file_to_open = diamond_folder + file

    # Can open each csv file
    with open(file_to_open, "r") as csv_file:
        csv_reader = csv.reader(csv_file, delimiter='\t')
        # Each line from the csv file be returned as list
        # 0.  qseqid      query or source (gene) sequence id <-- 
        # 1.  sseqid      subject or target (reference genome) sequence id <-- 
        # 2.  pident      percentage of identical positions <--
        # 3.  length      alignment length (sequence overlap) <--
        # 4.  mismatch    number of mismatches
        # 5.  gapopen     number of gap openings
        # 6.  qstart      start of alignment in query
        # 7.  qend        end of alignment in query
        # 8.  sstart      start of alignment in subject
        # 9.  send        end of alignment in subject
        # 10.  evalue      expect value <--
        # 11.  bitscore    bit score
        for row in csv_reader:
            """
            print(f"Sequence ID: {row[0]}\
                  \nTarget Species_Gene: {row[1]}\
                  \nPercentage Identical positions: {row[2]}\
                  \nAlignement length: {row[3]}\
                  \nE-value: {row[10]}")
            """
            #################
            # FILTER 
            #################
            # First item of each row is the header name 
            header = row[0]
            # Need to split the sseqid to match on the Crab genus names
            splitted_genus = row[1].split("_")
            genus_name = splitted_genus[0]   
            gene = splitted_genus[2]
            # The float will function in the same way as the integers and can now compare numbers
            # Convert str -- > float/int
            # args.identity == Percentage Identity
            # args.len == Min length of residues 
            # args.evalue == Max e-value a sequence can have as treshold
            if float(row[2]) >= args.identity and\
                int(row[3]) >= args.len and\
                float(row[10]) <= float(args.evalue) and\
                genus_name in crab_genus:

                # Get all the headers that match the set filters
                # The headers that will be used to create FASTA file for assembly 
                # Add a '>' since the fasta files start with a '>' 
                filtered_list.append('>' + header)
                # For each matching gene name increase the value with 1
                gene_matches[gene] += 1
                # Dictionary for all headers, to see if reads are found in multiple genes or not 
                if header not in sequence_count:
                    sequence_count[header] = 1
                else:
                    sequence_count[header] += 1

# Need to know as well how many hits were not matched after the BLAST
# Re-read the SACRA-filtered input file (the DIAMOND query) and loop over the headers to check if they are in the dictionary or not.  
# If they are not in the dictionary, set them to a value of 0   
for seq_record in SeqIO.parse(args.sacraFiltered, "fasta"):
    if seq_record.id not in sequence_count:
        sequence_count[seq_record.id] = 0

# Now reverse the logic and count, per number of gene hits, how many headers have that count.
# Then you can see how often a sequence has or has not been matched after BLASTing against 
# the genes from the database. 
header_freq = {}
for val in sequence_count.values():
    # Count every occurrence, including the first one
    if val not in header_freq:
        header_freq[val] = 1
    else:
        header_freq[val] += 1

# Header list with all IDs            
#print(filtered_list)
# Hits per gene
#print(gene_matches)
# The count of how many times a read seen per hit 
#print(sequence_count)
# Can check now the freq of how much a header has been found in a dictionary
#print(header_freq)

#####################################################################
# WRITE TO FASTA
#####################################################################
# HANDLING THE FILE
# Opening a file to write to
with open(args.outputAssembly, "w") as file_to_write:
    # Opening a file to read to
    with open(args.sacraFiltered, "r") as file_to_read:
        # This will create a list based on the newlines, containing strings
        reading_lines = file_to_read.readlines()
        # Setting a counter
        counter_2 = 0
        counter_3 = 0
        counter_4 = 0
        # Setting a flag
        flag = 0 
        # Iterating over the lines in the list
        for line in reading_lines:

            # Check for header lines/ record IDs
            if re.search("^>", line):
                # Only want to retain the beginning of the line, this will be the only identical part of the line
                # Split on spaces
                splitted_header = line.split(" ")
                # Retain the ID:START-END == >@9030c343-3ff2-4b66-ab1b-c1e3327c3de0:1-188
                iso_header = splitted_header[0]
                # If the header is in the list set the flag to 1 == to write
                if iso_header in filtered_list:
                    counter_2 += 1
                    flag = 1
                # If the header is not in the list then set flag to 0 == not write
                elif iso_header not in filtered_list:
                    flag = 0

            # This flag will allow printing the lines to a file
            if flag == 1:
                file_to_write.writelines(line)
                counter_3 += 1
            # When flag is 0 it will skip the lines. 
            elif flag == 0:
                counter_4 += 1
                continue
# Closing the files for good practice
    file_to_read.close()
file_to_write.close()

# To get information out the parsed reads.
"""
print(f"File handled for checking matching lines (manually): {counter_2}\
      \nLines printed {counter_3}\
      \nLines skipped {counter_4}")
"""
#####################################################################
# VISUALIZATION
#####################################################################
# Have a predefined width so the x-axis is more spread out
plt.figure(figsize=(10,5))
# PLOT HITS PER GENE
bar_colors = ['tab:red','tab:blue','tab:brown','tab:orange','tab:green','tab:purple','tab:cyan','tab:olive','tab:pink','yellow', 'forestgreen','darkred','khaki']
# The x-axis values from the dictionary plotted  per tick
plt.bar(range(len(gene_matches)), gene_matches.values(), align='center', color=bar_colors)
# The x-axis labels from the dictionary plotted 
plt.xticks(range(len(gene_matches)), gene_matches.keys())
# x-axis label
plt.xlabel('Genes')
# y-axis label
plt.ylabel('No. of hits')
# Title for the plot
plt.title('DIAMOND BLAST Results')
plt.savefig(f"../results/{identifier}/{identifier}Bar-HitsPerGene-DIAMOND&Filtering.png", dpi=200, bbox_inches='tight')
# Close the plot
plt.clf()

# REPORT HOW MANY TIMES HEADER BEEN MATCHED
with open(f"../results/{identifier}/{identifier}HeaderCountDIAMOND.txt","w") as tmp_to_write:
    for key, val in header_freq.items():
        tmp_to_write.writelines(f"For the headers found {key} time(s) after DIAMOND: {val} sequences.\n")

# HISTOGRAM LEN OF THE READS
assembly_seq_len = []
for seq_record in SeqIO.parse(args.outputAssembly, "fasta"):
    assembly_seq_len.append(len(seq_record))

# Convert the read length lists to numpy arrays for plotting
assembly_array = np.array(assembly_seq_len)

# Set a predefined picture size to save in format 
plt.figure(figsize=(7,5))
# Plot a histogram of sequence lengths after DIAMOND & Filtering BLAST matches
# Setting amount of bins & range of the graph. 
plt.hist(assembly_array, bins = 40, range = [min(assembly_array), 1000])
# Setting title, x and y labels. 
plt.title('Length of sequences after DIAMOND & Filtering')
plt.xlabel('Sequence length')
plt.ylabel('Frequency')
# Determining to show the interval of x-axis ticks. 
plt.xticks(np.arange(0, 1000, 100))
plt.savefig(f"../results/{identifier}/{identifier}Hist-SequenceLengthAfterDIAMOND&Filtering.png", dpi=200, bbox_inches='tight')
# Savefig does not close the plot. 
plt.clf()
import argparse
from Bio import SeqIO   # pip install biopython
#######################################
# COMMAND LINE INPUT
#######################################

parser = argparse.ArgumentParser(description='Filter length sequences')                                                         
parser.add_argument('inputFile', type=str, 
                    help='Give the input fasta file to filter the read lengths on. The path to the file can be included.')
parser.add_argument('outputFile', type=str, 
                    help='Give an output fasta file name to write to. The path to the file can be included.')
parser.add_argument('-b', '--bases', type=int, default = 50, required = False, 
                    help ='Give the minimum number of bases to use as threshold to filter on.')
args = parser.parse_args()

############################## BIOPYTHON ##############################

# Parse the input file want to filter on
input_seq_iterator = SeqIO.parse(args.inputFile, "fasta")
# Using a generator expression, more memory efficient than a list comprehension building a list of records.
treshold_seq_iterator = (record for record in input_seq_iterator if len(record.seq) >= args.bases)
# Writes the wanted sequences to the output file
SeqIO.write(treshold_seq_iterator, args.outputFile, "fasta")


"""
#######################################
# FILE HANDLING
#######################################
# GATHERING ALL IDS ABOVE X # OF NT
list_wanted_records = []
# Setting a counter
#counter_1 = 0
# Reading in the input file 
for seq_record in SeqIO.parse(args.inputFile, "fasta"):
    # Skip reads with fewer bases than X, default 50
    if len(seq_record) < args.bases:
        continue

    # Keep reads with at least X bases, default 50
    elif len(seq_record) >= args.bases:
        # The records do not have the starting ">", so add it here so the parsing can be done much easier. 
        # Will have == >@9030c343-3ff2-4b66-ab1b-c1e3327c3de0:1-188
        list_wanted_records.append(">" + seq_record.id)
        #counter_1 += 1 

# HANDLING THE FILE
# Opening a file to write to
with open(args.outputFile, "w") as file_to_write:
    # Opening a file to read to
    with open(args.inputFile, "r") as file_to_read:
        # This will create a list based on the newlines, containing strings
        reading_lines = file_to_read.readlines()
        # Setting a counter
        #counter_2 = 0
        #counter_3 = 0
        #counter_4 = 0
        # Setting a flag
        flag = 0 
        # Iterating over the lines in the list
        for line in reading_lines:

            # Check for header lines/ record IDs
            if re.search("^>", line):
                # Only want to retain the beginning of the line, this will be the only identical part of the line
                # Split on spaces
                splitted_header = line.split(" ")
                # Retain the ID:START-END == >@9030c343-3ff2-4b66-ab1b-c1e3327c3de0:1-188
                iso_header = splitted_header[0]
                # If the header is in the list set the flag to 1 == to write
                if iso_header in list_wanted_records:
                    #counter_2 += 1
                    flag = 1
                # If the header is not in the list then set flag to 0 == not write
                elif iso_header not in list_wanted_records:
                    flag = 0

            # This flag will allow printing the lines to a file
            if flag == 1:
                file_to_write.writelines(line)
                #counter_3 += 1
            # When flag is 0 it will skip the lines. 
            elif flag == 0:
                #counter_4 += 1
                continue
# Closing the files for good practice
    file_to_read.close()
file_to_write.close()

# To get information out the parsed reads.
#print(f"File readed through Biopython: {counter_1}\
#      \nFile handled for checking matching lines (manually): {counter_2}\
#      \nLines printed {counter_3}\
#      \nLines skipped {counter_4}")
"""
Version="2.0"
#################################

function parse_yaml {
  local prefix=$2
  local s='[[:space:]]*' w='[a-zA-Z0-9_]*' fs=$(echo @|tr @ '\034')
  sed -ne "s|^\($s\)\($w\)$s:$s\"\(.*\)\"$s\$|\1$fs\2$fs\3|p" \
  -e "s|^\($s\)\($w\)$s:$s\(.*\)$s\$|\1$fs\2$fs\3|p" $1 |
  awk -F$fs '{
    indent = length($1)/2;
    vname[indent] = $2;
    for (i in vname) {if (i > indent) {delete vname[i]}}
    if (length($3) > 0) {
      vn=""; for (i=0; i<indent; i++) {vn=(vn)(vname[i])("_")}
      printf("%s%s%s=\"%s\"\n", "'$prefix'",vn, $2, $3);
    }
   }'
}

while getopts ":i:p:t:c:" o; do
    case "${o}" in
        i)
            i=${OPTARG}
            ;;
        p)
            p=${OPTARG}
            ;;
        t)
            t=${OPTARG}
            ;;
        c)
            c=${OPTARG}
            ;;
        *)
            usage
            ;;
    esac
done
shift $((OPTIND-1))


if [ -z "${i}" ] || [ -z "${p}" ] || [ -z "${t}" ] || [ -z "${c}" ]; then
    echo "Usage: $0 [-i <input fasta file>] [-p <prefix>] [-t <max no. of cpu cores>] [-c <config.yml>]" 1>&2;
    exit 1;
fi

eval $(parse_yaml $c)

echo -e "***** [$0, Version: $Version] start " `date +'%Y/%m/%d %H:%M:%S'` " *****"
echo -e "[`date +'%Y/%m/%d %H:%M:%S'`] $0 -i $i -p $p -t $t -c $c\n"

########## STEP1 ##########
echo "[`date +'%Y/%m/%d %H:%M:%S'`] STEP 1. alignment: All vs all pairwise alignment of long-read by LAST aligner"
makedb_cmd="lastdb8 -P $t -R $alignment_R -u $alignment_u $i $i"

echo "[`date +'%Y/%m/%d %H:%M:%S'`] $makedb_cmd"
eval $makedb_cmd

alignment_cmd="lastal8 -a $alignment_a -A $alignment_A -b $alignment_b -B $alignment_B -S $alignment_S -P $t -f $alignment_f $i $i > $i.blasttab"
echo "[`date +'%Y/%m/%d %H:%M:%S'`] $alignment_cmd"
eval $alignment_cmd
id=`grep -v "#" $i.blasttab  | shuf -n 1000 | awk '{m+=$3} END{print m/NR}'`
echo "[`date +'%Y/%m/%d %H:%M:%S'`] Average read-vs-read identity (%): $id"
echo -e "[`date +'%Y/%m/%d %H:%M:%S'`] DONE\n"
###########################

########## STEP2 ##########
echo -e "[`date +'%Y/%m/%d %H:%M:%S'`] STEP 2. pars depth: Detecting the partially aligned reads (PARs)"
parsdepth_cmd="SACRA_PARs_depth.pl -i $i.blasttab -al $parsdepth_al -tl $parsdepth_tl -pd $parsdepth_pd -id $parsdepth_id > $i.blasttab.depth"
echo "[`date +'%Y/%m/%d %H:%M:%S'`] $parsdepth_cmd"
eval $parsdepth_cmd 
echo -e "[`date +'%Y/%m/%d %H:%M:%S'`] DONE\n"
###########################

########## STEP3 ##########
echo -e "[`date +'%Y/%m/%d %H:%M:%S'`] STEP 3. cal pc ratio: Obtaining the PARs/CARs ratio (PC ratio) at the putative chimeric positions"
pcratio_cmd="SACRA_multi.pl $t $i.blasttab.depth"
echo "[`date +'%Y/%m/%d %H:%M:%S'`] $pcratio_cmd"
eval $pcratio_cmd
for k in `ls $i.blasttab.depth.split*`
do
    echo "[`date +'%Y/%m/%d %H:%M:%S'`] SACRA_PCratio.pl -i $i.blasttab -pa $k -ad $pcratio_ad -id $pcratio_id > $k.pcratio"
    SACRA_PCratio.pl -i $i.blasttab -pa $k -ad $pcratio_ad -id $pcratio_id > $k.pcratio & 
done
# wait for all background jobs to finish
wait

cat="cat $i*.depth.split*.pcratio > $i.blasttab.depth.pcratio"
echo "[`date +'%Y/%m/%d %H:%M:%S'`] $cat"
eval $cat
rm -rf $i*.depth.split*
echo -e "[`date +'%Y/%m/%d %H:%M:%S'`] DONE\n"
###########################

########## STEP4 ##########
echo -e "[`date +'%Y/%m/%d %H:%M:%S'`] STEP 4. cal mPC ratio: Calculate mPC ratio based on spike-in reference genome"
if [ $mpc_sp = true ] && [ -e $mpc_rf ]; then
    echo "[`date +'%Y/%m/%d %H:%M:%S'`] mPC ratio was calculated based on provided spike-in reference genome."
    echo "[`date +'%Y/%m/%d %H:%M:%S'`] Spike-in reference genome: $mpc_rf"

    rfdb_cmd="lastdb8 -P $t -R $mpc_R -u $mpc_u $mpc_rf $mpc_rf"
    echo "[`date +'%Y/%m/%d %H:%M:%S'`] $rfdb_cmd"
    eval $rfdb_cmd

    rfalign_cmd="lastal8 -a $mpc_a -A $mpc_A -b $mpc_b -B $mpc_B -S $mpc_S -P $t -f $mpc_f $mpc_rf $i > $i.spike-in.blasttab"
    echo "[`date +'%Y/%m/%d %H:%M:%S'`] $rfalign_cmd"
    eval $rfalign_cmd
    grep -v "#" $i.spike-in.blasttab | awk -v "id=$mpc_id" '$3>=id' | awk -v "al=$mpc_al" '$4>=al' > $i.spike-in.blasttab.aligned
    awk '{print $1}' $i.spike-in.blasttab.aligned | sort | uniq > $i.spike-in.blasttab.aligned.id

    chimera_cmd="SACRA_detect_chimera.pl -i $i.spike-in.blasttab -id $mpc_id -al $mpc_al -lt $mpc_lt > $i.spike-in.blasttab.chimera"
    echo "[`date +'%Y/%m/%d %H:%M:%S'`] $chimera_cmd"
    eval $chimera_cmd

    chimera_pos_cmd="SACRA_detect_chimeric_position.pl $i.spike-in.blasttab.chimera > $i.spike-in.blasttab.chimera.position"
    echo "[`date +'%Y/%m/%d %H:%M:%S'`] $chimera_pos_cmd"
    eval $chimera_pos_cmd

    cal_mpc="SACRA_reference_PC.pl $i.spike-in.blasttab.aligned.id $i.blasttab.depth.pcratio $i.spike-in.blasttab.chimera.position"
    echo "[`date +'%Y/%m/%d %H:%M:%S'`] $cal_mpc"
    eval $cal_mpc

    split_pc=`sort -k4,4nr $i.blasttab.depth.pcratio.mPC | awk '{print $1}' | sed 's/mPC=//g' | head -n 1`
    echo "[`date +'%Y/%m/%d %H:%M:%S'`] Calculated mPC ratio: $split_pc"
    echo -e "[`date +'%Y/%m/%d %H:%M:%S'`] DONE\n"
else
    echo "[`date +'%Y/%m/%d %H:%M:%S'`] Spike-in reference genome is not provided, so the mPC ratio in the config file was used."
    echo "[`date +'%Y/%m/%d %H:%M:%S'`] Calculated mPC ratio: $split_pc"
    echo -e "[`date +'%Y/%m/%d %H:%M:%S'`] DONE\n"
fi
###########################

########## STEP5 ##########
echo -e "[`date +'%Y/%m/%d %H:%M:%S'`] STEP 5. split: Split chimeras at the chimeric positions"
split_cmd="SACRA_split.pl -i $i.blasttab.depth.pcratio -pc $split_pc -dp $split_dp -sl $split_sl | awk '{split(\$0,a,\":\"); split(a[2],b,\"-\"); print a[1],b[1]-1,b[2]}' > $i.blasttab.depth.pcratio.faidx"
echo "[`date +'%Y/%m/%d %H:%M:%S'`] $split_cmd"
eval $split_cmd
seqtk subseq $i $i.blasttab.depth.pcratio.faidx > $i.split.fasta
echo -e "[`date +'%Y/%m/%d %H:%M:%S'`] DONE\n"
###########################

########## Output ##########
echo -e "[`date +'%Y/%m/%d %H:%M:%S'`] Output final reads"
awk '{print $1}' $i.blasttab.depth.pcratio.faidx | uniq > $i.blasttab.depth.pcratio.faidx.id
seqkit grep -v -f $i.blasttab.depth.pcratio.faidx.id $i > $i.non_chimera.fasta
cat $i.non_chimera.fasta $i.split.fasta > $p

echo "[`date +'%Y/%m/%d %H:%M:%S'`] Split reads: $i.split.fasta"
echo "[`date +'%Y/%m/%d %H:%M:%S'`] Non-chimeras: $i.non_chimera.fasta"
echo -e "[`date +'%Y/%m/%d %H:%M:%S'`] Combined reads: $p\n"
###########################

echo "***** [$0, Version: $Version] finished " `date +'%Y/%m/%d %H:%M:%S'` " *****"
import re, os, argparse #, webbrowser, time
from Bio import SeqIO   # pip install biopython
import numpy as np      # pip install numpy
import matplotlib.pyplot as plt     # pip install matplotlib
import pandas as pd

############################# COMMAND LINE INPUT #############################

parser = argparse.ArgumentParser(description='Generate Statistical Report. The report contains the results as graphs, tables, text and figures for all the different input files/folders gathered throughout the Snakemake pipeline.')
parser.add_argument('outputFile', type=str, 
                    help='The (path to and) name of the HTML report file that will be written.')                       
parser.add_argument('porechopStat', type=str, 
                    help='The folder containing the Porechop statistic files. The files are the terminal output captured in text files.')
parser.add_argument('porechopFastq', type=str, 
                    help='The folder containing the fastq files generated by Porechop ABI.')                                  
parser.add_argument('prowlerFolder', type=str, 
                    help='Give the folder containing the fasta files generated by Prowler Trimmer.')
parser.add_argument('sacraFiltered', type=str, 
                    help='Give the Fasta file containing the filtered sequences on a certain length.')
parser.add_argument('BandagePict', type=str, 
                    help='Give the assembly picture created in BANDAGE.')
parser.add_argument('MITOS2Folder', type=str, 
                    help='Give the folder with the MITOS2 output folders.')
args = parser.parse_args()

## PARSING UNIQUE IDENTIFIER
# Use the unique identifier per sample 
uniquelabel = args.outputFile

# Use the unique label (used to create the results folder) as identifier. Split on path separators ('/') and get the unique identifier name out of the list. 
splitted_label = uniquelabel.split("/")
identifier = splitted_label[2]

############################# HANDLING FILES #######################################
# This section will contain each of the files used for statistical summary;

############################# PORECHOP REPORT  #######################################
## TERMINAL OUTPUT
# Empty list == The list used to gather all the lines containing relevant information from the terminal 
statistics_list = []

# Go over all the files in the directory
for file in os.listdir(args.porechopStat):

    # Combining the path to the file + file name -- > To open the file
    file_path = f"{args.porechopStat}/{file}" 
    # Opening the file from the current location 
    with open(file_path, "r") as text_file:
        # Readlines returns a list of lines split on newline character == "\n"
        splitted_file = text_file.readlines()

        # Flag == Define when to print/parse or not. 
        flag = 0 

        for line in splitted_file:
            # Setting up the different conditions to handle the output
            # When the re.search matches the pattern of a line it will return == True if not == False
            if re.search("Trimming adapters from read ends", line):
                flag = 1
            elif re.search("adapters trimmed from their", line):
                flag = 2
            elif re.search("reads were split based on middle adapters", line):
                flag = 3
            # When encountering a newline, flag == 0 == do not add the line to the list
            elif re.search("^\n", line):
                flag = 0

            # If any of the flags match the set value == strip the line on unwanted characters & save the line in a list
            if flag == 1 or flag == 2 or flag == 3:

                # Removing leading or trailing characters
                stripped_line = line.strip()
                # Add each stripped line to the list 
                statistics_list.append(stripped_line)     


# Empty list == Removing special characters from the lines. 
cleaned_text_list = []

for text in statistics_list:
    # If any of the patterns match in the item from statistics_list then replace it by nothing == Removing unwanted characters. 
    cleaned_text = re.sub(r'\x1b\[1m\x1b\[4m|\x1b\[0m|\x1b\[31m', '', text)
    # If the filtered item has nothing == Delete/Skip
    # Else keep it and add to a list 
    if cleaned_text == '':
        continue
    else: 
        cleaned_text_list.append(cleaned_text)   

## ADAPTERS

# Define an empty set == Will keep unique items only in a list (no duplicates). 
adapters = set()
# Save all the statistical stats of trimmed reads in a list. 
trimmed_count = []
# Loop over the list that has been fully cleaned/stripped
for item in cleaned_text_list:
    # ADAPTER LINES == ":"
    if re.search(":", item):
        adapters.add(item)
    # REMOVED READS STATS == "/"
    elif re.search("/", item):
        trimmed_count.append(item)
# Need to extract now Start - Mid - End
start = []
middle = []
end = []

# Gather the lines containg start - middle - end in each respective list. 
for item in trimmed_count:

    if re.search("start", item):
        start.append(item)
    elif re.search("middle", item):
        middle.append(item)
    elif re.search("end", item):
        end.append(item)

StartAdapRem = 0 
StartTotReads = 0
StartBasePairs = 0

# Split each item == sentence -- > Get the numbers only to generate a summary of all files together. 
for s in start:
    splitted_start = s.split(" ")
    # [0] == Nr reads adapters removed & [2] == Total Reads & [10] == BP
    StartAdapRem += int(splitted_start[0].replace(",",""))
    StartTotReads += int(splitted_start[2].replace(",",""))
    StartBasePairs += int(splitted_start[10].replace(",","").replace("(",""))

#print(f"{StartAdapRem}-{StartTotReads}-{StartBasePairs}")

MidAdapRem = 0 
MidTotReads = 0

# Split each item == sentence -- > Get the numbers only to generate a summary of all files together.
for m in middle:
    splitted_mid = m.split(" ")
    # [0] == Nr reads adapters removed & [2] == Total Reads 
    MidAdapRem += int(splitted_mid[0].replace(",",""))
    MidTotReads += int(splitted_mid[2].replace(",",""))

#print(f"{MidAdapRem}-{MidTotReads}")

EndAdapRem = 0 
EndTotReads = 0
EndBasePairs = 0

# Split each item == sentence -- > Get the numbers only to generate a summary of all files together.
for e in end:
    splitted_end = e.split(" ")
    # [0] == Nr reads adapters removed & [2] == Total Reads & [10] == BP
    EndAdapRem += int(splitted_end[0].replace(",",""))
    EndTotReads += int(splitted_end[2].replace(",",""))
    EndBasePairs += int(splitted_end[10].replace(",","").replace("(",""))

#print(f"{EndAdapRem}-{EndTotReads}-{EndBasePairs}")

############################# PROWLER REPORT #############################
# Empty lists to store the read lengths before and after trimming
before_trim = []
after_trim = []

# The Biopython library can handle fastq & fasta files. 
# It makes it much easier to read in files and to select the part of each sequence you want == .seq, .id 
# Much faster than writing it yourself. 
# Read the input fastq file and append the length of each read to before_trim list. 
for pore_file in os.listdir(args.porechopFastq):
    # Concat the filepath == To open the file
    file_path_pore = f"{args.porechopFastq}/{pore_file}"

    for seq_record in SeqIO.parse(file_path_pore, "fastq"):
            # The length of each read
            before_trim.append(len(seq_record.seq))

# Read the trimmed fasta file and append the length of each read to after_trim list
for prow_file in os.listdir(args.prowlerFolder):   
    # Due to other files present (SACRA output), only want the fasta file before SACRA. 
    if not re.search(".csv",prow_file) and\
        not re.search(".non_chimera.fasta",prow_file) and\
        not re.search(".split.fasta", prow_file):

        file_path_prow = f"{args.prowlerFolder}/{prow_file}"

        for seq_record in SeqIO.parse(file_path_prow, "fasta"):
            after_trim.append(len(seq_record.seq))

# Convert the read length lists to numpy arrays >> Plots require numpy array == [[...]]
before_array = np.array(before_trim)
after_array = np.array(after_trim)

# Create a figure with two subplots + tight layout
fig, axs = plt.subplots(1, 2, tight_layout=True, figsize = (10,5))

# Plot a histogram of read lengths before trimming + setting number of bins + setting the range of x axis
axs[0].hist(before_array, bins = 40, range = [0,10000])
axs[0].set_title('Before trimming Reads')
axs[0].set_xlabel('Read length')
axs[0].set_ylabel('Frequency')

# Plot a histogram of read lengths before trimming + setting number of bins + setting the range of x axis
axs[1].hist(after_array, bins = 40, range = [0,10000])
axs[1].set_title('After trimming Reads')
axs[1].set_xlabel('Read length')
axs[1].set_ylabel('Frequency')

# Saving the graph picture. 
plt.savefig(f"../results/{identifier}/{identifier}Before&After-Prowler.png", dpi=200, bbox_inches='tight')
# Savefig does not close the plot. >> clf = close
plt.clf()

############################# SACRA REPORT #############################
# Defining empty lists beforehand == Amount of records
prow_records = []
chim_records = []
nonchim_records = []
# Gather seq len from Non-Cimera and Chimera reads -- > After SACRA 
sacra_seq_len = []

### PROWLER SEQUENCES
# Loop over the files in the folder. 
for prow_file in os.listdir(args.prowlerFolder):   
    # Want one specific file == the Prowler .fasta (exclude the SACRA output files)
    if not re.search(".csv",prow_file) and\
        not re.search(".non_chimera.fasta",prow_file) and\
        not re.search(".split.fasta", prow_file):

        file_path_prow = f"{args.prowlerFolder}/{prow_file}"
        # Reading in the Prowler Fasta file 
        for seq_record in SeqIO.parse(file_path_prow, "fasta"):
            # Append each record to a list
            prow_records.append(seq_record.id)

# Counting the number of IDs == Number of sequences >> total number of sequences present in the fasta file.
count_prow = len([record for record in prow_records])

### CHIMERA SEQUENCES
# Loop over the files in the folder.
for chim_file in os.listdir(args.prowlerFolder):

    if re.search(".split.fasta", chim_file):

        file_path_chim = f"{args.prowlerFolder}/{chim_file}"
        # Reading in the fasta file with chimera sequences
        for seq_record in SeqIO.parse(file_path_chim, "fasta"):
            # Append each record/header to a list
            chim_records.append(seq_record.id)
            # Gathering the read lengths. >> Want all read lengths (chimera sequences can share the same header, with different start-end positions appended)
            sacra_seq_len.append(len(seq_record.seq))

# Counting the number of IDs == Number of sequences >> total number of sequences present in the fasta file.
count_chim = len([record for record in chim_records])

### UNIQUE CHIMERA SEQUENCE ID
# Will try to count and see how many reads had one/multiple chimera reads == UNIQUE HEADERS!! >> .split.fasta
# Setting empty list beforehand
record_splitted = []
# Iterating over the collected chimera sequence records
for chim in chim_records:
    # Split each item >> HEADER:START-END == Want only the first part to find unique headers
    chim_splitted = chim.split(":")
    # Only ID part retained
    record_splitted.append(chim_splitted[0])
# Set an empty set >> Set == only unique IDs
unique_chim = set()
# Iterate over the sequence IDs
for i in record_splitted:
    # Add to the set of IDs
    unique_chim.add(i)
# Counting the number of IDs == Number of UNIQUE sequences
count_unique_chim = len([record for record in unique_chim])

### NON CHIMERA SEQUENCES
for chim_file in os.listdir(args.prowlerFolder):

    if re.search(".non_chimera.fasta", chim_file):

        file_path_non_chim = f"{args.prowlerFolder}/{chim_file}"
        # Non-chimera sequences were not split, so their headers remain unique
        # Reading in the SACRA fasta file with non-chimera reads
        for seq_record in SeqIO.parse(file_path_non_chim, "fasta"):
            nonchim_records.append(seq_record.id)
            # Gathering the read lengths. 
            sacra_seq_len.append(len(seq_record.seq))

# Counting the number of IDs == Number of reads
count_nonchim = len([record for record in nonchim_records])

# Informative print of the results gathered
#print(f"Reads after Prowler: {count_prow}\
#      \nHow many chimera sequences: {count_chim}\
#      \nHow many from original sequence, are unique: {count_unique_chim}\
#      \nHow many non chimera sequences: {count_nonchim}\
#      \nSum chimera unique IDs & non chimera IDs: {(total_sacra_seq := count_unique_chim + count_nonchim)}")

### VISUALIZATION
### RAW RESULTS
# Creating a pandas dataframe >> Parse to matplotlib
sacraDf = pd.DataFrame([["No. sequences", count_unique_chim, count_nonchim]],
                       columns = ["Amount", "Chimera", "Non chimera"])
# Plotting the rows  of the No. sequences per bar.
# Setting the width of the bars bit smaller. 
# Adding colormap for visualization. 
ax = sacraDf.plot(x='Amount', kind='bar', stacked=True, width = 0.2,
                colormap = "Set3",  
                title='Total amount of chimera and non-chimera sequences after SACRA')

# Iterating over the patches to obtain the width and height + x and y coordinates
# Using the x, y coordinates to place the label in the center of corresponding bar
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    # labelling text based on gathered positions. 
    ax.text(x+width/2, 
            y+height/2, 
            '{:.0f}'.format(height), 
            horizontalalignment='center', 
            verticalalignment='center')

# For some reason have to set the ticks to 0 to get the label horizontally. 
plt.xticks(rotation=0)
# Legend location to the upper right
# bbox anchor is to change the location, placed it outside the box, bbox_to_anchor(x,y)
plt.legend(loc = 'upper right', bbox_to_anchor=(1.4, 0.95))
# Shrink current axis by 20% (x-axis)
box = ax.get_position()
ax.set_position([box.x0, box.y0, box.width * 0.8, box.height])
# Label y-axis
plt.ylabel("No. of sequences")
# Saving the picture, using bbox_inches
plt.savefig(f"../results/{identifier}/{identifier}SACRA-Stacked-Seq-Amount.png", dpi=200, bbox_inches='tight')
# Savefig does not close the plot. 
plt.clf()

### RELATIVE RESULTS
# Calculation of the relative amount of sequences. 
total_sacra_seq = count_unique_chim + count_nonchim
rel_unique_chim = (count_unique_chim / total_sacra_seq) * 100
rel_nonchim = (count_nonchim / total_sacra_seq) * 100
# Creating a pandas dataframe. >> Parse to matplotlib
sacraDf = pd.DataFrame([["No. sequences", rel_unique_chim, rel_nonchim]],
                       columns = ["Amount", "Chimera", "Non chimera"])
#Plotting the rows  of the No. sequences per bar.
# Setting the width of the bars bit smaller. 
# Adding colormap for visualization. 
ax = sacraDf.plot(x='Amount', kind='bar', stacked=True, width = 0.2,
                colormap = "Set3",  
                title='Total amount of chimera and non-chimera sequences after SACRA')
# Iterating over the patches to obtain the width and height + x and y coordinates
# Using the x, y coordinates to place it in the center of corresponding bar
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    # labelling text based on gathered positions. 
    ax.text(x+width/2, 
            y+height/2, 
            '{:.2f} %'.format(height), 
            horizontalalignment='center', 
            verticalalignment='center')

# For some reason have to set the ticks to 0 to get the label horizontally. 
plt.xticks(rotation=0)
# Shrink current axis by 20%
box = ax.get_position()
ax.set_position([box.x0, box.y0, box.width * 0.8, box.height])
# Legend location to the upper right
# bbox anchor is to change the location, placed it outside the box, bbox_to_anchor(x,y)
plt.legend(loc = 'upper right', bbox_to_anchor=(1.4, 0.95))
# Label y-axis
plt.ylabel("No. of sequences (%)")
# Saving the created plot as .png, using the dpi to set the size of the figure. bbox_inches makes sure everything is saved in the canvas. 
plt.savefig(f"../results/{identifier}/{identifier}SACRA-Stacked-Seq-Rel-Amount.png", dpi=200, bbox_inches='tight')
# Savefig does not close the plot. 
plt.clf()
### HISTOGRAM LENGTH READS
# Convert the read length lists to numpy arrays for plotting
sacra_array = np.array(sacra_seq_len)

# Plot a histogram of sequence lengths after SACRA
# Setting amount of bins & range of the graph. 
plt.hist(sacra_array, bins = 40, range = [min(sacra_array), 1000])
# Setting title, x and y labels. 
plt.title('Sequence lengths after SACRA')
plt.xlabel('Sequence length')
plt.ylabel('Frequency')
# Determining to show the interval of x-axis ticks. 
plt.xticks(np.arange(0, 1000, 100))
# Saving the figure in .png format. bbox_inches makes sure everything is saved in the canvas.
plt.savefig(f"../results/{identifier}/{identifier}SACRA-Hist-Distribution.png", dpi=200, bbox_inches='tight')
# Savefig does not close the plot. 
plt.clf()

#######################################
# SACRA FILTERING STEP
#######################################
# Lists to store the sequence lengths after the filtering of fasta file on certain length. 
filtered_sacra = []
# Read the input fastq file and append the length of each sequence to before_trim list
for seq_record in SeqIO.parse(args.sacraFiltered, "fasta"):
    filtered_sacra.append(len(seq_record.seq))
# Convert the read length lists to numpy arrays for plotting
before_array = np.array(filtered_sacra)

# Plot a histogram with a predefined number of bins & a range set from x1 to x2.
plt.hist(before_array, bins=40, range=[0,1000])
# Setting the title
plt.title('Filtering on a threshold of ' + str(min(filtered_sacra)) + ' bases')
# X-axis label
plt.xlabel('Read length')
# Setting y-axis label
plt.ylabel('Frequency')
# Determine the interval of x-axis ticks to show.
# np.arange goes from a to b in steps of a given size.
plt.xticks(np.arange(0, 1000, 100))
# Saving the figure 
plt.savefig(f"../results/{identifier}/{identifier}SACRA-Hist-FilteredSeq.png", dpi=200, bbox_inches='tight')
# Close the plot
plt.clf()

#######################################
# GENERATING HTML REPORT
#######################################
# Creating the header line
html_header = f"<! DOCTYPE html>\n<html lang='en'>\n<head>\n<meta charset='UTF-8'>\n<title>Statistical Report</title>\n<link rel='stylesheet' href='../../resources/Styles/styles.css'>\n</head>\n<body>\n<div class='main'>\n<h1 id='statisticalReport'>Statistical Report - Workflow</h1>\n<h2 id='porechopABI'>Porechop ABI</h2>\n<h3>Trimming adapters from read ends</h3>\n"

# Creating the table containing everything
# The last 3 lines are summary lines for the trimmed adapters, so everything before them can be parsed
#unorder_list = []
#unorder_list.append("<ul>\n")
#for adapter in adapters:
#
#    unorder_list.append(f"\t<li>{adapter}</li>\n")
#unorder_list.append("</ul>\n")

# Header adapter loc
adap_loc_header = "<h3>Adapters Removed</h3>\n"

# The closing tags, written at the very end of the report
html_end = "</body>\n</html>"

# WRITING TO THE FILE
# The location of the outputfile
statisticalFile = args.outputFile

#Opening the outputfile to write to
with open(statisticalFile, "w") as html_file:
    # PORECHOP ABI
    html_file.writelines(html_header)
    # Adapters
    html_file.write("<table>\n\t<tr>\n\t\t<th>Adapter</th>\n\t\t<th>Sequence</th>\n\t</tr>\n")
    # Loop over the list containing all the adapters.
    for adapter in adapters:
        splitted_adapter = adapter.split(":")
        html_file.writelines(f"\t<tr>\n\t\t<td>{splitted_adapter[0]}</td>\n\t\t<td>{splitted_adapter[1]}</td>\n\t</tr>\n")
    html_file.writelines("</table>\n")

    html_file.writelines(adap_loc_header)
    #Removed adapters
    html_file.writelines(f"\t<div class='centerdiv'>{StartAdapRem} / {StartTotReads} reads had adapters trimmed from their start ({StartBasePairs} bp removed)</div>")
    html_file.writelines(f"\t<div class='centerdiv'>{MidAdapRem} / {MidTotReads} reads had adapters trimmed from their middle </div>")
    html_file.writelines(f"\t<div class='centerdiv'>{EndAdapRem} / {EndTotReads} reads had adapters trimmed from their end ({EndBasePairs} bp removed)</div>")
    # PROWLER
    html_file.writelines("<h2 id='prowler'>Prowler Trimming</h2>\n")
    # The image path is relative to the HTML file's location, so the picture is placed in the same folder
    html_file.writelines(f"\t<img src='{identifier}Before&After-Prowler.png' height='800px'>\n")

    # SACRA
    html_file.writelines("<h2 id='sacra'>Split Amplified Chimeric Read Algorithm (SACRA)</h2>\n")
    # The image path is relative to the HTML file's location, so the picture is placed in the same folder
    html_file.writelines(f"\t<img src='{identifier}SACRA-Stacked-Seq-Amount.png' height='800px'>\n")
    html_file.writelines(f"\t<img src='{identifier}SACRA-Stacked-Seq-Rel-Amount.png' height='800px'>\n")
    html_file.writelines(f"\t<img src='{identifier}SACRA-Hist-Distribution.png' height='800px'>\n")
    # The statistical output after filtering on a certain number of bases
    html_file.writelines("<h2 id ='sacrafilt'>SACRA Filtered Reads</h2>\n")
    html_file.writelines(f"\t<img src='{identifier}SACRA-Hist-FilteredSeq.png' height='800px'>\n")

    ######################### DIAMOND #########################
    html_file.writelines("<h2 id='diamond'>DIAMOND</h2>\n")
    # Read the file containing the amount a hit has been found in the genes
    with open(f"../results/{identifier}/{identifier}HeaderCountDIAMOND.txt", "r") as text_reader:
        lines_read = text_reader.readlines()
        # Write the output in an unordered list
        html_file.writelines("<ul>\n")
        for line in lines_read:
            # Write each line as a list item, stripping the trailing \n.
            html_file.writelines(f"\t<li>{line.strip()}</li>\n")
        html_file.writelines("</ul>\n")
    # Writing the pictures to the html file. 
    html_file.writelines(f"\t<img src='{identifier}Bar-HitsPerGene-DIAMOND&Filtering.png' height='800px'>\n")
    html_file.writelines(f"\t<img src='{identifier}Hist-SequenceLengthAfterDIAMOND&Filtering.png' height='800px'>\n")

    ######################### SPAdes - BANDAGE #########################
    html_file.writelines("<h2 id='bandage'>SPAdes - BANDAGE</h2>\n")
    # Split the picture path into a head and a tail part; the tail is the picture filename
    splitted_Bandage = os.path.split(args.BandagePict)
    # Take the last item, which is the picture filename (equivalently index [1])
    html_file.writelines(f"\t<img src='{splitted_Bandage[-1]}' height='800px'>\n")

    ######################### MITOS2 ANNOTATION REPORT ###########################

    html_file.writelines("<h2 id='mitos'>MITOS2</h2>\n")

    # Sort the folder list so the files are opened in numerical order.
    list_folders = os.listdir(args.MITOS2Folder)
    list_folders.sort()
    # Empty contig list to create titles for navbar
    contig_headers = []
    # The IDS to link the headers to  
    ids_headers = []
    # Need to specify the full path to find the specific folder
    for folder in list_folders:
        #print(f"{folder}")
        # Have to give the full path with it to find the folder
        for file in os.listdir(f"{args.MITOS2Folder}/{folder}/"):
            ## GFF
            if re.search("result.gff",file):

                with open(f"{args.MITOS2Folder}/{folder}/{file}", "r") as file_to_read:
                    # Create a list of lines from the file
                    file_lines = file_to_read.readlines()
                    # Check whether the list is empty or not
                    if len(file_lines) == 0:
                        # If empty, continue with the next file
                        continue
                    # The title by getting the contig name
                    splitted_item = file_lines[0].split("\t")
                    contig_raw = splitted_item[0]
                    # Edit word
                    contig_list = contig_raw.split("_")
                    # Format a title and save it into a variable
                    formatted = contig_list[0].capitalize() + " " + contig_list[1]
                    # Add the variable to the list 
                    contig_headers.append(formatted)
                    # Create string for the ids to link to
                    ids_headers.append(contig_list[1])
                    # Write title == Contig 
                    html_file.write(f"<h3 id='{contig_list[1]}'>{contig_list[0].capitalize()} {contig_list[1]}</h3>\n")
                    # Write the table header to the file
                    html_file.write("<table>\n\t<tr>\n\t\t<th>Source</th>\n\t\t<th>Feature</th>\n\t\t<th>Start position</th>\n\t\t<th>End position</th>\n\t\t<th>Score</th>\n\t\t<th>Strand</th>\n\t\t<th>Frame</th>\n\t\t<th>Attributes</th>\n\t</tr>\n")
                    # Iterate over the lines from the list
                    for line in file_lines:
                        # The columns are tab-separated, so split on tabs
                        splitted_gff = line.split("\t")
                        # 1) seqname - name of the chromosome or scaffold
                        # 2) source - name of the data source 
                        # 3) feature - feature type 
                        # 4) start - Start position
                        # 5) end - End position
                        # 6) score - A floating point value.
                        # 7) strand - defined as + (forward) or - (reverse).
                        # 8) frame - One of '0', '1' or '2'. '0' indicates that the first base of the feature is the first base of a codon, '1' that the second base is the first base of a codon, and so on.
                        # 9) attribute - A semicolon-separated list of tag-value pairs, providing additional information about each feature.
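                        # A hypothetical, purely illustrative line of this shape (columns are tab-separated):
                        # contig_1  mitos  gene  35  1570  1.2e-08  +  .  Name=cox1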
                        #print(splitted_gff)
                        # Parse everything to a table in the file. 
                        html_file.write(f"\t<tr>\n\t\t<td>{splitted_gff[1]}</td>\n\t\t<td>{splitted_gff[2]}</td>\n\t\t<td>{splitted_gff[3]}</td>\n\t\t<td>{splitted_gff[4]}</td>\n\t\t<td>{splitted_gff[5]}</td>\n\t\t<td>{splitted_gff[6]}</td>\n\t\t<td>{splitted_gff[7]}</td>\n\t\t<td>{splitted_gff[8].strip()}</td>\n\t</tr>\n")
                    html_file.write("</table>\n")
                file_to_read.close()

            ## FASTA   
            # Only open the file whose name matches -- it will always be the same
            if re.search("result.fas", file):
                # Open the file to read from
                with open(f"{args.MITOS2Folder}/{folder}/{file}", "r") as file_to_read:
                    # Read the file's lines into a list.
                    file_lines = file_to_read.readlines()
                    #Set a counter for each file to handle
                    count = 0

                    for line in file_lines:
                        count += 1
                        # To write the first div
                        if re.search("^>", line) and count == 1:
                            html_file.write(f"\t<div class='contigpadd'>\n\t\t{line}\n\t<pre>\n")
                        # To write the divs except first and last
                        elif re.search("^>", line) and count != 1:
                            html_file.write(f"\t</pre>\n\t</div>\n\t<div class='contigpadd'>\n\t\t{line}\n\t<pre>\n")
                        # Write the sequence lines
                        elif re.search("^[A,G,C,T,U]", line):
                            # Tabs inside a <pre> block are rendered as-is.
                            html_file.write(f"\t{line}")

                    # To write the last div block
                    if count == len(file_lines):
                        html_file.write(f"\t</pre>\n\t</div>\n")
                # Close file
                file_to_read.close()

    html_file.write(f"</div>\n")
    # Write a div that will contain the sidebar.
    # Add a link for every section title to the sidebar.
    html_file.write(f"<div class='sidenav'>\n")
    # Style the first link differently so it stands out as a header.
    html_file.write(f"<a class='header' href='#statisticalReport'>Statistical Report</a>\n")
    html_file.write(f"<a href='#porechopABI'>Porechop ABI</a>\n")
    html_file.write(f"<a href='#prowler'>Prowler Trimming</a>\n")
    html_file.write(f"<a href='#sacra'>SACRA</a>\n")
    html_file.write(f"<a href='#sacrafilt'>SACRA Filtered</a>\n")
    html_file.write(f"<a href='#diamond'>DIAMOND</a>\n")
    html_file.write("<a href='#bandage'>SPAdes - BANDAGE</a>\n")
    html_file.write(f"<a href='#mitos'>MITOS2</a>\n")
    # Loop over 2 lists one that contains the links to the titles & one with the visible title in the navbar
    for ids, item in zip(ids_headers, contig_headers):
        html_file.write(f"<a class='subtitle'href='#{ids}'>{item}</a>\n")
    # Close the sidebar div
    html_file.write("</div>\n")
    # Write the closing body/html tags at the very end of the report
    html_file.writelines(html_end)

# The with-statement closes the file automatically when the block ends
import Fastq as fq
import time as tm
import sys, os
from optparse import OptionParser

# -6 removes the ".fastq"
def generateOutputFilename(filename, outFolder, trimSpecs, Qcutoff, windowSize, \
                           minLen, maxDataMB, suffix):
  filebase = os.path.basename(filename)
  outputFileName = filebase[0:-6] + suffix
  outputFileName += trimSpecs + str(Qcutoff)
  outputFileName += "W" 
  outputFileName += str(windowSize) 
  outputFileName += "L" + str(minLen)
  outputFileName += "R" + str(maxDataMB)
  outputFileName = outFolder + '/'+ outputFileName
  return outputFileName
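# Worked example (hypothetical values): for filename "sample.fastq", outFolder "out",
# trimSpecs "LT-U0-S", Qcutoff 7, windowSize 100, minLen 100, maxDataMB 0 and suffix "Trim",
# the function returns "out/sampleTrimLT-U0-S7W100L100R0".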

def writeFile(filename, outFolder, seqs, trimSpecs, Qcutoff, windowSize, minLen, \
              maxDataMB, suffix, outMode=".fastq"):
  # add code suffixes to output file name for mode (D__/S__), window size (W__),
  # minimum length of read (L__), and data subsample taken from the raw file (R__)

  #generates the output filename from sourcefilename and specs
  outputFileName = generateOutputFilename(filename, outFolder, trimSpecs, Qcutoff, \
                                          windowSize, minLen, maxDataMB, \
                                          suffix)

  if outMode == ".fastq":
    fq.writeFastq(seqs,outputFileName)
  elif outMode == ".fasta":
    fq.writeFasta(seqs,outputFileName)
  else:
    print("error. output file type not supported")

# clips ends according to clipEnds setting
def clipFunction(clipEnds,seqs):
  if clipEnds == "LT":
    fq.removeAllLeadingAndTrailingNs(seqs)
  elif clipEnds == "L":
    fq.removeAllLeadingNs(seqs)
  elif clipEnds == "T":
    fq.removeAllTrailingNs(seqs)

# Trims contig list based on min len and desired #fragments  
def fragFunction(seqs, minLen, numFrags, fragMode):
  if fragMode == "F":
    seqs = fq.seqsToFragments(seqs, minLen, numFrags)    
  return seqs

def analyzeFastqFile(fileLines,minLen):
  seqQty = int(len(fileLines)/4)
  seqs = [None]*seqQty
  for i in range(seqQty):
#    print("i:",i)
    seqs[i] = fq.Fastq(fileLines[i*4:i*4+4],minLen)
  return seqs

def readBuffer(f, bufferSize):
  fileLines = f.readlines(bufferSize)
  extraLines = (4-len(fileLines)%4)%4
  for j in range(extraLines):
    fileLines.append(f.readline())
#  print(len(fileLines)/4)

  return fileLines
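  # Worked example of the padding above: if readlines() returns 10 lines,
  # extraLines = (4 - 10 % 4) % 4 = 2, so two more lines are read to complete
  # the last 4-line FASTQ record.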

if __name__ == "__main__":

  #use runMode=1 when running in Spyder, use 0 when running on the HPC
  runMode = 0

  if runMode == 0:
    parser = OptionParser()
    parser.add_option("-f", "--file", dest="filename",
                  help="the name of the file you want to trim, wihtout the folderpath", metavar="FILE")
    parser.add_option("-i", "--infolder", dest="inFolder",
                  help="the folderpath where your file to be trimmed is located (default = cwd)", metavar="DIRECTORY", default="")
    parser.add_option("-o", "--outfolder", dest="outFolder",
                  help="the folderpath where your want to save the trimmed file (default = cwd)", metavar="DIRECTORY", default="")
    parser.add_option("-w", "--windowsize", dest="windowSize",
                  help="change the size of the trimming window (default= 100bp)", metavar="INT", default=100)
    parser.add_option("-l", "--minlen", dest="minLen",
                  help="change the minimum acceptable numer of bases in a read (default=100)", metavar="INT", default=100)
    parser.add_option("-m", "--trimmode", dest="mode",
                  help="select trimming algorithm: S for static  or D for dynamic (default=)", metavar="[S/D]", default="S")
    parser.add_option("-q", "--qscore", dest="Qcutoff",
                  help="select the phred quality score trimming threshold (default=7)", metavar="INT", default=7)
    parser.add_option("-d", "--datamax", dest="maxDataMB",
                  help="select a maximum data subsample in MB (default=0, entire file))", metavar="INT", default=0)
    parser.add_option("-r", "--outformat", dest="outMode",
                  help="select output format of trimmed file (fastq or fasta) (default=.fastq)", metavar="[.fasta/.fastq]", default=".fastq")
    parser.add_option("-c", "--clip", dest="clipping",
                  help="select L to clip leading Ns, T to trim trialing Ns and LT to trim both (default=LT)", metavar="[L/T/LT]", default="LT")
    parser.add_option("-g", "--fragments", dest="fragments",
                  help="select fragmentation mode (default=U0)", metavar="[U0/F1/F2...]", default="U0")
    (options, args) = parser.parse_args()

    #the source file that will be trimmed
    filename = options.filename

    #folder in which the source file is located
    inFolder = options.inFolder

    #folder where the trimmed file will be saved
    outFolder = options.outFolder

    #size of the trim window
    windowSize = int(options.windowSize)

    #minimum length of each fastq sequence
    minLen = int(options.minLen)

    #trim mode (D=dynamic, S=static)
    mode = options.mode

    #Phred Quality score threshold
    Qcutoff = int(options.Qcutoff)

    #number of sequences to analyze (0=all)
    maxDataMB = int(options.maxDataMB)

    #output format of the trimmed file (.fastq or .fasta)
    outMode = options.outMode

    trimSpecs = options.clipping + "-" + options.fragments + "-" + options.mode
    trimSpecsList = [options.clipping, options.fragments,options.mode]

  elif runMode == 1:
    #the source file that will be trimmed
    filename = "brahSplits00-00.fastq"

    #folder in which the source file is located
    inFolder = ""

    #folder where the trimmed file will be saved
    outFolder = ""

    #size of the trim window
    windowSize = 1000

    #minimum length of each fastq sequence
    minLen = 1000

    #trim mode (D=dynamic, S=static)
    mode = "S"

    #Phred Quality score threshold
    Qcutoff = 0

    #number of sequences to analyze (0=all)
    maxDataMB = 100

    #output format of the trimmed file (.fastq or .fasta)
    outMode = ".fastq"

    #three trim settings:
    #[0]: leading / trailing trim flags L=leading, T=trailing
    #[1]: number of fragments to keep in fragmented mode (0=all)
    #[2]: fragmentation mode: N=no fragmentation, F=fragmented
    trimSpecs = "LT-U0-S"
    trimSpecsList = trimSpecs.split("-")
  elif runMode == 2:
    filename = sys.argv[1]
    inFolder = sys.argv[2]
    outFolder = sys.argv[3]
    windowSize = int(sys.argv[4])
    minLen = int(sys.argv[5])
    # mode = sys.argv[6][-1]
    Qcutoff = int(sys.argv[7])
    maxDataMB = int(sys.argv[8])
    outMode = sys.argv[9]
    # trimStats = sys.argv[10].split(".")
    trimSpecs = sys.argv[6]
    trimSpecsList = trimSpecs.split("-")



  #variables denoting the size of each buffer element: n*Mb = n megabytes
  n = 1
  Mb = 1024**2
  bufferSize = n*Mb

  #trim mode (D=dynamic, S=static)
  mode = trimSpecsList[2]

  #clip ends contains the leading/trailing trim flags
  clipEnds = trimSpecsList[0]

  #fragMode is the fragmentation flag
  fragMode = trimSpecsList[1][0] 

  #numFrags contains the number of frags to keep in fragmented mode
  numFrags = int(trimSpecsList[1][1:])
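  # Illustrative example: trimSpecsList[1] == "U0" gives fragMode "U" and numFrags 0
  # (keep all, no fragmentation), while e.g. "F2" gives fragMode "F" and numFrags 2.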

  #filepath combines the source file name with the source folder to
  #produce the source filepath
  filepath = inFolder + os.path.basename(filename)

  # performs a trim of the sequence list for the given trim specs
  # also removes leading and trailing Ns and breaks 
  # fragmented sequences into contigs  
  print("start:")

  #generates a new blank file with the appropriate filename
  suffix1 = "Trim"
  outputFileName = generateOutputFilename(filename, outFolder, trimSpecs, \
                                          Qcutoff, windowSize, minLen, \
                                          maxDataMB, suffix1)
  if not os.path.exists("ProwlerProcessed"):
    os.makedirs("ProwlerProcessed")

  f = open(outputFileName+outMode,"w+")
  f.close()  

  f = open(filepath, "r")
  startTime = tm.time()

  # reads n Mb of data, then adds between 0-3 lines to ensure the number of 
  # lines is a multiple of 4 (4 lines per fastq element)
  fileLines = readBuffer(f, bufferSize)

  counter=0
  time0=tm.time()
  timeCurrent=0
  maxCount = round(maxDataMB,0)

  #analyzes buffer seqs in lots until the End of File or max count is reached
  while (len(fileLines) > 0) and (maxCount == 0 or counter <= maxCount):
    print(counter)
    counter+=1
#    print("count: ", counter)
    timePrevious = timeCurrent
    timeCurrent = tm.time()-time0
    timeStep = timeCurrent-timePrevious
    timeAvg = (timeCurrent+timeStep)/(counter)

#    print("time current: ", round(timeCurrent,2), "s")
#    print("time step   : ", round(timeStep,2), "s")
#    print("time average: ", round(timeAvg,2), "s")

    #turns line strings into fastq elements and collates them into a list
    seqs1 = analyzeFastqFile(fileLines, minLen)

    #performs rolling trim on all buffer seqs and returns coverage
    trimCoverage = fq.rollingTrimAllSeqs(seqs1,Qcutoff,windowSize,mode, minLen)

    # clips ends according to clipEnds setting
    clipFunction(clipEnds, seqs1)

    # Trims contig list based on min len and desired #fragments  
    seqs1 = fragFunction(seqs1, minLen, numFrags, fragMode)

    #writes current trimmed buffer seqs to file
    writeFile(filename, outFolder, seqs1, trimSpecs, Qcutoff, windowSize, minLen, \
              maxDataMB, suffix1, outMode)

    # reads n Mb of data, then adds between 0-3 lines to ensure the number of 
    # lines is a multiple of 4 (4 lines per fastq element)
    fileLines = readBuffer(f, bufferSize)

  f.close()

  #calculates the time taken to analyze all seqs
  endTime = tm.time()
  trimTime = endTime - startTime

  if not os.path.exists("ProwlerProcessed"):
    os.makedirs("ProwlerProcessed")

  filebase = os.path.basename(filename)
  #the name of the csv file that will contain the trim time and coverage stats
  fileOutName = outFolder + '/' + filebase[0:-6] + "Stats"+str(trimSpecs)+str(Qcutoff) \
             +"W"+str(windowSize)+"L"+str(minLen)+"R"+str(maxDataMB)+".csv"

  # the run specifications plus the trim time and coverage stats for the run
  trimOut = (suffix1 + "\t"+ trimSpecs +"\t"+str(Qcutoff)+"\t"+str(windowSize) \
             +"\t"+str(minLen)+"\t"+str(maxDataMB)+"\t"+str(trimCoverage) \
             +"\t"+str(trimTime)+"\n")

  #outputs time statistics to csv
  fo = open(fileOutName, "w+")
  fo.write(trimOut)
  fo.close()
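
As a rough illustration (the script name and the file/folder names below are placeholders, not taken from the repository), the Prowler trimming script above can be invoked with the OptionParser flags it defines, for example:

python prowler_trim.py -f sample.fastq -i fastq_folder/ -o ProwlerProcessed -w 100 -l 100 -m S -q 7 -d 0 -r .fastq -c LT -g U0

Note that the input folder is concatenated directly with the file name, so it should end with a path separator, and the chosen output folder is prepended to the generated file name.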