A microbial genomic analysis pipeline used by the Djordjevic Lab


About

This repo contains a genomic analysis pipeline used mostly for analysing E. coli genomes, though it also sees use on Salmonella and some other species. Note, however, that running it on other species may cause certain tools to break or may otherwise limit how useful the output is.

Disclaimer

While I have made every effort to ensure these scripts produce reliable and accurate output, it is important that you spot check your outputs to ensure they are correct :)

The documentation is still a work in progress but if you need any help don't hesitate to post an issue :)

Modules/Workflows

This pipeline has three main modules:

  1. Quality control

  2. Genotyping

  3. Phylogenetics

All individual rules/analyses can be turned on or off in the configuration file.

1. Quality control

Reads are first filtered with fastp before being assembled with shovill. Assembly stats are analysed using assembly-stats. Species ID is determined using Kraken and Bracken, and a contamination and completeness screen is carried out using CheckM. GUNC is also run, though I usually disable it as the run time can be long and I haven't yet found the output especially helpful for my purposes, though this may change.

This information is then consolidated into a tab-delimited QC report.
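
As a rough guide, the individual QC steps boil down to standalone commands along the following lines. This is only a sketch with placeholder filenames and database paths; in practice the pipeline runs these through Snakemake with the parameters set in the config file.

# Read filtering and assembly (filenames are placeholders)
fastp -i sample_R1.fastq.gz -I sample_R2.fastq.gz -o sample_R1.trim.fastq.gz -O sample_R2.trim.fastq.gz
shovill --minlen 200 --outdir sample_shovill --R1 sample_R1.trim.fastq.gz --R2 sample_R2.trim.fastq.gz
# Assembly metrics
assembly-stats -t sample_shovill/contigs.fa > sample_assembly_stats.tsv
# Species ID from the filtered reads (the kraken database path is a placeholder)
kraken2 --db /path/to/kraken_db --use-names --report sample.kreport --output sample.kraken sample_R1.trim.fastq.gz sample_R2.trim.fastq.gz
bracken -d /path/to/kraken_db -i sample.kreport -o sample.bracken -r 100 -l S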

2. Genotyping

The term genotyping is used a bit loosely here; this step performs a variety of analyses using:

  • Abricate to screen against virulence, AMR, IS element and other custom gene databases. New databases can be added and registered in the configuration file for use.

  • Antibiotic resistance gene detection is performed using abritamr

  • Point mutations associated with AMR are detected using CGE's pointfinder (there is some redundancy here with abritamr which now has this functionality)

  • MLST is performed using mlst

  • Plasmid MLST is performed using pMLST

  • Fimbrial adhesins are typed using fimtyper

  • Phylogroup detection is performed with ClermonTyping and EZClermont

  • Serotyping is performed with ECtyper

  • Salmonella pathogenicity islands are detected using SPIfinder

  • Genome annotation is performed with Prokka

  • cgMLST is performed with cgMLSTfinder - this will be changed to chewBBACA

  • Plasmid screening is performed using a slightly older version of abricate (screening of large reference genes seemed to break after 0.9.8).

Most of these rules output tab-delimited summary files relevant to each rule/analysis; a few also feed into downstream R scripts, but not many.
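
To give a sense of what these rules do under the hood, a few of them reduce to single commands roughly like the ones below. This is a sketch with placeholder file names; the actual rules add logging, thread counts and output naming via the config.

# Gene screening against one of the abricate databases under resources/dbs/abricate
abricate --nopath --datadir resources/dbs/abricate --db vfdb sample.fasta > sample_vfdb.tab
# Multi-locus sequence typing
mlst sample.fasta > sample_mlst.tab
# E. coli serotyping
ectyper -i sample.fasta -o sample_ectyper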

3. Phylogenetics

This workflow runs a pangenome-derived phylogenetic analysis utilising:

  • Prokka

  • Panaroo

  • SNP-sites

  • IQtree

Pairwise SNP distances are also determined using SNP-dists.
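
Conceptually the tree-building steps chain together roughly as follows. This is an illustrative sketch only - the file names, Panaroo clean mode and IQ-TREE bootstrap settings shown here are assumptions rather than the pipeline's exact parameters.

# Pangenome from the Prokka GFF annotations
panaroo -i *.gff -o panaroo_out --clean-mode strict -a core
# Extract SNP sites from the core gene alignment
snp-sites -c panaroo_out/core_gene_alignment.aln > core_snps.aln
# Maximum-likelihood tree with model selection and ultrafast bootstrap
iqtree -s core_snps.aln -m MFP -bb 1000
# Pairwise SNP distances
snp-dists core_snps.aln > snp_distances.tsv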

Installation and setup

Because there are many dependencies to install across the different analytical tools, the first run of the pipeline can take some time while everything is installed via conda. For this reason I have added a script, setup_environments.sh, which installs the environments for the three modules.

# Clone the git repository and enter it
git clone https://github.com/maxlcummins/pipelord.git
cd pipelord
# Setup and installation - this may take some time
sh setup_environments.sh

Configuration

Configuration of the snakemake workflow is done via the configuration file in config. It is a good idea to make a new one by copying the template and editing it accordingly, then specifying it in the snakemake command (e.g. --configfile config/new_template.yaml).
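
For example, to create a new config and check it with a dry run (the file name and the dry-run step are just suggestions):

# Copy the template and edit the copy to suit your samples and analyses
cp config/config_template.yaml config/new_template.yaml
# ...edit config/new_template.yaml in your editor of choice...
# Dry run to confirm which rules would be executed with the new config
snakemake -n -p --configfile config/new_template.yaml -s workflows/QC_workflow.smk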

Usage

Note: change the --configfile path in the commands below if you created a new config file.

QC Workflow

#Run job in the background
nohup snakemake --resources mem_mb={max_memory_per_rule} -j 20 -p --use-conda --configfile config/config_template.yaml -s workflows/QC_workflow.smk > QC.log 2>&1 &
#Save the job ID so you can cancel it later if need be
PROCESS_ID=$!
echo "JOB_ID =" "$PROCESS_ID" > QC_jobID.txt

Genotyping

#Run job in the background
nohup snakemake --resources mem_mb={max_memory_per_rule} -j 20 -p --use-conda --configfile config/config_template.yaml -s workflows/Snakefile > genotype.log 2>&1 &
#Save the job ID so you can cancel it later if need be
PROCESS_ID=$!
echo "JOB_ID =" "$PROCESS_ID" > genotype_jobID.txt

Pangenomic Phylogenetics

#Run job in the background
nohup snakemake --resources mem_mb={max_memory_per_rule} -j 20 -p --use-conda --configfile config/config_template.yaml -s workflows/Treebuild_panaroo.smk > treebuild.log 2>&1 &
#Save the job ID so you can cancel it later if need be
PROCESS_ID=$!
echo "JOB_ID =" "$PROCESS_ID" > treebuild_jobID.txt

Database update

To update your databases you can use a command like the following. Make sure the database directory path is correct and change the database name as appropriate.

abricate-get_db --dbdir resources/dbs/abricate --db vfdb --force
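
If you maintain several abricate databases you can update them in a loop; abricate --list will show what is currently installed. The database names below are examples - use whichever databases your config actually points at.

# Check which databases abricate can currently see in this directory
abricate --list --datadir resources/dbs/abricate
# Update a set of databases (names here are examples)
for db in vfdb card plasmidfinder; do
    abricate-get_db --dbdir resources/dbs/abricate --db "$db" --force
done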

Code Snippets

import pandas as pd
import os.path
import re

# Directing python to the input from snakemake
abricate_summaries = snakemake.input["abricate_summary"]

#Create an empty list
dataframes = []

for file in abricate_summaries:
    df = pd.read_csv(file, sep="\t", index_col=None)
    dataframes.append(df)

#Concatenate our dfs into a single one
combined = pd.concat(dataframes)

#Change the filename column to be name
combined.rename(columns={'#FILE':'name'}, inplace=True)

#Write our output file to text
combined.to_csv(snakemake.output[0], sep="\t", index=False)
import pandas as pd
import os.path
import re

# Directing python to the input from snakemake
amrfinder = snakemake.input["amrfinder"]

#Create an empty list
dataframes = []

for file in amrfinder:
    df = pd.read_csv(file, sep="\t", index_col=None)
    df["name"] = file
    df["name"] = df["name"].str.replace("/amrfinder.out", "", regex=False)
    df["name"] = df["name"].str.replace(".*\/", "", regex=True)
    dataframes.append(df)

#Concatenate our dfs into a single one
combined = pd.concat(dataframes)

#Move the name column to the start
combined = combined[['name'] + [ col for col in combined.columns if col != 'name' ]]

#Write our output file to text
combined.to_csv(snakemake.output["abritamr_resistance"], sep="\t", index=False)
import pandas as pd
import os.path
import re

#### Process our resistance data ####

# Directing python to the input from snakemake
abritamr_resistance = snakemake.input["abritamr_resistance"]

#Create an empty list
dataframes = []

for file in abritamr_resistance:
    df = pd.read_csv(file, sep="\t", index_col=None)
    dataframes.append(df)

#Concatenate our dfs into a single one
combined = pd.concat(dataframes)

#Change the filename column to be name
combined.rename(columns={'Isolate':'name'}, inplace=True)

#Trim characters before the samplename remaining from the path
combined['name'].replace({r".*\/" : ""}, inplace = True, regex = True)

#Write our output file to text
combined.to_csv(snakemake.output["abritamr_resistance"], sep="\t", index=False)


#### Process our virulence data ####

# Directing python to the input from snakemake
abritamr_virulence = snakemake.input["abritamr_virulence"]

#Create an empty list
dataframes = []

for file in abritamr_virulence:
    df = pd.read_csv(file, sep="\t", index_col=None)
    dataframes.append(df)

#Concatenate our dfs into a single one
combined = pd.concat(dataframes)

#Change the filename column to be name
combined.rename(columns={'Isolate':'name'}, inplace=True)

#Trim characters before the samplename remaining from the path
combined['name'].replace({r".*\/" : ""}, inplace = True, regex = True)

#Write our output file to text
combined.to_csv(snakemake.output["abritamr_virulence"], sep="\t", index=False)

#### Process our partial data ####

# Directing python to the input from snakemake
abritamr_partials = snakemake.input["abritamr_partials"]

#Create an empty list
dataframes = []

for file in abritamr_partials:
    df = pd.read_csv(file, sep="\t", index_col=None)
    dataframes.append(df)

#Concatenate our dfs into a single one
combined = pd.concat(dataframes)

#Change the filename column to be name
combined.rename(columns={'Isolate':'name'}, inplace=True)

#Trim characters before the samplename remaining from the path
combined['name'].replace({r".*\/" : ""}, inplace = True, regex = True)

#Write our output file to text
combined.to_csv(snakemake.output["abritamr_partials"], sep="\t", index=False)
import pandas as pd
import os.path

# Directing python to the input from snakemake
assembly_stats_summaries = snakemake.input["assembly_stats_summary"]
#Create an empty list
dataframes = []

#For each of our mlst files
for file in assembly_stats_summaries:
    #Read it in as df
    df = pd.read_csv(file, sep="\t", index_col=None)
    #Add the df to our list of dfs
    dataframes.append(df)

#Concatenate our dfs into a single one
combined = pd.concat(dataframes)

#Change the filename column to be name
combined.rename(columns={'filename':'name'}, inplace=True)

#Remove the path related info from our name column
combined["name"] = combined["name"].str.replace(".*/", "", regex=True)
combined["name"] = combined["name"].str.replace(".fasta", "", regex=False)
#Write our output file to text
combined.to_csv(snakemake.output[0], sep="\t", index=False)
import pandas as pd
import os.path

# Directing python to the input from snakemake
bracken_reports = snakemake.input["bracken_reports"]


dataframes = []
for file in bracken_reports:
    df = pd.read_csv(file, sep="\t")
    df["source"] = os.path.basename(file)
    df["source"] = df["source"].str.replace(".bracken.txt", "", regex=False)
    dataframes.append(df)

combined = pd.concat(dataframes)

combined = combined[
    [
        "source",
        "name",
        "taxonomy_id",
        "taxonomy_lvl",
        "kraken_assigned_reads",
        "added_reads",
        "new_est_reads",
        "fraction_total_reads",
    ]
]

combined.columns = [
    "name",
    "scientific_name",
    "taxonomy_id",
    "taxonomy_lvl",
    "kraken_assigned_reads",
    "added_reads",
    "new_est_reads",
    "fraction_total_reads",
]

combined.to_csv(snakemake.output[0], sep="\t", index=False)
import pandas as pd
import os.path
import re

# Directing python to the input from snakemake
clermontyping_summaries = snakemake.input['clermontyping']

print(clermontyping_summaries)

#Create an empty list
dataframes = []

#Initialise a for loop for processing our pMLST outputs
for file in clermontyping_summaries:
    #Read it in as df
    df = pd.read_csv(file, sep="\t", header=None)
    #Assign column names to our input, which has no header row
    df.columns = ['name', 'genes_detected', 'quadruplex_results', 'alleles_for_C_or_G', 'clermontyping_phylogroup', 'mashfile']
    #Append our dataframes
    dataframes.append(df)

#Concatenate our dfs into a single one
combined = pd.concat(dataframes)

#Replace unwanted characters in our dataframe
combined['name'] = combined['name'].replace('\.fasta', '', regex=True)

#Write our simple file to text
combined.to_csv(snakemake.output[0], sep="\t", index=False)
import pandas as pd
import os.path

# Directing python to the input from snakemake
ectyper_summaries = snakemake.input["ectyper_summary"]

#Create an empty list
dataframes = []

#For each of our ectyper files
for file in ectyper_summaries:
    print(file)
    #Read it in as df
    df = pd.read_csv(file, sep="\t", header=0, index_col=None)
    #Add the df to our list of dfs
    dataframes.append(df)

#Concatenate our dfs into a single one
combined = pd.concat(dataframes)

#Write our output file to text
combined.to_csv(snakemake.output[0], sep="\t", index=False)
import pandas as pd
import os.path
import re

# Directing python to the input from snakemake
ezclermont_summaries = snakemake.input['ezclermont_summary']

print(ezclermont_summaries)

#Create an empty list
dataframes = []

#Initialise a for loop for processing our pMLST outputs
for file in ezclermont_summaries:
    #Read it in as df
    df = pd.read_csv(file, sep="\t", header=None, on_bad_lines='skip')
    #Assign a column name to our input which isnt actually a CSV and therefore has a heading of '0'
    df.columns = ['datacol']
    name = df[df.datacol.str.contains('Reading in sequence')]  
    #Trim our filenames to use as a name column
    name.datacol = name.datacol.str.replace('Reading in sequence(s) from', '', regex=False)
    name.datacol = name.datacol.str.replace('.fasta', '', regex=False)
    #Create an empty dataframe
    df2 = pd.DataFrame()
    #Create our name column
    df2['name'] = name
    #Save the result for trpBA
    df2['trpBA'] = df[df.datacol.str.contains('trpBA_control')].iloc[0,0]
    print(df2['trpBA'])
    #Save the result for virA
    df2['virA'] = df[df.datacol.str.contains('virA:')].iloc[0,0]
    #Save the result for TspE4
    df2['TspE4'] = df[df.datacol.str.contains('TspE4:')].iloc[0,0]
    #Save the result for arpA:
    df2['arpA'] = df[df.datacol.str.contains('arpA:')].iloc[0,0]
    #Save the result for chu:
    df2['chu'] = df[df.datacol.str.contains('chu:')].iloc[0,0]
    #Save the result for yjaA:
    df2['yjaA'] = df[df.datacol.str.contains('yjaA:')].iloc[0,0]
    #Save the result for arpA:
    df2['arpA'] = df[df.datacol.str.contains('arpA:')].iloc[0,0]
    #Save the result for the Clermont_type:
    df2['Clermont_type'] = df[df.datacol.str.contains('Clermont type:')].iloc[0,0]
    #Append our dataframes
    dataframes.append(df2)

#Concatenate our dfs into a single one
combined = pd.concat(dataframes)

#Select our columns we wish to change
colnames = combined.select_dtypes(['object']).columns

#Replace unwanted characters in our dataframe
combined[colnames] = combined[colnames].replace('.*: ', '', regex=True)

#Replace dashes and pluses with 1s and 0s
combined[colnames] = combined[colnames].replace('+', 1, regex=False)
combined[colnames] = combined[colnames].replace('-', 0, regex=False)

#Write our simple file to text
combined.to_csv(snakemake.output[0], sep="\t", index=False)
import pandas as pd
import os.path

# Directing python to the input from snakemake
fimtyper_summaries = snakemake.input["fimtyper_summary"]

#Read in out dataframe and add a name column
dataframes = []
for file in fimtyper_summaries:
    df = pd.read_csv(file, sep="\t")
    if 'Contig' in df.columns:
        ## If detect No FimH type found skip sample
        ## If detect "Please contact curator, if new fimH-type" skip line
        df["name"] = os.path.basename(file)
        df["name"] = df["name"].str.replace("_named.txt", "", regex=False)
        dataframes.append(df)
    else:
        print("No fimH type found in "+str(os.path.basename(file)))

#Combine our fimtyper files
combined = pd.concat(dataframes)

#Name our columns
combined = combined[
    [
        "name",
        "Fimtype",
        "Identity",
        "Query/HSP",
        "Contig",
        "Position in contig",
        "Accession no."
    ]
]

#Reorder our columns (probably redundant now)
combined = combined[ ['name'] + [ col for col in combined.columns if col != 'name' ] ]

#Remove accession number column (usually empty)
combined = combined.drop('Accession no.', axis=1)

#Add a new column for query length and HSP (High-scoring segment pair i.e. match) length
combined["Query length"] = combined["Query/HSP"].str.replace("/.*", "", regex=True)
combined["HSP length"] = combined["Query/HSP"].str.replace(".*/", "", regex=True)

#Add an asterisk to novel alleles which have a non 100% identity or a length other than 489
combined.loc[combined['Identity'] != 100.0, 'Fimtype'] = combined.loc[combined['Identity'] != 100.0, 'Fimtype']+"*"
combined.loc[combined['Query length'] != combined['HSP length'], 'Fimtype'] = combined.loc[combined['Query length'] != combined['HSP length'], 'Fimtype']+"*"

#Remove instances of double asterisks
combined["Fimtype"] = combined["Fimtype"].str.replace("**", "*", regex=False)

#Define pattern of lines we want to remove (Messages from devs to report novel alleles)
false_line_pattern = 'Please'

#Identify rows which are lines we want to remove
pattern_filter = combined['Fimtype'].str.contains(false_line_pattern)

#Remove unwanted rows
combined = combined[~pattern_filter]

#Write to file
combined.to_csv(snakemake.output[0], sep="\t", index=False)
import pandas as pd
import os.path
import re

# Directing python to the input from snakemake
pmlst_summaries = snakemake.input['pMLST_summary']

#Create an empty list
dataframes = []

#Initialise a for loop for processing our pMLST outputs
for file in pmlst_summaries:
    #Read it in as df
    df = pd.read_csv(file, sep="\t", header=None, on_bad_lines='skip')
    #Trim our filenames to use as a name column
    name = re.sub(".out/results.txt","", file)
    name = re.sub(".*/", "", name)
    #Assign a column name to our input which isnt actually a CSV and therefore has a heading of '0'
    df.columns = ['datacol']
    #Identify pMLST data pertaining to IncF plasmids
    IncF_results = df['datacol'].str.contains('IncF RST').any()
    #Only process IncF data.
    #Wasteful loop iterations but snakemake was being painful so I decided on feeding all pMLST input rather than just IncF
    if IncF_results:
        #Pull out the columns with F replicons
        df = df[df.datacol.str.contains('^FI')]
        #Convert multiple spaces to commas to properly delimit our file
        df['datacol'] = df['datacol'].str.replace('  +', ',', regex=True)
        #Split our string by comma delimiter into different columns
        df = df['datacol'].str.split(",", expand=True)
        #Assign our name column
        df['name'] = name
        #Append our dataframes
        dataframes.append(df)

#Concatenate our dfs into a single one
combined = pd.concat(dataframes)

#Assign column names to the split pMLST fields
combined.columns = [
    'Locus',
    'Identity',
    'Coverage',
    'Alignment Length',
    'Allele Length',
    'Gaps',
    'Allele',
    'name'
    ]

#Change column order
combined = combined[[
    'name',
    'Locus',
    'Identity',
    'Coverage',
    'Alignment Length',
    'Allele Length',
    'Gaps',
    'Allele'
    ]
]

#Simplify when no allele is found with a dash
combined['Allele'] = combined['Allele'].str.replace('No hit found', '-', regex=False)

#Create a column for a simplified allele call
combined['Simple Allele'] = combined['Allele']
combined['Simple Allele'] = combined['Simple Allele'].str.replace('FIIS', 'S', regex=True)
combined['Simple Allele'] = combined['Simple Allele'].str.replace('FIIY', 'Y', regex=True)
combined['Simple Allele'] = combined['Simple Allele'].str.replace('FIIK', 'K', regex=True)
combined['Simple Allele'] = combined['Simple Allele'].str.replace('FI', '', regex=True)
combined['Simple Allele'] = combined['Simple Allele'].str.replace('I', 'F', regex=True)
combined['Simple Allele'] = combined['Simple Allele'].str.replace('_', '', regex=True)

#Separate our alleles to combine them in the correct order for IncF RST
#Sort by allele name. If they are instead sorted in order of discovery (via contig number) then we will see inconsistencies between samples
FICs = combined[combined['Locus'] == 'FIC'].sort_values(by='Simple Allele', ascending=False)
FIIs = combined[combined['Locus'] == 'FII'].sort_values(by='Simple Allele', ascending=False)
FIKs = combined[combined['Locus'] == 'FIIK'].sort_values(by='Simple Allele', ascending=False)
FISs = combined[combined['Locus'] == 'FIIS'].sort_values(by='Simple Allele', ascending=False)
FIYs = combined[combined['Locus'] == 'FIIY'].sort_values(by='Simple Allele', ascending=False)

FIA_type_df = combined[combined['Locus'] == 'FIA'].sort_values(by='Simple Allele', ascending=False)

FIB_type_df = combined[combined['Locus'] == 'FIB'].sort_values(by='Simple Allele', ascending=False)

#Combine our dataframes again
F_type_df = pd.concat([FIIs, FICs, FIKs, FISs, FIYs])
F_type_df = F_type_df[F_type_df['Simple Allele'] != '-']

#Group by name and collapse our FIIs, FICs, FIKs, FISs and FIYs with a slash separator
F_type = F_type_df.groupby('name').apply(lambda x: '/'.join(x['Simple Allele'])).reset_index()

#Assign our column names
F_type.columns = ['name', 'F type']

#Group by name and collapse our FIBs with a slash separator
FIA_type = FIA_type_df.groupby('name').apply(lambda x: '/'.join(x['Simple Allele'])).reset_index()

#Assign our column names
FIA_type.columns = ['name', 'FIA type']

#Group by name and collapse our FIBs with a slash separator
FIB_type = FIB_type_df.groupby('name').apply(lambda x: '/'.join(x['Simple Allele'])).reset_index()

#Assign our column names
FIB_type.columns = ['name', 'FIB type']

#Join our columns
IncF_RSTs = pd.merge(F_type, FIA_type, on='name',how='outer')
IncF_RSTs = pd.merge(IncF_RSTs, FIB_type, on='name',how='outer')

#Replace NaNs (for missing alleles) with a dash
IncF_RSTs = IncF_RSTs.fillna('-')

#Replace the cell contents for F types to make them better reflect null values 
IncF_RSTs['F type'] = IncF_RSTs['F type'].str.replace('-', 'F-', regex=True)
IncF_RSTs['FIA type'] = IncF_RSTs['FIA type'].str.replace('-', 'A-', regex=True)
IncF_RSTs['FIB type'] = IncF_RSTs['FIB type'].str.replace('-', 'B-', regex=True)

#Generate an IncF RST column by combining the alleles
IncF_RSTs['IncF RST'] = IncF_RSTs['F type']+':'+IncF_RSTs['FIA type']+':'+IncF_RSTs['FIB type']

#Make a simple IncF RST dataframe to bind to our full IncF RST dataframe
IncF_RSTs_simple = IncF_RSTs[['name', 'IncF RST']]

#Join our columns
combined = pd.merge(combined, IncF_RSTs_simple, on='name',how='outer')

#Write our simple file to text
IncF_RSTs.to_csv(snakemake.output[0], sep="\t", index=False)

#Write our detailed file to text
combined.to_csv(snakemake.output[1], sep="\t", index=False)
import pandas as pd
import os.path

# Directing python to the input from snakemake
mlst_summaries = snakemake.input["mlst_summary"]

#Create an empty list
dataframes = []

#For each of our mlst files
for file in mlst_summaries:
    #Read it in as df
    df = pd.read_csv(file, sep="\t", header=None, index_col=None)
    #Add the df to our list of dfs
    dataframes.append(df)

#Concatenate our dfs into a single one
combined = pd.concat(dataframes)

#Extract the number of columns
numCols = combined.shape[1]

#Define the first three column names
mlst_colnames  =  [
    'name',
    'scheme',
    'ST'
]

#Create a list of column names for our alleles
allele_colnames = ["allele"+str(x) for x in range(1, numCols-2)]

#Combine our lists of column names
mlst_colnames = mlst_colnames + allele_colnames

#Assign our list of column names
combined.columns = mlst_colnames

#Remove the path related info from our name column
combined["name"] = combined["name"].str.replace(".*/", "", regex=True)
combined["name"] = combined["name"].str.replace(".fasta", "", regex=False)

#Fill empty columns with 0
combined = combined.fillna(0)

#Write our output file to text
combined.to_csv(snakemake.output[0], sep="\t", index=False)
import pandas as pd
import os.path
import re

# Directing python to the input from snakemake

mobsuite = snakemake.input["mobsuite"]

# Create an empty list

dataframes = []

# Initialise a for loop for processing our pMLST outputs

for file in mobsuite:

    # Read it in as df

    df = pd.read_csv(file, sep="\t", header=0, on_bad_lines="skip")

    # Append our dataframes
    dataframes.append(df)

# Concatenate our dfs into a single one
combined = pd.concat(dataframes)

# Write our output file to text

combined.to_csv(snakemake.output[0], sep="\t", index=False)
import pandas as pd
import os.path
import re

# Directing python to the input from snakemake
pmlst_summaries = snakemake.input["pMLST_summary"]

#Create an empty list
dataframes = []

#Initialise a for loop for processing our pMLST outputs
for file in pmlst_summaries:
    #Read it in as df
    df = pd.read_csv(file, sep="\t", header=None, on_bad_lines='skip')
    #Trim our filenames to use as a name column
    name = re.sub(".out/results.txt","", file)
    name = re.sub(".*/", "", name)
    #Assign our name column
    df['name'] = name
    #Reassign name columns. The input isnt actually a CSV so we label the original input as data and redundantly reassign our name column for the names
    df.columns = ['datacol', 'name']
    #Isolate pMLST scheme info
    df_profile = df[df.datacol.str.contains('pMLST profile')]
    #Assign colnames to our new df with pMLST scheme info
    df_profile.columns = ['pMLST_scheme', 'name']
    #Isolate pMLST info    
    df_pMLST = df[df.datacol.str.contains('Sequence Type')]
    #Assign colnames to our new df with pMLST info
    df_pMLST.columns = ['Sequence_type', 'name']
    #Merge our dataframes with this new info
    df2 = pd.merge(df_profile, df_pMLST)
    #Check if there are any lines which indicate that no MLST loci were found
    no_loci = df.datacol.str.contains('No MLST loci was found').any()
    #Check if there are any lines which indicate that either a novel ST was found or no ST was found
    #pMLST tool doesn't discriminate these two particularly well on this line.
    unknown_pMLST = df.datacol.str.contains('Unknown').any()
    #If the file indicates no MLST loci were found AND that the genome has a novel or NA ST then assign it as 'None'
    if no_loci > 0 and unknown_pMLST > 0:
        df2.Sequence_type = 'None'
    #If the file indicates that some MLST loci were found but the tool still couldnt assign a pMLST then assign it as 'Novel'
    elif no_loci == 0 and unknown_pMLST > 0:
        df2.Sequence_type = 'Novel'
    #Pull out lines which indicate the closest STs
    df_nearest_pMLST = df[df.datacol.str.contains('Nearest ST')]
    #Assign column names to our new dataframe with nearest STs
    df_nearest_pMLST.columns = ['Nearest_ST', 'name']
    #Trim unwanted chars preceding nearest ST data
    df_nearest_pMLST.Nearest_ST = df_nearest_pMLST['Nearest_ST'].str.replace("Nearest ST: ","", regex=False)
    #Trim unwanted chars preceding nearest ST data. Same again but for when multiple matches of equal weight are identified
    df_nearest_pMLST.Nearest_ST = df_nearest_pMLST['Nearest_ST'].str.replace("Nearest STs: ","", regex=False)
    #Merge the dataframe with nearest ST data with our info on scheme and ST
    df2 = pd.merge(df2, df_nearest_pMLST, on='name', how='outer')
    #Trim unwanted chars preceding scheme info
    df2.pMLST_scheme = df2.pMLST_scheme.str.replace("pMLST profile: ","", regex=False)
    #Trim unwanted chars - after a space is just the word pmlst
    df2.pMLST_scheme = df2.pMLST_scheme.str.replace(" .*","", regex=True)
    #Trim unwanted chars in IncA/C scheme - slashes are naughty characters
    df2.pMLST_scheme = df2.pMLST_scheme.str.replace("/","", regex=False)
    #Convert schemes lowercase
    df2.pMLST_scheme = df2.pMLST_scheme.str.lower()
    #Trim unwanted chars preceding ST
    df2.Sequence_type = df2.Sequence_type.str.replace("Sequence Type: ","", regex=False)
    #Trim unwanted chars before and after our IncF RSTs
    df2.Sequence_type = df2.Sequence_type.str.replace("[","", regex=False)
    df2.Sequence_type = df2.Sequence_type.str.replace("]","", regex=False)
    #Append our dataframes
    dataframes.append(df2)

#Concatenate our dfs into a single one
combined = pd.concat(dataframes)

#Change column order
combined = combined[[
    'name',
    'pMLST_scheme',
    'Sequence_type',
    'Nearest_ST'
    ]
]
#Write our output file to text
combined.to_csv(snakemake.output[0], sep="\t", index=False)
import pandas as pd
import os.path

# Directing python to the input from snakemake
assembly_stats_summaries = snakemake.input["pointfinder_summary"]
#Create an empty list
dataframes = []

#For each of our mlst files
for file in assembly_stats_summaries:
    #Read it in as df
    df = pd.read_csv(file, sep="\t", index_col=None)
    df["name"] = df["name"].str.replace("_blastn_results.tsv", "", regex=False)
    df["name"] = df["name"].str.replace(".*/", "", regex=True)
    #Add the df to our list of dfs
    dataframes.append(df)

#Concatenate our dfs into a single one
combined = pd.concat(dataframes)

#Write our output file to text
combined.to_csv(snakemake.output[0], sep="\t", index=False)
import pandas as pd
import os.path

# Directing python to the input from snakemake
spifinder_summaries = snakemake.input["spifinder_summary"]


dataframes = []
for file in spifinder_summaries:
    df = pd.read_csv(file, sep="\t")
    df["source"] = os.path.basename(file)
    df["source"] = df["source"].str.replace(".txt", "", regex=False)
    dataframes.append(df)

combined = pd.concat(dataframes)

combined = combined[
    [
        "Database",
        "SPI",
        "Identity",
        "Query / Template length",
        "Contig",
        "Position in contig",
        "Organism",
        "Insertion Site",
        "Category Function",
        "Note",
        "Accession number",
        "source"
    ]
]

combined.columns = [
    "Database",
    "SPI",
    "Identity",
    "Query / Template length",
    "Contig",
    "Position in contig",
    "Organism",
    "Insertion Site",
    "Category Function",
    "Note",
    "Accession number",
    "name"
]

combined = combined[ ['name'] + [ col for col in combined.columns if col != 'name' ] ]

combined.to_csv(snakemake.output[0], sep="\t", index=False)
shell:
    """
    amrfinder -U &> {log}
    touch {output}
    """
shell:
    """
    abritamr run --prefix {params.abritamr_out}/{wildcards.sample} --contigs {input.contigs}  --jobs {threads} --species {params.species} &> {log}       
    """
script:
    "../../scripts/combine_abritamr.py"
script:
    "../../scripts/combine_abritamr_amrfinder_raw.py"
shell:
    """
    mkdir -p {output.cgmlstfinder_out}
    resources/tools/cgmlstfinder/cgMLST.py --shared_memory -s ecoli -db /projects/AusGEM/databases/cgmlstfinder_db -o {output.cgmlstfinder_out} -k $CONDA_PREFIX/bin/kma {input.r1_filt} {input.r2_filt} 
    cp {output.cgmlstfinder_out}/ecoli_summary.txt {output.cgmlst_summary}
    """
shell:
    "ectyper -i {input} -o {output[0]} &> {log}"
script:
    "../../scripts/combine_ectyper.py"
shell:
    """
    python3 resources/tools/fimtyper/VALIDATE_DB {input} 2> {log}
    touch {output}
    """
shell:
    """
    perl resources/tools/fimtyper/fimtyper.pl -d resources/tools/fimtyper/fimtyper_db -b $CONDA_PREFIX -i {input.assembly} -o {params.output_dir} -k 95.0 -l 0.6 &> {log}
    cp {output.raw} {output.renamed}
    """
shell:
    """
    sed '0,/^FimH type -/d' {input} > {output.trimmed}
    awk 'NR == 1 {{print "name\t" $0; next;}}{{print FILENAME "\t" $0;}}' {output.trimmed} > {output.named}
    """
script:
    "../../scripts/combine_fimtyper.py"
shell:
    """
    if [ ! -d "resources/dbs/dfast/protein" ]; then echo 'dfast protein dbs directory not located, downloading to resources/dbs/dfast/protein...'
    #git clone https://github.com/nigyta/dfast_core.git
    dfast_file_downloader.py --dbroot resources/dbs/dfast --protein dfast
    fi
    if [ ! -d "resources/dbs/dfast/hmm" ]; then echo 'dfast hmm dbs directory not located, downloading to resources/dbs/dfast/hmm...'
    dfast_file_downloader.py --dbroot resources/dbs/dfast --cdd Cog --hmm TIGR
    fi
    """
shell:
    """
    dfast -g {input.assembly} --cpu {threads} --dbroot {input.database} -o {output} --use_locustag_as_gene_id
    """
shell:
    """
    mv {input}/genome.gff {output}
    """
shell:
    """
    prokka --cpus {threads} --outdir {output} {input.assembly} 
    """
shell:
    """
    mv {input}/PROKKA_*.gff {output}
    """
shell:
    """
    if [ ! -d "{output}" ]; then echo 'bakta db not located, downloading to {output}'
    bakta_db download --output {output} &> {log}
    fi
    """
shell:
    """
    bakta --db {params.bakta_db} --verbose --output {params.bakta_out} --prefix {wildcards.sample} --threads {threads} {input[1]}
    cp {output}/*.log {log}
    """
shell:
    """
    mv {input}/*.gff3 {output} &> {log}
    """
shell:
    """
    cp -n {input} {output}
    """
shell:
    """
    shovill --minlen 200 --cpus {threads} --outdir {output.shov_out} --R1 {input.r1_filt} --R2 {input.r2_filt} 1> {log.out} 2> {log.err}
    cp {output.shov_out}/contigs.fa {output.assembly}
    """
shell:
    "assembly-stats -t {input} > {output}"
shell:
    """
    mkdir {output}
    cp {input} {output}
    """
script:
    "../../scripts/combine_assembly_stats.py"
shell:
    "abricate --nopath --datadir {params.datadir} --threads {threads} --db {params.db} {input.assembly} > {output} 2> {log}"
shell:
    "abricate --nopath --datadir {params.datadir} --threads {threads} --db {params.db} {input.assembly} > {output} 2> {log}"
shell:
    """"
    cat {input} > {output.out1}
    perl -p -i -e 's@\.fasta@@g' {output.out1}
    #Rename first column name
    sed -E '1s/#FILE/name/g' {output.out1} > {output.out2}
    #Remove duplicate headers by negative grep
    grep -v '#FILE' {output.out2} > {output.out}
    """
script:
    "../../scripts/combine_abricate.py"
shell:
    """
    mob_init -d resources/dbs/mobsuite 1> {log.out} 2> {log.err}
    """
shell:
    "mob_recon --force --num_threads {threads} --infile {input.assembly} --outdir {params.outdir} 1> {log.out} 2> {log.err}"
script:
    "../../scripts/combine_mobsuite.py"
shell:
    """
    mkdir -p {params.ezclermont_outdir}
    ! ezclermont {input} --logfile {output.full} > {output.simple}
    """
script:
    "../../scripts/combine_ezclermont.py"
shell:
    """
    git clone https://github.com/A-BN/ClermonTyping.git resources/tools/clermontyping
    """
shell:
    """
    {input.github_repo}/clermonTyping.sh --fasta {input.assembly} --name {wildcards.sample} --minimal &> {log}
    ls {params}
    mv {wildcards.sample}/* {params}
    rmdir {wildcards.sample}
    """
script:
    "../../scripts/combine_clermontyping.py"
shell:
    """
    python3 {params.pmlst_script_path} -i {input} -o {params.output_dir} -p {params.pmlst_db_path} {params.pmlst_tool} $CONDA_PREFIX/bin/blastn -s {params.db} -x -t {params.tmp}
    rm -rf {params.tmp}
    """
script:
    "../../scripts/combine_pMLST.py"
shell:
    """
    touch {input}
    """
script:
    "../../scripts/combine_IncF_RST.py"
shell:
    """
    resources/tools/pointfinder/PointFinder.py -i {input} -o {params.output_dir} -p resources/tools/pointfinder/pointfinder_db {params.species} -m blastn -m_p $CONDA_PREFIX/bin/blastn 2> {log}
    """
shell:
    """awk 'NR == 1 {{print "name\t" $0; next;}}{{print FILENAME "\t" $0;}}' {input} > {output}"""
script:
    "../../scripts/combine_pointfinder.py"
shell:
    """
    fastp -i {input.r1} -I {input.r2} -o {output.r1_filt} -O {output.r2_filt} --thread {threads} -j {output.json} -h {output.html} &> {log}
    """
shell:
    """
    cp -n {input} {output}
    """
shell:
    "roary -p {threads} -e -v -n -r -cd {params.core_req} -qc -k {params.kraken_db} -f {params.out} {input}  2> {log}"
shell:
    "snp-sites -c {input} > {output}"
shell:
    "iqtree -s {input} -m MFP -ntmax {threads} -bb 1000"
shell:
    "iqtree -s {input} -m MFP -ntmax {threads} -bb 1000"
shell:
    """
    kraken2 --db {input.db} --use-names --memory-mapping --threads {threads} --report {output.report} --output {output.out} {input.assembly}  2> {log}
    """
shell:
    """
    kraken2 --db {input.db} --use-names --memory-mapping --threads {threads} --report {output.report} --output {output.out} {input.r1_filt} {input.r2_filt}  2> {log}
    """
shell:
    "resources/tools/Bracken/bracken -d {input.db} -i {input.report} -o {output.bracken} -w {output.species} -r 100 -l S -t {threads} 2>&1 {log}"
script:
    "../../scripts/combine_bracken.py"
shell:
    """
    mkdir -p {output.spifinder_out}
    resources/tools/spifinder/spifinder.py -p resources/tools/spifinder_db -x -o {output.spifinder_out} -l 0.60 -t 0.95 -mp $CONDA_PREFIX/bin/kma -i {input.r1_filt} {input.r2_filt}
    cp {output.spifinder_out}/results_tab.tsv {output.spifinder_summary}
    """
script:
    "../../scripts/combine_spifinder.py"
shell:
    """
    mkdir -p {output.spifinder_out}
    resources/tools/spifinder/spifinder.py -p resources/tools/spifinder_db -x -o {output.spifinder_out} -l 0.60 -t 0.95 -mp $CONDA_PREFIX/bin/blastn -i {input}
    cp {output.spifinder_out}/results_tab.tsv {output.spifinder_summary}
    """
script:
    "../../scripts/combine_spifinder.py"
shell:
    "mlst {input} > {output}"
script:
    "../../scripts/combine_mlst.py"
