Ancestral Fasta Generation Workflow for Population Genomics

public public 1yr ago 0 bookmarks

make-ancestral-fasta-snakeflow

Introduction

For a lot of interesting population genomics work today, it is necessary or useful to know which allele at a SNP is derived, and which is ancestral in the species under study. For many cases, that information can be obtained (or at least approximated) by finding the base carried at each such site in a closely related species. An argument could be made for getting as many closely related species as possible, putting them all on a phylogenetic tree, along with the focal species, and then doing a gigantic, genome-wide multiple alignment and inferring the ancestral states at each position from all that information. That sounds like a lot of work when you are trying to get a whole genome’s worth of ancestral states, and in many cases you might not have a whole lot of genomes to work with, anyway. So, here, I will take a cue from an ANGSD website example here where they use chimpanzee to designate ancestral states so as to polarize the site frequency spectrum.

For both the program ANGSD and the more recently developed software RELATE this information about ancestral alleles/bases can be input to the programs as a fasta file of the same length as the focal species fasta file, but just including the sequence of the closely related species.

It seems to me that there are not a lot of really great workflows for creating such fasta files. That is what I am trying to provide here. My main goal for it is aligning large chunks of genomes between closely related salmonid species, and then creating fasta files giving the ancestral alleles.

For example, if I have a lot of high-quality, whole-genome sequencing from O. mykiss that I want to phase and throw into RELATE, I can align the Chinook salmon genome to the O. mykiss genome and use that to make the ancestral fasta. In this case, we say that the O. mykiss genome is our target genome, and Chinook salmon is our query genome. For the most part, I am most interested in segments that have been assembled into chromosomes in both species.

Especially because of the duplicated genome history of salmonids, it is worth taking a look at the results of an original round of alignment in order to carefully select which chromosomes appear to be homologous. Therefore, this workflow must be down in two steps:

I. Do an initial round of alignment and plot some results from that. From these results, create a CSV file like that found in config/homolog_sets.csv , which tells us which query-species chromosomes appear to be homologous to each target-species chromosome. The first 10 lines of that file appear here.

target,homologs
CM007935.1,CM009207.1
CM007936.1,CM009220.1 CM009224.1
CM007937.1,CM009204.1
CM007938.1,CM009202.1 CM009219.1
CM007939.1,CM009206.1 CM009221.1
CM007940.1,CM009205.1
CM007941.1,CM009208.1
CM007942.1,CM009206.1 CM009211.1
CM007943.1,CM009211.1 CM009217.1

Note that if multiple query chromosomes align to a target chromsome, they are separated by spaces.

  1. Using the file config/homolog_sets.csv , map only the appropriate homologs to each target chromosome, and then turn the result into the ancestral fasta file.

Because there are two distinct steps to this workflow, and there has to be some user interaction, there is not a simple all rule that runs everything. Rather, explicit output files must requested from snakemake on the command line, as described in the sections of the overview below.

Overview of the two parts of the workflow

Finding query chromosomes homologous to each target chromosome

In brief, this part of the snakemake workflow does this:

  1. Choose the target genome and declare which sequences in that you want to align things to. For my purposes this is typically the assembled chromosomes.

  2. Choose the query genome and declare which sequences from the query you wish to map to the target. Once again, this is typically the assembled chromosomes, though one might wish to use all the sequences, even those that are not fully assembled into chromosomes.

  3. Break the target genome into a bunch of smaller fastas, each containing exactly one of the sequences/chromosomes. We will call these “single-chromosome fastas.”

  4. Map all the query sequences against each of the single-chromosome fastas.

  5. Run single_cov2 on each of resulting MAFs, so that we retain only the very best alignment of any overlapping alignments.

  6. Summarize how many base pairs are aligned onto each single-chromosome fasta from each of the different target query sequences (i.e. chromosomes). And also prepare a file of alignments that is suitable for plotting with ggplot.

These steps are all accomplished by running the workflow to this point with a command like:

snakemake --cores 20 --use-conda \
 results/report/step20_notransition_inner1000_ident92/aligned_lengths.tsv \
 results/report/step20_notransition_inner1000_ident92/plotable_segments.tsv

BigNote: The above command requests by way of the directory name used (i.e., step20_notransition_inner1000_ident92 ), that the following options be passed to lastz : --steps=20 --notransition --inner=1000 --identity=92. Users wanting different values of those parameters can simply request the output file in a different directory, like step15_notransition_inner950_ident97 , and snakemake will take care of it. This way, it makes it pretty easy to do multiple runs at different parameter options and collect all the results.

We can make a rulegraph for those outputs like this:

snakemake --rulegraph results/report/step20_notransition_inner1000_ident92/aligned_lengths.tsv results/report/step20_notransition_inner1000_ident92/plotable_segments.tsv | dot -Tsvg > figs/rulegraph1.svg

It looks like this:

Once those two output files are available, they can be used to create some plots using the not-so-well-documented code in directory `misc` like this:

The plot above (made in `misc/plot-aligned-length.R) shows that of the material that aligns to each chromosome, most of it comes from one, two, or three different query chromosomes.

When you go assigning homologs to chromosomes, however, it is also important to see if a section of a query chromosome appears to align to two places (which should not happen, usually), or if there are paralogous segments that are mapping anywhere. To answer those questions, it is helpful to make a huge cross-alignment facet grid like the one made by the script misc/homology-grid.R :

It is pretty easy to go through this by eyeball and choose the query chromosomes that belong in each row of the config/homolog_sets.csv files. Basically, dark lines in the columns represent parts of different query chromosomes that map to the same target chromosome. Dark lines in the rows represent parts of a single query chromosome that align to different target chromsomes. You want to choose things so that no part of a query chromosome is going to get re-used. This means that the same part of the query chromosome should not appear in multiple columns. Also, mapping of paralogs is somewhat evident here by the (somewhat lighter) lines. These should not be paired with the target chromosomes.

Once the file config/homolog_sets.csv is made, you can proceed to the next section.

Mapping homologous chromosomes and creating a fasta

The contents of config/homolog_sets.csv are used to guide the mapping in the second half of the workflow. In brief, the steps in this part of the snakemake workflow are as follows:

  1. Create a query fasta file for each target chromosome that contains the query chromosomes that have some homologous material to the target.

  2. Run lastz to map each query fasta against each target fasta. (You might want to use more stringent alignment criteria for this.)

  3. Run single_cov2 on each of the outputs.

  4. Run maf2fasta (from the multiz package) on each the resulting single_cov2 outputs.

  5. Use an R-script to condense the resulting fastas into a fasta that is congruent with the target genome. And at the same time, prepare a summary of the number of bases in the alignment for that chromosome that are of different types (e.g. target has a C and query has an A , etc.).

  6. Catenate all the those condensed fastas into a file with a path like:

results/catenated_anc_fasta/step20_notransition_inner1000_ident95/ancestral.fna

and catenate all the summaries into a file with a path like this:

results/report/step20_notransition_inner1000_ident95/pairwise_aligned_base_summary.csv

This is achieved by, for example, doing:

snakemake --cores 20 --use-conda --use-envmodules \ 
 results/catenated_anc_fasta/step20_notransition_inner1000_ident95/ancestral.fna \  
 results/report/step20_notransition_inner1000_ident95/pairwise_aligned_base_summary.csv

The rulegraph for this second section of the workflow can be obtained with this:

snakemake --rulegraph results/catenated_anc_fasta/step20_notransition_inner1000_ident95/ancestral.fna results/report/step20_notransition_inner1000_ident95/pairwise_aligned_base_summary.csv | dot -Tsvg > figs/rulegraph2.svg

And it looks like this:

Configuration

All the configuration can be done by modifying the contents of the config directory.

This workflow is set up showing the values used for mapping the Chinook genome (the query) to the O. mykiss genome (our target). You can change values in the config/config.yaml file to work for your own pair of closely related species.

More on this later, but it should be pretty self-explanatory for anyone familiar with snakemake.

Code Snippets

14
15
16
shell:
  "wget -O {output.fna}.gz {params.url} 2> {log.wget}; "
  " gunzip {output.fna}.gz  2> {log.gunzip}"
28
29
shell:
  "samtools faidx {input.fna} 2> {log}"
16
17
shell:
	"samtools faidx {input.tfasta} {params.tc} > {output} 2> {log}"
34
35
shell:
	"samtools faidx {input.qfasta} {params.qc} > {output} 2> {log}"
59
60
shell:
	"lastz {input} {params} > {output} 2> {log}"
75
76
shell:
	"single_cov2 {input} > {output} 2> {log}"
22
23
shell:
	"samtools faidx {input.qfasta} {params.homos} > {output} 2> {log}"
48
49
shell:
	"lastz {input} {params} > {output} 2> {log}"
64
65
shell:
	"single_cov2 {input} > {output} 2> {log}"
80
81
shell:
	"maf2fasta {input.target_fna} {input.maf} fasta  > {output} 2> {log}"
100
101
script:
	"../scripts/condense-and-summarise-fastas.R"
111
112
shell:
	"cat {input} > {output} 2> {log}"
122
123
shell:
	"cat {input} > {output} 2> {log}"
134
135
shell:
	"awk 'NR==1 {{header=$0; print; next}} $0==header {{next}} {{print}}' {input} > {output} 2> {log} "
7
8
shell:
	"workflow/scripts/count_aligned_bases.sh {wildcards.tchrom} {input} > {output}"
17
18
shell:
	"(echo -e 'target\tquery\tnum_bases\tfraction'; cat {input}) > {output}"
26
27
shell:
	"workflow/scripts/plotable-segs-from-maf.sh {input} > {output}"
35
36
shell:
	"(echo -e 'target\ttstart\ttend\ttstrand\tquery\tqstart\tqend\tqstrand'; cat {input}) > {output}"
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
log <- file(snakemake@log[[1]], open="wt")
sink(log, type = "output")
sink(log, type = "message")



#### get all the snakemake variables ####
maf2fasta_output <- snakemake@input$maf2fasta

final_fasta <- snakemake@output$anc_fasta
pair_summary_file <- snakemake@output$bp_pairs
mask_fasta <- snakemake@output$mask_fasta

# we need the target sequence name for a few things.
# because lastz munges names with dots in them, we just use the output
# file name, which has the {tchrom} on it.
tchrom = snakemake@wildcards$tchrom


#### Do the rest ####

library(tidyverse)


# read the sequences in.  They are on every other line.  The first
# is the target, the following are separate ones for each query chromosome
seqs <- read_lines(maf2fasta_output)

# break those sequences into a list of vectors
seq_vec_list <- str_split(seqs[seq(2, length(seqs), by = 2)], pattern = "")

snames <- seqs[seq(1, length(seqs), by = 2)]
rm(seqs)

# if there were multiple query sequences, condense them into a single one.
# the "-" has the lowest value of any of the possible letters in the aligned
# sequences, so this step takes the base at the aligned query sequence, if there
# is an aligned query sequency. (Remember we did single_cov2 so there will be only
# one aligned query at any point.)
if(length(seq_vec_list) > 2) {
  anc <- do.call(pmax, seq_vec_list[-1])
} else {
  anc <- seq_vec_list[[2]]
}

# now, we count up the number of different types of sites
pair_counts <- table(paste(seq_vec_list[[1]], anc))

# and make a tibble of those numbers
count_summary <- tibble(
  name = names(pair_counts),
  n = as.integer(pair_counts)
) %>%
  separate(name, into = c("target", "ancestral"), sep = " ") %>%
  mutate(chrom = tchrom, .before = target)

# and write that file out
write_csv(count_summary, file = pair_summary_file)


# and now, from anc, we subset out the sites that are "-"s in the target.
# That gives us an ancestral sequence that is congruent with the
# original target sequence. And then we replace the "-"'s in the ancestral
# seq with Ns
anc_fasta_seq <- anc[seq_vec_list[[1]] != "-"]
anc_fasta_seq[anc_fasta_seq == "-"] <- "N"


# We also create a fasta that masks the sites that are considered
# to be in repeat regions in either the target or the ancestral
# genome.  This is done by marking any position that is not a
# capital letter A, C, G, or T
# in both sequences an N, and anything that is a capital lettter
# A, C, G, or T in both a P.
# Note that this also chucks sites that are designated by IUPAC codes
# in the query.  Hmmm....
targ_seq <- seq_vec_list[[1]]
targ_seq <- targ_seq[targ_seq != "-"]

mask_seq = ifelse(
  (targ_seq %in% c("A", "C", "G", "T")) & (anc_fasta_seq %in% c("A", "C", "G", "T")),
  "P",
  "N"
  )




# now make a matrix out of those (anc_fasta_seq and mask_seq) to print them in lines.
# We have to extend each with NAs to be a length that is a multiple of 70 so that
# it doesn't get chopped off when we make a matrix of it for fast printing.


final_line_bits <- length(anc_fasta_seq) %% 70
if(final_line_bits != 0) {
  length(anc_fasta_seq) <- length(anc_fasta_seq) + (70 - final_line_bits)
  length(mask_seq) <- length(mask_seq) + (70 - final_line_bits)
}

# now, make a matrix of those and write them out.  We ensure that
# the NA's in the last row of the matrix are just empty (i.e. ""'s).


# writing out the ancestral fasta
cat(">", tchrom, "\n", sep = "", file = final_fasta)
matrix(anc_fasta_seq, ncol = 70, byrow = TRUE) %>%
  write.table(
    file = final_fasta,
    sep = "",
    na = "",
    quote = FALSE,
    append = TRUE,
    row.names = FALSE,
    col.names = FALSE
  )



# writing out the mask fasta
cat(">", tchrom, "\n", sep = "", file = mask_fasta)
matrix(mask_seq, ncol = 70, byrow = TRUE) %>%
  write.table(
    file = mask_fasta,
    sep = "",
    na = "",
    quote = FALSE,
    append = TRUE,
    row.names = FALSE,
    col.names = FALSE
  )
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
CHROM=$1
INFILE=$2

awk -v chrom=$CHROM '
	/^a/ {seq=0}
	/^s/ {seq++}
	/^s/ && seq==2 {
		n[$2]+=length($7); 
		tot+=length($7);
		seq++
	} 
	END {
		for(i in n) printf("%s\t%s\t%d\t%.4f\n",chrom, i,n[i], n[i]/tot)
	}
' $INFILE | sort -n -b -r -k 3
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
awk '
	/^a/ {seq=0} 
	/^s/ {
		seq++;
		if($5 == "-") {
			start = $6 - $3;
			end = start - $4;
		} else {
			start = $3;
			end = $3 + $4
		}
		printf("%s\t%d\t%d\t%s", $2, start, end, $5);
		if(seq == 1) printf("\t");
		else printf("\n");
	} 

' $1
ShowHide 19 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/eriqande/make-ancestral-fasta-snakeflow
Name: make-ancestral-fasta-snakeflow
Version: 1
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: None
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...