snake-IBDne
Repository for running a custom Snakemake pipeline for IBDNe.
Citation for the programs used in the pipeline (merge-ibd-segments.16May19.ad5.jar and ibdne.04Sep15.e78.jar):
B L Browning and S R Browning (2013). Improving the accuracy and efficiency of identity-by-descent detection in population data. Genetics 194(2):459-71. doi:10.1534/genetics.113.150029
Running snake-IBDne
Project Folder
Please run each dataset in its own separate folder for clarity. The Henn lab project directory is:
/share/hennlab/projects/snake-IBDNe/
Required software:
- snakemake
- shapeit v2.r904
- R 3.6.1
- plink 1.90p
- GERMLINE2
- ibdne.04Sep15.e78.jar (download from here)
- merge-ibd-segments.16May19.ad5.jar
Setting up conda environment
- For Henn lab users: run this command every time before you use conda:
source /share/hennlab/progs/miniconda3/etc/profile.d/conda.sh
- To create the environment (first time only; on later runs skip straight to activation):
conda create -n IBDne-env r-base shapeit plink scipy python=2.7 r-biocmanager
- To activate the environment (every time you run the pipeline):
conda activate IBDne-env
- To install the required R packages (once only, when setting up the environment):
conda install -c bioconda bioconductor-genomicranges
conda install -c r r-data.table
- Once inside the environment, you must manually add GERMLINE2 to your path (before each run of the pipeline):
export PATH="/share/hennlab/progs/GERMLINE2-master:$PATH"
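In short, the per-run setup for Henn lab users boils down to three commands (one-time environment creation and R-package installation omitted):

source /share/hennlab/progs/miniconda3/etc/profile.d/conda.sh
conda activate IBDne-env
export PATH="/share/hennlab/progs/GERMLINE2-master:$PATH"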
Required files in working directory: (Please note - you must run each population in its own directory. An example layout is sketched after this list.)
- data folder with input data:
  - if data is NOT phased yet: the data folder must contain bim, bed, fam files named in the format {dataset}.bim/.bed/.fam
  - if data IS phased already: the data folder must contain plink .haps, .sample files for individual chromosomes named in the format {dataset}.chr{chrnum}.phased.haps/.sample
- regions folder with exclude_regions_hg19.txt
- scripts folder with the following required scripts:
  - bits_matching.sh
  - calc_genome_length.R
  - calculate_concordance.sh
  - choose_params.sh
  - combine_chrom.sh
  - comb_match_files.sh
  - compare_IBD_kinship.R
  - compare_RoH.R
  - define_regions.sh
  - filter_RoH-IBD_segs.R
  - find_lowDensityRegs.R
  - get_outlier_regions.R
  - get_outlier_SNPs.R
  - get_seg_depthv3.R
  - join_RoHsegs.R
  - shapeit_to_germline.py
  - trim_bimfile.R
  - shapeit_iterate.sh
- progs folder with the following java programs:
  - merge-ibd-segments.16May19.ad5.jar
  - ibdne.07May18.6a4.jar
- Snakefile
  - the file entitled "ibdne_2versions.smk" is an alternative Snakefile for comparing different versions of IBDne. Please use "Snakefile" for regular runs of the pipeline.
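For illustration, a working directory for a dataset named {dataset} would look roughly like this (script list abbreviated; see above):

working_dir/
├── Snakefile
├── ibdne_2versions.smk                    (optional, for version comparisons)
├── data/
│   └── {dataset}.bim/.bed/.fam            # or {dataset}.chr{chrnum}.phased.haps/.sample if pre-phased
├── regions/
│   └── exclude_regions_hg19.txt
├── scripts/
│   ├── bits_matching.sh
│   ├── calc_genome_length.R
│   └── ...                                # remaining scripts from the list above
└── progs/
    ├── merge-ibd-segments.16May19.ad5.jar
    └── ibdne.07May18.6a4.jar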
How to run:
The snakemake command must be run with four config parameters:
- dataset: filename of the initial QC'ed bim/bed/fam files (without file extension). Example: if the files are named "Himba_merged.bim" etc., set dataset=Himba_merged
- phased: FALSE if the input data is not phased, TRUE if it is already phased.
- gmap_chr_dir: directory containing plink-format recombination maps for separate chromosomes.
  - filenames MUST be named in the format chr{chrnum}.gmap.txt
  - remember to put a / at the end of the path, for example: gmap_chr_dir=/share/hennlab/reference/snake-IBDne_maps/
- ref: file prefix of the reference population 1000G .hap.gz, .legend, and .sample files, excluding the _chr{chrnum} portion.
  - example: if the files are titled 1000GP_Phase3/1000GP_Phase3_chr6.hap.gz, 1000GP_Phase3/1000GP_Phase3_chr6.legend, and 1000GP_Phase3/1000GP_Phase3.sample, provide the command-line argument as ref=1000GP_Phase3/1000GP_Phase3
  - note that the .hap.gz and .legend files must be named with _chr{chrnum}.hap.gz and _chr{chrnum}.legend following the prefix you provide in the ref option
  - the reference data can be downloaded from https://mathgen.stats.ox.ac.uk/impute/1000GP_Phase3.html
  - for Henn lab users, these files are located at /share/hennlab/reference/1000G_Phase3_haps-sample-legend/

A quick file-naming check is sketched after this list.
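Before launching, it can help to confirm that the map and reference files follow the naming conventions above. A minimal check, assuming the Henn lab paths as an example (substitute your own; legend files follow the same _chr{chrnum} pattern):

GMAP_CHR_DIR=/share/hennlab/reference/snake-IBDne_maps/
REF=/share/hennlab/reference/1000G_Phase3_haps-sample-legend/1000GP_Phase3/1000GP_Phase3
for chr in $(seq 1 22); do
    [ -f ${GMAP_CHR_DIR}chr${chr}.gmap.txt ] || echo "missing recombination map: chr${chr}.gmap.txt"
    [ -f ${REF}_chr${chr}.hap.gz ] || echo "missing reference haplotypes: ${REF}_chr${chr}.hap.gz"
done
[ -f ${REF}.sample ] || echo "missing reference sample file: ${REF}.sample"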
Example:

# Dry run - testing workflow
/share/hennlab/progs/miniconda3/bin/snakemake -n --config dataset=xal phased=TRUE ref=/share/hennlab/reference/1000G_Phase3_haps-sample-legend/1000GP_Phase3/1000GP_Phase3 gmap_chr_dir=/share/hennlab/projects/snake-IBDNe/austin_files/ -p -j 10

# Generate DAG of pipeline
/share/hennlab/progs/miniconda3/bin/snakemake --config dataset=xal phased=TRUE ref=/share/hennlab/reference/1000G_Phase3_haps-sample-legend/1000GP_Phase3/1000GP_Phase3 gmap_chr_dir=/share/hennlab/projects/snake-IBDNe/austin_files/ --rulegraph | dot -Tpng > rulegraph.png

# Run the pipeline!
/share/hennlab/progs/miniconda3/bin/snakemake --config dataset=xal phased=TRUE ref=/share/hennlab/reference/1000G_Phase3_haps-sample-legend/1000GP_Phase3/1000GP_Phase3 gmap_chr_dir=/share/hennlab/projects/Xal_snake-IBDne/austin_files/ -p -j 20
Pipeline Overview
Running ibdne_2versions.smk
This Snakefile outputs results from two versions of IBDne (ibdne.23Apr20.ae9.jar and ibdne.04Sep15.e78.jar) for comparison. To run this version, add the flag -s ibdne_2versions.smk to the snakemake command, as in the sketch below.
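For example, reusing the config values from the example run above (a sketch; substitute your own dataset and paths):

/share/hennlab/progs/miniconda3/bin/snakemake -s ibdne_2versions.smk --config dataset=xal phased=TRUE ref=/share/hennlab/reference/1000G_Phase3_haps-sample-legend/1000GP_Phase3/1000GP_Phase3 gmap_chr_dir=/share/hennlab/projects/snake-IBDNe/austin_files/ -p -j 20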
Code Snippets
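The snippets below are excerpts from the pipeline: first the helper shell and R scripts from the scripts folder, then the shell directives of individual Snakefile rules. They run automatically as part of the pipeline and are reproduced here for reference.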
# For each GERMLINE2 parameter combination (bits x err_hom), compare IBD calls against PLINK RoH calls (segments >= 3500 kb)
DATASET=$1
BITS_RANGE=`seq 5 5 200`
ERRHOM_RANGE=`seq 1 2`

echo Pop hap/notHap bits err_hom err_het diff/total match/total > results/other/${DATASET}_GL_IBD_3500kb.txt

for BITS in ${BITS_RANGE[@]}
do
    for ERRHOM in ${ERRHOM_RANGE[@]}
    do
        grep -i $DATASET results/RoH_param_combs/${BITS}/IBD_${BITS}_${ERRHOM}_hap_segsJoined_QCed.ibd > results/RoH_param_combs/${BITS}/${DATASET}_IBD_${BITS}_${ERRHOM}_hap_segsJoined_QCed.ibd
        echo $DATASET hap $BITS $ERRHOM NA `Rscript scripts/compare_RoH.R results/calc_RoH/${DATASET}_RoH_segsJoined_QCed.hom results/RoH_param_combs/${BITS}/IBD_${BITS}_${ERRHOM}_hap_segsJoined_QCed.ibd 3500` >> results/other/${DATASET}_GL_IBD_3500kb.txt
    done
done
# For each parameter combination, compute the normalized RMSE between IBD-based kinship and plink --genome estimates
DATASET=$1
BITS_RANGE=`seq 5 5 200`
ERRHOM_RANGE=`seq 1 2`

echo 'BITS ERRHOM ERRHET NRMSE' > results/other/${DATASET}_NormRMSE.txt
GENOME_LENGTH=`Rscript scripts/calc_genome_length.R results/other/${DATASET}_HWE-MAFfiltered.bim`

for BITS in ${BITS_RANGE[@]}
do
    for ERRHOM in ${ERRHOM_RANGE[@]}
    do
        echo $BITS $ERRHOM NA `Rscript scripts/compare_IBD_kinship.R results/RoH_param_combs/${BITS}/IBD_${BITS}_${ERRHOM}_hap_segsJoined_QCed.ibd results/other/${DATASET}.genome $GENOME_LENGTH` >> results/other/${DATASET}_NormRMSE.txt
    done
done
# Rank parameter combinations by combined score and report the best GERMLINE2 parameters
IBD_3500=$1
NormRMSE=$2
P=$3

paste -d" " <(awk '{print $0}' $IBD_3500) <(awk '{print $4}' $NormRMSE) > temp
awk '{print $1, $2, $3, $4, $5, $6, $7, $8, (1-$6 + $7 + 1-$8)/3}' temp > temp2
sed -r '1s/\S+//9' temp2 > temp
sed -i '1s/$/score/' temp

# Look only at haploid
mv temp ${P}_GL_IBD_3500kb_onlyHap.txt; rm temp2

head -n 1 ${P}_GL_IBD_3500kb_onlyHap.txt > header.txt
sed '1d' ${P}_GL_IBD_3500kb_onlyHap.txt > data.txt
sort -nk9 data.txt > sorted_data.txt ; rm data.txt
cat header.txt sorted_data.txt > ${P}_GL_IBD_3500kb_onlyHap.txt ; rm header.txt sorted_data.txt

cat ${P}_GL_IBD_3500kb_onlyHap.txt

PARAMS=`tail -n 1 ${P}_GL_IBD_3500kb_onlyHap.txt | cut -f 2-9 -d" "`
PARAMS=(${PARAMS//,/ })
echo $PARAMS

TYPE=${PARAMS[0]}
BITS=${PARAMS[1]}
ERRHOM=${PARAMS[2]}
PERC_DIFF=${PARAMS[4]}
PERC_MATCH=${PARAMS[5]}
KIN_RESIDS=${PARAMS[6]}
SCORE=${PARAMS[7]}

echo -e $P GERMLINE2 parameters"\n"Type: ${TYPE}"\n"Bits: ${BITS}"\n"Err_hom: ${ERRHOM}"\n"Percent difference: ${PERC_DIFF}"\n"Percent match: ${PERC_MATCH}"\n"Kin residuals: ${KIN_RESIDS}"\n"Score: ${SCORE} > results/other/${P}_GL_parameters.txt

# awk -v Pop="$P" -v IGNORECASE=1 '$1 ~ Pop && $3 ~ Pop {print}' RoH_param_combs/${BITS}/IBD_${BITS}_${ERRHOM}_hap_segsJoined_QCed.ibd | cut -f1-7,9 > ${P}_only_QCed.ibd
cat results/RoH_param_combs/${BITS}/IBD_${BITS}_${ERRHOM}_hap_segsJoined_QCed.ibd > results/other/${P}_only_QCed.ibd
# Concatenate per-chromosome GERMLINE2 .match files for each parameter combination
BITS_RANGE=`seq 5 5 200` # GERMLINE2 IBD segment seed
ERRHOM_RANGE=`seq 1 2`
CHRS=`seq 1 22`
DATASET=$1

for BITS in ${BITS_RANGE[@]}
do
    for ERRHOM in ${ERRHOM_RANGE[@]}
    do
        for chr in ${CHRS[@]}
        do
            cat results/RoH_param_combs/${BITS}/${DATASET}.chr${chr}.phased_GERMLINE2_RoH_${BITS}_${ERRHOM}_hap.match >> results/RoH_param_combs/${BITS}/IBD_${BITS}_${ERRHOM}_hap.match
        done
    done
done
# Build the exclude-regions file for IBDNe, adding IBD-depth outlier regions when any were found
FILE=$1
DATASET=$2

BITS=$(echo $FILE | cut -d '_' -f5)
ERRHOM=$(echo $FILE | cut -d '_' -f6)

N=`wc -l $FILE | cut -f1 -d" "`
if [ $N -gt 0 ]
then
    Rscript scripts/get_outlier_regions.R results/RoH_param_combs/${BITS}/GERMLINE2_IBDdepth_${BITS}_${ERRHOM}_hap_IBDoutliers 3e6 2.5e5 > results/RoH_param_combs/${BITS}/GERMLINE2_IBDdepth_${BITS}_${ERRHOM}_hap_IBDoutliers_regions
    cat regions/exclude_regions_hg19.txt results/RoH_param_combs/${BITS}/GERMLINE2_IBDdepth_${BITS}_${ERRHOM}_hap_IBDoutliers_regions results/calc_RoH/${DATASET}_lowDensityRegions.txt > results/RoH_param_combs/${BITS}/exclude_regions_hg19_forIBDNe_${BITS}_${ERRHOM}_hap.txt
else
    cat regions/exclude_regions_hg19.txt results/calc_RoH/${DATASET}_lowDensityRegions.txt > results/RoH_param_combs/${BITS}/exclude_regions_hg19_forIBDNe_${BITS}_${ERRHOM}_hap.txt
fi
# Remove RoH/IBD segments that fall almost entirely inside excluded regions
args = commandArgs(trailingOnly=TRUE)

# Defaults
perc.overlap <- 0.95

# Error handling and updating default values
if (length(args)<2) {
  stop("At least two arguments must be supplied", call.=FALSE)
} else if (length(args)==3) {
  perc.overlap <- as.numeric(args[3])
} else if (length(args)>3) {
  stop("Too many arguments", call.=FALSE)
}

options(stringsAsFactors=F)

fileNameSplit <- strsplit(args[1], split=".", fixed=T)[[1]]
ext.ind <- which(fileNameSplit %in% c("match", "hom", "ibd"))
stem <- paste(fileNameSplit[-ext.ind], collapse=".")
ext <- fileNameSplit[ext.ind]

if (ext=="hom") {
  seg.file <- read.table(args[1], header=T)
  colnames(seg.file)[7:8] <- c("start", "end")
} else if (ext=="match") {
  seg.file <- read.table(args[1])
  if (ncol(seg.file)==13) {
    colnames(seg.file)[3:5] <- c("CHR", "start", "end")
  } else if (ncol(seg.file)==15) {
    colnames(seg.file)[5:7] <- c("CHR", "start", "end")
  }
} else if (ext=="ibd") {
  seg.file <- read.table(args[1])
  colnames(seg.file)[5:7] <- c("CHR", "start", "end")
}

excl.regs <- read.table(args[2])
colnames(excl.regs) <- c("CHR", "start", "end", "type")

suppressMessages(library("GenomicRanges"))

Excl.Regs <- makeGRangesFromDataFrame(excl.regs[,c("CHR", "start", "end")], seqnames.field="CHR")
Seg.File <- makeGRangesFromDataFrame(seg.file[,c("CHR", "start", "end")], seqnames.field="CHR")

Hits <- findOverlaps(Seg.File, Excl.Regs) # find all hits
Overlaps <- pintersect(Seg.File[queryHits(Hits)], Excl.Regs[subjectHits(Hits)])
percentOverlap <- width(Overlaps)/width(Seg.File[queryHits(Hits)])
HighOverlapHits <- Hits[percentOverlap > perc.overlap] # Which have high overlap? Some will be removed

# Guard against an empty hit set: a negative empty index would drop every row
if (length(HighOverlapHits) > 0) {
  new.seg.file <- seg.file[-queryHits(HighOverlapHits),]
} else {
  new.seg.file <- seg.file
}

if (ext=="hom") {
  COLNAMES <- c(colnames(new.seg.file)[1:6], "POS1", "POS2", colnames(new.seg.file)[9:13])
  write.table(new.seg.file, "", row.names=F, col.names=COLNAMES, quote=F, sep="\t")
} else {
  write.table(new.seg.file, "", row.names=F, col.names=F, quote=F, sep="\t")
}
# Identify low-SNP-density regions of the genome from a .bim file
args = commandArgs(trailingOnly=TRUE)

# Defaults
perc.thresh <- 0.05

# Error handling and updating default values
if (length(args)<3) {
  stop("At least three arguments must be supplied", call.=FALSE)
} else if (length(args)==4) {
  perc.thresh <- as.numeric(args[4])
} else if (length(args)>4) {
  stop("Too many arguments", call.=FALSE)
}

options(stringsAsFactors=F)

Bim <- read.table(args[1])
WinSize <- as.numeric(args[3])

fileNameSplit <- strsplit(args[1], split=".", fixed=T)[[1]]
ext.ind <- match("bim", fileNameSplit)
stem <- paste(fileNameSplit[-ext.ind], collapse=".")

genome.LDRs <- as.data.frame(matrix(ncol=3, nrow=0))

for (chr in unique(Bim$V1)) {
  # Calculate SNP density across the chromosome in windows
  chr.data <- Bim[Bim$V1==chr,]
  start <- chr.data[1, "V4"]
  end <- chr.data[nrow(chr.data), "V4"]
  Hist <- hist(chr.data$V4, breaks=seq(start, end+WinSize, WinSize), plot=F)
  Counts <- Hist$counts
  Breaks <- Hist$breaks

  # Store count information per region
  chrWinInfo <- as.data.frame(matrix(ncol=4, nrow=length(Counts)))
  colnames(chrWinInfo) <- c("chr", "start", "end", "nSNPs")
  chrWinInfo$chr <- chr
  chrWinInfo$start <- Breaks[1:length(Counts)]
  chrWinInfo$end <- Breaks[2:length(Breaks)]
  chrWinInfo$nSNPs <- Counts

  # Find low density regions
  min.SNPs <- max(sort(Counts)[length(Counts)*perc.thresh], 1/5e4*WinSize)
  low.dens.wins <- chrWinInfo[which(chrWinInfo$nSNPs < min.SNPs), 1:3]

  # If low density regions are found, join contiguous windows into regions
  if (nrow(low.dens.wins) > 0) {
    Low.Density.Regions <- low.dens.wins[1, ]
    if (nrow(low.dens.wins) > 1) {
      for (x in 2:nrow(low.dens.wins)) {
        if (low.dens.wins[x, "start"] == Low.Density.Regions[nrow(Low.Density.Regions), "end"]) {
          Low.Density.Regions[nrow(Low.Density.Regions), "end"] <- low.dens.wins[x, "end"]
        } else {
          Low.Density.Regions[nrow(Low.Density.Regions)+1, ] <- low.dens.wins[x, ]
        }
      }
    }
    genome.LDRs <- rbind(genome.LDRs, Low.Density.Regions)
  }
}

genome.LDRs <- cbind(genome.LDRs, "low_density_region")
out_pre <- args[2]
write.table(genome.LDRs, paste(out_pre, "_lowDensityRegions.txt", sep=""), quote=F, row.names=F, col.names=F, sep="\t")
# Join outlier SNPs into outlier regions (+/- buffer around isolated SNPs; merged runs otherwise)
args = commandArgs(trailingOnly=TRUE)

# Defaults
maxRange <- 3e6
buffer <- 2e5

# Error handling and updating default values
if (length(args)<1) {
  stop("At least one argument must be supplied", call.=FALSE)
} else if (length(args)>3) {
  stop("Too many arguments", call.=FALSE)
}
if (length(args)>=2) {
  maxRange <- as.numeric(args[2])
}
if (length(args)==3) {
  buffer <- as.numeric(args[3])
}

FILE <- read.table(args[1], stringsAsFactors=F)

find.IBDoutlier.range <- function(pileup.outliers, maxRange, buffer) {
  outlier.regions <- as.data.frame(matrix(ncol=4, nrow=0))
  colnames(outlier.regions) <- c("CHR", "START", "END", "TYPE")

  # Split into chromosomes and sort the outlier values (going along the chromosome)
  chrs <- unique(pileup.outliers[,1])
  outliers.by.chr <- list()
  for (i in 1:length(chrs)) {
    outliers.by.chr[[i]] <- pileup.outliers[pileup.outliers[,1]==chrs[i],]
    outliers.by.chr[[i]] <- outliers.by.chr[[i]][order(outliers.by.chr[[i]][,2]),2]
  }

  for (i in 1:length(chrs)) {
    # If there is only one outlier SNP, define the region as +/- the buffer around that SNP
    if (length(outliers.by.chr[[i]])==1) {
      outlier.range <- c(outliers.by.chr[[i]][1] - buffer, outliers.by.chr[[i]][1] + buffer)
      outlier.regions[nrow(outlier.regions)+1,] <- c(chrs[i], outlier.range, "depth_outlier")
    }
    # If there is more than one outlier SNP, initiate a new region that begins with the first SNP
    if (length(outliers.by.chr[[i]])>1) {
      outlier.range <- c()
      outlier.range[1] <- outliers.by.chr[[i]][1] - buffer
      # Find endpoints of the region
      endpoint.inds <- which(outliers.by.chr[[i]][1:((length(outliers.by.chr[[i]]))-1)] - outliers.by.chr[[i]][-1] < -maxRange)
      endpoint.inds <- c(endpoint.inds, length(outliers.by.chr[[i]]))
      for (n in 1:length(endpoint.inds)) {
        if (n==length(endpoint.inds)) {
          outlier.range[2] <- outliers.by.chr[[i]][endpoint.inds[n]] + buffer
          outlier.regions[nrow(outlier.regions)+1,] <- c(chrs[i], outlier.range, "depth_outlier")
        } else {
          outlier.range[2] <- outliers.by.chr[[i]][endpoint.inds[n]] + buffer
          outlier.regions[nrow(outlier.regions)+1,] <- c(chrs[i], outlier.range, "depth_outlier")
          outlier.range[1] <- outliers.by.chr[[i]][endpoint.inds[n]+1] - buffer
        }
      }
    }
  }
  return(outlier.regions)
}

results <- find.IBDoutlier.range(FILE, maxRange, buffer)
write.table(results, "", quote=F, col.names=F, row.names=F, sep=" ")
# Flag sites whose segment depth exceeds the mean by n standard deviations
args = commandArgs(trailingOnly=TRUE)

n.sds = 3
if (length(args)<1) {
  stop("One argument must be supplied", call.=FALSE)
} else if (length(args)>2) {
  stop("Too many arguments", call.=FALSE)
} else if (length(args)==2) {
  n.sds = as.numeric(args[2])
}

depth.file <- read.table(args[1], sep=" ", header=F)

upper <- mean(depth.file[,3]) + (n.sds * sd(depth.file[,3]))
outliers <- depth.file[which(depth.file[,3] >= upper),]

write.table(outliers, "", quote=F, col.names=F, row.names=F)
# Count, at each bim site, the number of overlapping segments (data.table implementation)
args = commandArgs(trailingOnly=TRUE)

# Check for correct number of arguments
if (length(args)<2) {
  stop("At least two arguments must be supplied", call.=FALSE)
} else if (length(args)>2) {
  stop("Too many arguments", call.=FALSE)
}

# Define functions
pkgTest <- function(x){
  if (!require(x, character.only = TRUE)){
    install.packages(x, dep=TRUE)
    if(!require(x, character.only = TRUE)) stop("Package not found")
  }
}

# Call packages
pkgTest("data.table")

# Read in data
options(stringsAsFactors=F)
fileName <- unlist(strsplit(args[1], split=".", fixed=TRUE))
ext <- fileName[length(fileName)]
if (ext == "hom") {
  seg.file <- read.table(args[1], header=T)
  colnames(seg.file)[4] <- "chr"
  colnames(seg.file)[7:8] <- c("segment.start", "segment.end")
} else if (ext == "ibd") {
  seg.file <- read.table(args[1], header=F)
  colnames(seg.file)[5:7] <- c("chr", "segment.start", "segment.end")
} else if (ext == "match") {
  seg.file <- read.table(args[1], header=F)
  if (ncol(seg.file)==13) {
    colnames(seg.file)[3:5] <- c("chr", "segment.start", "segment.end")
  } else if (ncol(seg.file)==15) {
    colnames(seg.file)[5:7] <- c("chr", "segment.start", "segment.end")
  }
}

bim <- read.table(args[2])
colnames(bim)[1:4] <- c("chr","rsid","dist","pos")

setDT(seg.file)
setDT(bim)

bim[, pos2 := pos]

setkey(seg.file, chr, segment.start, segment.end)
setkey(bim, chr, pos, pos2)

newdepthcounts <- foverlaps(bim, seg.file)[
  , .(count = sum(!is.na(segment.start))), by = .(chr, pos, pos2)
][, pos2 := NULL]

colnames(newdepthcounts) <- c("CHR","BP","COUNT")

write.table(newdepthcounts, "", quote=F, row.names=F, col.names=F, sep=" ")
# Count, at each bim site, the number of overlapping segments (GenomicRanges implementation)
args = commandArgs(trailingOnly=TRUE)

# Check for correct number of arguments
if (length(args)<2) {
  stop("At least two arguments must be supplied", call.=FALSE)
} else if (length(args)>2) {
  stop("Too many arguments", call.=FALSE)
}

# Define functions
pkgTest <- function(x){
  if (!require(x, character.only = TRUE, quietly = T)){
    install.packages(x, dep=TRUE)
    if(!require(x, character.only = TRUE, quietly = T)) stop("Package not found")
  }
}

# Call packages
pkgTest("data.table")
pkgTest("GenomicRanges")

# Read in data
options(stringsAsFactors=F)
fileName <- unlist(strsplit(args[1], split=".", fixed=TRUE))
ext <- fileName[length(fileName)]
if (ext == "hom") {
  seg.file <- fread(args[1], header=T)
  colnames(seg.file)[4] <- "chr"
  colnames(seg.file)[7:8] <- c("segment.start", "segment.end")
} else if (ext == "ibd") {
  seg.file <- fread(args[1], header=F)
  colnames(seg.file)[5:7] <- c("chr", "segment.start", "segment.end")
} else if (ext == "match") {
  seg.file <- fread(args[1], header=F)
  if (ncol(seg.file)==13) {
    colnames(seg.file)[3:5] <- c("chr", "segment.start", "segment.end")
  } else if (ncol(seg.file)==15) {
    colnames(seg.file)[5:7] <- c("chr", "segment.start", "segment.end")
  }
}

bim <- fread(args[2])
colnames(bim)[1:4] <- c("chr","rsid","dist","start")
bim[, end := start]

# set key columns for data.tables
setkey(seg.file, chr, segment.start, segment.end)
setkey(bim, chr, start, end)

# calculate depth counts
# assign data.tables to GRanges objects
GRseg <- makeGRangesFromDataFrame(seg.file)
GRbim <- makeGRangesFromDataFrame(bim)

# make new dataframe combining bim site info with counts
newdepthcounts <- as.data.frame(bim[,c("chr","start")])

# count number of segments that overlap sites in bim file, assign to new column
newdepthcounts$count <- countOverlaps(GRbim, GRseg, type = "any")
colnames(newdepthcounts) <- c("CHR","BP","COUNT") # assign final column names

write.table(newdepthcounts, "", quote=F, row.names=F, col.names=F, sep=" ") # write to stdout
#!/usr/bin/env Rscript
# Join RoH/GARLIC segments separated by gaps that fall in low-SNP-density regions

if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
if (!requireNamespace("GenomicRanges", quietly = TRUE))
  BiocManager::install("GenomicRanges")

args = commandArgs(trailingOnly=TRUE)

# Defaults
perc.thresh <- 0.05
perc.overlap <- 0.8

# Error handling and updating default values
if (length(args)<3) {
  stop("At least three arguments must be supplied", call.=FALSE)
} else if (length(args)==4) {
  perc.thresh <- as.numeric(args[4])
} else if (length(args)==5) {
  perc.thresh <- as.numeric(args[4])
  perc.overlap <- as.numeric(args[5])
} else if (length(args)>5) {
  stop("Too many arguments", call.=FALSE)
}

options(stringsAsFactors=F)

Bim <- read.table(args[1])
WinSize <- as.numeric(args[3])

# Determine type of input segment file
cat("Determining input file type... ")
fileNameSplit <- strsplit(args[2], split=".", fixed=T)[[1]]
ext.ind <- match("hom", fileNameSplit)
if (length(fileNameSplit) > 1 & !is.na(ext.ind)) {
  stem <- paste(fileNameSplit[-ext.ind], collapse=".")
  ext <- "hom"
} else {
  stem <- args[2]
  ext <- "garlic"
}

if (ext=="hom") {
  Seg <- read.table(args[2], header=T)
  if (ncol(Seg)==13) {
    fileType <- "PLINK"
    colnames(Seg)[9] <- "LENGTH"
  }
  cat(fileType, "file type detected\n")
} else {
  Seg <- read.table(args[2], skip=1)
  Seg <- cbind(Seg, rep("placeholder_ID", nrow(Seg)))
  if (ncol(Seg)==10) {
    fileType <- "GARLIC"
    cat("WARNING: GARLIC file type detected - header line dropped\n")
    colnames(Seg)[c(1,2,3,4,5,10)] <- c("CHR","POS1","POS2","CLASS","LENGTH","IID")
    for (i in 1:22) {
      Seg[which(Seg$CHR==paste("chr", i, sep="")), "CHR"] <- i
    }
    Seg$CHR <- as.numeric(Seg$CHR)
    class.ranges <- list(); length(class.ranges) <- 3; names(class.ranges) <- c("A","B","C")
    for (class in c("A","B","C")) {
      class.ranges[[class]] <- range(as.numeric(Seg[which(Seg$CLASS==class), "LENGTH"]))
    }
    # Set class boundaries at the midpoints between adjacent class length ranges
    mid.AB <- mean(c(class.ranges[["A"]][2], class.ranges[["B"]][1]))
    class.ranges[["A"]][2] <- mid.AB
    class.ranges[["B"]][1] <- mid.AB + 0.5
    mid.BC <- mean(c(class.ranges[["B"]][2], class.ranges[["C"]][1]))
    class.ranges[["B"]][2] <- mid.BC
    class.ranges[["C"]][1] <- mid.BC + 0.5
  }
}

# Create copy of input file - eventual output file
Seg.new <- Seg

genome.SNP.count <- c()
chr.hists <- list()
length(chr.hists) <- length(unique(Bim$V1))
names(chr.hists) <- unique(Bim$V1)
for (chr in unique(Bim$V1)) {
  chr.data <- Bim[Bim$V1==chr,]
  start <- chr.data[1, "V4"]
  end <- chr.data[nrow(chr.data), "V4"]
  chr.hists[[as.character(chr)]] <- hist(chr.data$V4, breaks=seq(start, end+WinSize, WinSize), plot=F)
  genome.SNP.count <- c(genome.SNP.count, chr.hists[[as.character(chr)]]$counts)
}
min.SNPs <- max(sort(genome.SNP.count)[length(genome.SNP.count)*perc.thresh], 1/5e4*WinSize)

suppressMessages(library("GenomicRanges"))

for (chr in sort(unique(Seg$CHR))) {
  cat("Starting chr", chr, "\n")

  # Calculate SNP density across the chromosome in windows
  Hist <- chr.hists[[as.character(chr)]]
  Counts <- Hist$counts
  Breaks <- Hist$breaks

  # Store count information per region
  if (fileType=="PLINK") {
    chrWinInfo <- as.data.frame(matrix(ncol=4, nrow=length(Counts)))
    colnames(chrWinInfo) <- c("chr", "start", "end", "nSNPs")
  } else if (fileType=="GARLIC") {
    chrWinInfo <- as.data.frame(matrix(ncol=3, nrow=length(Counts)))
    colnames(chrWinInfo) <- c("chr", "start", "end")
  }
  chrWinInfo$chr <- chr
  chrWinInfo$start <- Breaks[1:length(Counts)]
  chrWinInfo$end <- Breaks[2:length(Breaks)]
  chrWinInfo$nSNPs <- Counts

  # Find low density regions
  low.dens.wins <- chrWinInfo[which(chrWinInfo$nSNPs < min.SNPs), 1:3]

  # If low density regions are found, join contiguous windows into regions
  if (nrow(low.dens.wins) > 0) {
    Low.Density.Regions <- low.dens.wins[1, ]
    if (nrow(low.dens.wins) > 1) {
      for (x in 2:nrow(low.dens.wins)) {
        if (low.dens.wins[x, "start"] == Low.Density.Regions[nrow(Low.Density.Regions), "end"]) {
          Low.Density.Regions[nrow(Low.Density.Regions), "end"] <- low.dens.wins[x, "end"]
        } else {
          Low.Density.Regions[nrow(Low.Density.Regions)+1, ] <- low.dens.wins[x, ]
        }
      }
    }
    Low.Density.Ranges <- makeGRangesFromDataFrame(Low.Density.Regions)

    # Join segments in input file
    for (Ind in unique(Seg$IID)) {
      chrSegInd <- Seg[Seg$CHR==chr & Seg$IID==Ind,]
      chrSegInd <- chrSegInd[order(chrSegInd$POS1),]
      if (nrow(chrSegInd)>1) {
        RoH.gaps <- as.data.frame(matrix(ncol=3, nrow=0))
        colnames(RoH.gaps) <- c("chr","start","end")
        for (y in 2:max(2, nrow(chrSegInd))) {
          RoH.gaps[y-1, ] <- c(chr, chrSegInd[y-1, "POS2"], chrSegInd[y, "POS1"])
        }
        RoH.gaps <- RoH.gaps[which(RoH.gaps$end > RoH.gaps$start),]
        if (nrow(RoH.gaps) > 0) {
          RoH.gap.Ranges <- makeGRangesFromDataFrame(RoH.gaps)

          # Do any gaps overlap a low density region?
          Hits <- findOverlaps(RoH.gap.Ranges, Low.Density.Ranges) # find all hits
          Overlaps <- pintersect(Low.Density.Ranges[subjectHits(Hits)], RoH.gap.Ranges[queryHits(Hits)])
          percentOverlap <- width(Overlaps) / width(RoH.gap.Ranges[queryHits(Hits)]) # calculate overlap
          Hits <- Hits[percentOverlap > perc.overlap] # sufficient overlap?

          joinOverGaps <- RoH.gaps[queryHits(Hits),]
          if (nrow(joinOverGaps)>0) {
            for (z in 1:nrow(joinOverGaps)) {
              seg1 <- rownames(chrSegInd[which(chrSegInd$POS2 == joinOverGaps[z, "start"]),])[1]
              seg2 <- rownames(chrSegInd[which(chrSegInd$POS1 == joinOverGaps[z, "end"]),])[1]
              joinedSeg <- chrSegInd[as.character(seg1),]
              joinedSeg$POS2 <- chrSegInd[as.character(seg2),"POS2"]
              joinedSeg$LENGTH <- chrSegInd[as.character(seg1),"LENGTH"] + chrSegInd[as.character(seg2),"LENGTH"]
              if (fileType=="PLINK") {
                joinedSeg$SNP2 <- chrSegInd[as.character(seg2),"SNP2"]
                joinedSeg$NSNP <- chrSegInd[as.character(seg1),"NSNP"] + chrSegInd[as.character(seg2),"NSNP"]
              } else if (fileType=="GARLIC") {
                if (joinedSeg$LENGTH < class.ranges[["A"]][1]) {
                  joinedSeg$CLASS <- "A"
                } else if (joinedSeg$LENGTH >= class.ranges[["A"]][1] & joinedSeg$LENGTH <= class.ranges[["A"]][2]) {
                  joinedSeg$CLASS <- "A"
                } else if (joinedSeg$LENGTH >= class.ranges[["B"]][1] & joinedSeg$LENGTH <= class.ranges[["B"]][2]) {
                  joinedSeg$CLASS <- "B"
                } else if (joinedSeg$LENGTH >= class.ranges[["C"]][1] & joinedSeg$LENGTH <= class.ranges[["C"]][2]) {
                  joinedSeg$CLASS <- "C"
                } else if (joinedSeg$LENGTH > class.ranges[["C"]][2]) {
                  joinedSeg$CLASS <- "C"
                }
              }
              Seg.new <- rbind(Seg.new, joinedSeg)
              # Remove the two split segments from the Seg.new object
              Seg.new <- Seg.new[-match(as.character(seg1), rownames(Seg.new)),]
              Seg.new <- Seg.new[-match(as.character(seg2), rownames(Seg.new)),]
            }
          }
        }
      }
    }
  }
}

if (fileType=="PLINK") {
  colnames(Seg.new)[9] <- "KB"
  write.table(Seg.new, paste(stem, "_segsJoined.hom", sep=""), quote=F, row.names=F, col.names=T)
} else if (fileType=="GARLIC") {
  Seg.new <- Seg.new[,-ncol(Seg.new)]
  Seg.new$CHR <- paste("chr", Seg.new$CHR, sep="")
  write.table(Seg.new, paste(stem, "_segsJoined.garlic", sep=""), quote=F, row.names=F, col.names=F, sep="\t")
}
INPUT=$1
MAP=$2
LOG=$3
OUTPUT=$4
REF_HAP=$5
REF_LEG=$6
REF_SAMP=$7

# ShapeIT phasing and conversion to plink files
#----------------------------------------------

# Divide plink file by chromosome, trim SNPs that lie outside the boundaries of the genetic map
Rscript scripts/trim_bimfile.R ${INPUT} ${MAP}
plink --bfile ${INPUT} --exclude ${INPUT}_removeSNPs --make-bed --out ${INPUT}_endsTrimmed
#cat ${FILE}.chr${chr}_endsTrimmed.bim >> dat_QCed_endsTrimmed.bim

# Run SHAPEIT
shapeit -B ${INPUT}_endsTrimmed -M ${MAP} --input-ref ${REF_HAP} ${REF_LEG} ${REF_SAMP} --duohmm -W 5 -O ${OUTPUT} --output-log ${LOG}.log -T 4

# If SHAPEIT reports strand/alignment problems (.snp.strand file), repair and re-run until none remain
files=( ${LOG}'.snp.strand' )
if (( ${#files[@]} )); then
  N=`ls -l ${LOG}.snp.strand 2>/dev/null | wc -l`
  while [ $N -gt 0 ]
  do
    grep Missing ${LOG}.snp.strand | cut -f4 > ${INPUT}_missing_from_ref.txt
    grep Strand ${LOG}.snp.strand | awk '$11==1 {print $4}' > ${INPUT}_flip_in_dataset.txt
    grep Strand ${LOG}.snp.strand | awk '$11==0 {print $4}' > ${INPUT}_cannot_flip_in_dataset.txt
    cat ${INPUT}_missing_from_ref.txt ${INPUT}_cannot_flip_in_dataset.txt > ${INPUT}_remove_from_dataset.txt
    rm -f ${LOG}.snp.strand*
    rm -f ${INPUT}_missing_from_ref.txt ${INPUT}_cannot_flip_in_dataset.txt
    plink --bfile ${INPUT}_endsTrimmed --flip ${INPUT}_flip_in_dataset.txt --exclude ${INPUT}_remove_from_dataset.txt --make-bed --out ${INPUT}_endsTrimmed_misalignRepaired
    shapeit -B ${INPUT}_endsTrimmed_misalignRepaired -M ${MAP} --input-ref ${REF_HAP} ${REF_LEG} ${REF_SAMP} --duohmm -W 5 -O ${OUTPUT} --output-log ${LOG}.log -T 4
    N=`ls -l ${LOG}.snp.strand 2>/dev/null | wc -l`
  done
fi
from numpy import *
from scipy.interpolate import interp1d
from sys import argv

# use, e.g.:
# python shapeit_to_germline.py shapeitoutputprefix geneticmapfile
# python shapeit_to_germline.py Ethiopians_chr21_SHAPEITphased_duoHMM genetic_map_chr21_combined_b37.txt

prefix = argv[-2]
hapsfile = prefix + '.haps'
samplefile = prefix + '.sample'
hapmapfile = argv[-1]

famdata = [line.strip().split() for line in file(samplefile)][2:]

# read in haplotypes
print 'reading in haps data...'
hapsdata = [line.strip().split() for line in file(hapsfile)]
print 'splitting haps data'
mapinfo = array([line[:3] for line in hapsdata])
variants = array([line[3:5] for line in hapsdata])
alleles = array([line[5:] for line in hapsdata], dtype=int)
#hapsdata = array([line.strip().split()[:3] for line in file(hapsfile)])
bps = array(mapinfo[:,2], dtype=int)

# read in hapmap file
mapdata = array([line.strip().split() for line in file(hapmapfile)][2:], dtype=float)
happos = mapdata[:,0]
hapcms = mapdata[:,2]

# now create interpolation function
f = interp1d(happos, hapcms, fill_value='extrapolate')
bpcms = f(bps)

# make map file
outmap = file(prefix + '.map', 'w')
print 'writing mapfile to %s' % (outmap)
for i, line in enumerate(mapinfo):
    outmap.write('%s\t%s\t%s\t%s\n' % (line[0], line[1], bpcms[i], line[2]))

# make ped file. More annoying for sure!
# brute force for now, hopefully it runs okay (not huge files)
outped = file(prefix + '.ped', 'w')
print 'writing pedfile to %s' % (outped)
alleles_acgt = array([variants[i][alleles[i]] for i in range(len(variants))])
for i, line in enumerate(famdata):
    if i % 20 == 0:
        print 'on individual %s' % (i)
    # zip together haplotypes 2*i and 2*i+1 so they're on the same line
    plinkalleles = array(ndarray.flatten(array(zip(alleles_acgt[:,2*i], alleles_acgt[:,2*i+1]))), dtype='S1')
    outped.write('%s %s\n' % (' '.join(line[:2] + line[3:]), ' '.join(plinkalleles)))

outmap.close()
outped.close()
print 'done'
shell:
    """
    plink --bfile data/{wildcards.dataset} --chr {wildcards.chrnum} --make-bed --out {params}
    """
shell:
    """
    bash scripts/shapeit_iterate.sh {params.in_pre} {params.inputmap} {params.log_pre} {params.out_pre} {REF}_chr{wildcards.chrnum}.hap.gz {REF}_chr{wildcards.chrnum}.legend.gz {REF}.sample
    """
shell:
    """
    python scripts/shapeit_to_germline.py {params.prefix} {params.gmap}
    """
shell:
    """
    shapeit -convert --input-haps {params.in_pre} --output-vcf {output.vcf} --output-log {output.log}
    """
shell:
    """
    sed -n -e '/^#/ p' prepare/chroms_phased_vcf/{wildcards.dataset}.chr1.phased.vcf > {output} # initialize allchr.vcf with headers
    for i in {input}; do sed '/^#/d' $i >> {output} ; done
    """
shell:
    """
    plink --vcf {input} --recode --out {params}
    """
shell:
    """
    plink --file {params.in_pre} --cm-map {params.gmap} --recode --out {params.out_pre}
    """
shell:
    """
    plink --vcf {input} --make-just-bim --out {params.out_pre}
    """
shell:
    """
    Rscript scripts/find_lowDensityRegs.R {input} {params} 1e6 0.05
    """
shell:
    """
    gawk ' {{ t = $1; $1 = $2; $2 = t; print; }} ' {input} > {output}
    """
shell:
    """
    GERMLINE2 -mapfile {input.map} -pedfile {input.ped} -outfile {params.prefix} -err_hom {wildcards.err} -bits {wildcards.bits} -w_extend -haploid
    """
shell:
    """
    scripts/comb_match_files.sh {DATASET}
    """
shell:
    """
    awk '{{print $1 , "0" , $2 , "0" , $3 , $4 , $5, "20"}}' {input} > {output}
    """
shell:
    """
    cat {input.ibd} | java -jar progs/merge-ibd-segments.16May19.ad5.jar {input.vcf} {input.map} 0.6 1 | awk '{{print $1, $2, $3, $4, $5, $6, $7, $9}}' > {output}
    """
shell:
    """
    Rscript scripts/get_seg_depthv3.R {input.ibd} {input.bim} > {output}
    """
shell:
    """
    Rscript scripts/get_outlier_SNPs.R {input} 3 > {output}
    """
shell:
    """
    bash scripts/define_regions.sh {input} {DATASET}
    """
shell:
    """
    Rscript scripts/filter_RoH-IBD_segs.R {input.ibd} {input.txt} 0.85 > {output}
    """
shell:
    """
    plink --bfile {params.in_pre} --homozyg --homozyg-snp {N_SNPS} --homozyg-window-missing {MISSING} --homozyg-window-het {HET} --homozyg-kb {RoH_LEN} --out {params.out_pre}
    """
shell:
    """
    Rscript scripts/join_RoHsegs.R {input.bim} {input.hom} 1e6 0.05 0.8
    """
shell:
    """
    Rscript scripts/find_lowDensityRegs.R {input} {params.out_pre} 1e6 0.05
    cat regions/exclude_regions_hg19.txt {output.ldr} > {output.reg}
    """
shell:
    """
    Rscript scripts/get_seg_depthv2.R {input.hom} {input.bim} > {output}
    """
shell:
    """
    Rscript scripts/get_outlier_SNPs.R {input.roh} > {output.outliers}
    Rscript scripts/get_outlier_regions.R {output.outliers} > {output.regions}
    cat {input.txt} {output.regions} > {output.txt}
    """
shell:
    """
    Rscript scripts/filter_RoH-IBD_segs.R {input.hom} {input.txt} 0.85 > {output}
    """
shell:
    """
    bash scripts/bits_matching.sh {DATASET}
    """
shell:
    """
    plink --bfile {params.pre_one} --maf {MAF} --make-bed --out {params.pre_two}_HWE-MAFfiltered
    plink --bfile {params.pre_two}_HWE-MAFfiltered --genome --out {params.pre_two}
    """
shell:
    """
    bash scripts/calculate_concordance.sh {DATASET}
    """
shell:
    """
    bash scripts/choose_params.sh {input.ibd} {input.norm} {wildcards.dataset}
    """
shell:
    """
    cat {input.ibd} | java -jar progs/ibdne.04Sep15.e78.jar map={input.map} out=results/IBDne_output/{wildcards.dataset}_GERMLINE2_IBDNe_minibd-{cM} minibd={cM} nthreads={IBDne_THREADs} nboots=80 gmax={GMAX}
    """