snake-IBDne
Repository for running a custom Snakemake pipeline for IBDNe.
Citation for the programs used in the pipeline (merge-ibd-segments.16May19.ad5.jar and ibdne.04Sep15.e78.jar):
B L Browning and S R Browning (2013). Improving the accuracy and efficiency of identity-by-descent detection in population data. Genetics 194(2):459-71. doi:10.1534/genetics.113.150029
Running snake-IBDne
Project Folder
Please run each dataset in its own separate folder for clarity. The Henn lab project directory is:
/share/hennlab/projects/snake-IBDNe/
Required software:
- snakemake
- shapeit v2.r904
- R 3.6.1
- plink 1.90p
- GERMLINE2
- ibdne.04Sep15.e78.jar (download from here)
- merge-ibd-segments.16May19.ad5.jar
Setting up conda environment
- For Henn lab users: run this command every time before you use conda:
source /share/hennlab/progs/miniconda3/etc/profile.d/conda.sh
- To create the environment (first time only; on later runs skip straight to activation):
conda create -n IBDne-env r-base shapeit plink scipy python=2.7 r-biocmanager
- To activate the environment (every time you run the pipeline):
conda activate IBDne-env
- To install the required R packages (once only, when setting up the environment):
conda install -c bioconda bioconductor-genomicranges
conda install -c r r-data.table
- Once inside the environment, you must manually add GERMLINE2 to your path (before each run of the pipeline):
export PATH="/share/hennlab/progs/GERMLINE2-master:$PATH"
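In short, the per-run setup for Henn lab users boils down to three commands (one-time environment creation and R-package installation omitted):

source /share/hennlab/progs/miniconda3/etc/profile.d/conda.sh
conda activate IBDne-env
export PATH="/share/hennlab/progs/GERMLINE2-master:$PATH"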
Required files in working directory: (Please note - you must run each population in its own directory. An example layout is sketched after this list.)
- data folder with input data:
  - if data is NOT phased yet: the data folder must contain bim, bed, fam files named in the format {dataset}.bim/.bed/.fam
  - if data IS phased already: the data folder must contain plink .haps, .sample files for individual chromosomes named in the format {dataset}.chr{chrnum}.phased.haps/.sample
- regions folder with exclude_regions_hg19.txt
- scripts folder with the following required scripts:
  - bits_matching.sh
  - calc_genome_length.R
  - calculate_concordance.sh
  - choose_params.sh
  - combine_chrom.sh
  - comb_match_files.sh
  - compare_IBD_kinship.R
  - compare_RoH.R
  - define_regions.sh
  - filter_RoH-IBD_segs.R
  - find_lowDensityRegs.R
  - get_outlier_regions.R
  - get_outlier_SNPs.R
  - get_seg_depthv3.R
  - join_RoHsegs.R
  - shapeit_to_germline.py
  - trim_bimfile.R
  - shapeit_iterate.sh
- progs folder with the following java programs:
  - merge-ibd-segments.16May19.ad5.jar
  - ibdne.07May18.6a4.jar
- Snakefile
  - the file entitled "ibdne_2versions.smk" is an alternative Snakefile for comparing different versions of IBDne. Please use "Snakefile" for regular runs of the pipeline.
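For illustration, a working directory for a dataset named {dataset} would look roughly like this (script list abbreviated; see above):

working_dir/
├── Snakefile
├── ibdne_2versions.smk                    (optional, for version comparisons)
├── data/
│   └── {dataset}.bim/.bed/.fam            # or {dataset}.chr{chrnum}.phased.haps/.sample if pre-phased
├── regions/
│   └── exclude_regions_hg19.txt
├── scripts/
│   ├── bits_matching.sh
│   ├── calc_genome_length.R
│   └── ...                                # remaining scripts from the list above
└── progs/
    ├── merge-ibd-segments.16May19.ad5.jar
    └── ibdne.07May18.6a4.jar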
How to run:
The snakemake command must be run with four config parameters:
- dataset: filename of the initial QC'ed bim/bed/fam files (without file extension). Example: if the files are named "Himba_merged.bim" etc., set dataset=Himba_merged
- phased: FALSE if the input data is not phased, TRUE if it is already phased.
- gmap_chr_dir: directory containing plink-format recombination maps for separate chromosomes.
  - filenames MUST be named in the format chr{chrnum}.gmap.txt
  - remember to put a / at the end of the path, for example: gmap_chr_dir=/share/hennlab/reference/snake-IBDne_maps/
- ref: file prefix of the reference population 1000G .hap.gz, .legend, and .sample files, excluding the _chr{chrnum} portion.
  - example: if the files are titled 1000GP_Phase3/1000GP_Phase3_chr6.hap.gz, 1000GP_Phase3/1000GP_Phase3_chr6.legend, and 1000GP_Phase3/1000GP_Phase3.sample, provide the command-line argument as ref=1000GP_Phase3/1000GP_Phase3
  - note that the .hap.gz and .legend files must be named with _chr{chrnum}.hap.gz and _chr{chrnum}.legend following the prefix you provide in the ref option
  - the reference data can be downloaded from https://mathgen.stats.ox.ac.uk/impute/1000GP_Phase3.html
  - for Henn lab users, these files are located at /share/hennlab/reference/1000G_Phase3_haps-sample-legend/

A quick file-naming check is sketched after this list.
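Before launching, it can help to confirm that the map and reference files follow the naming conventions above. A minimal check, assuming the Henn lab paths as an example (substitute your own; legend files follow the same _chr{chrnum} pattern):

GMAP_CHR_DIR=/share/hennlab/reference/snake-IBDne_maps/
REF=/share/hennlab/reference/1000G_Phase3_haps-sample-legend/1000GP_Phase3/1000GP_Phase3
for chr in $(seq 1 22); do
    [ -f ${GMAP_CHR_DIR}chr${chr}.gmap.txt ] || echo "missing recombination map: chr${chr}.gmap.txt"
    [ -f ${REF}_chr${chr}.hap.gz ] || echo "missing reference haplotypes: ${REF}_chr${chr}.hap.gz"
done
[ -f ${REF}.sample ] || echo "missing reference sample file: ${REF}.sample"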
Example:

# Dry run - testing workflow
/share/hennlab/progs/miniconda3/bin/snakemake -n --config dataset=xal phased=TRUE ref=/share/hennlab/reference/1000G_Phase3_haps-sample-legend/1000GP_Phase3/1000GP_Phase3 gmap_chr_dir=/share/hennlab/projects/snake-IBDNe/austin_files/ -p -j 10

# Generate DAG of pipeline
/share/hennlab/progs/miniconda3/bin/snakemake --config dataset=xal phased=TRUE ref=/share/hennlab/reference/1000G_Phase3_haps-sample-legend/1000GP_Phase3/1000GP_Phase3 gmap_chr_dir=/share/hennlab/projects/snake-IBDNe/austin_files/ --rulegraph | dot -Tpng > rulegraph.png

# Run the pipeline!
/share/hennlab/progs/miniconda3/bin/snakemake --config dataset=xal phased=TRUE ref=/share/hennlab/reference/1000G_Phase3_haps-sample-legend/1000GP_Phase3/1000GP_Phase3 gmap_chr_dir=/share/hennlab/projects/Xal_snake-IBDne/austin_files/ -p -j 20
Pipeline Overview
Running ibdne_2versions.smk
This Snakefile outputs results from two versions of IBDne (ibdne.23Apr20.ae9.jar and ibdne.04Sep15.e78.jar) for comparison. To run this version, add the flag -s ibdne_2versions.smk to the snakemake command, as in the sketch below.
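For example, reusing the config values from the example run above (a sketch; substitute your own dataset and paths):

/share/hennlab/progs/miniconda3/bin/snakemake -s ibdne_2versions.smk --config dataset=xal phased=TRUE ref=/share/hennlab/reference/1000G_Phase3_haps-sample-legend/1000GP_Phase3/1000GP_Phase3 gmap_chr_dir=/share/hennlab/projects/snake-IBDNe/austin_files/ -p -j 20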
Code Snippets
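The snippets below are excerpts from the pipeline: first the helper shell and R scripts from the scripts folder, then the shell directives of individual Snakefile rules. They run automatically as part of the pipeline and are reproduced here for reference.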
# For each GERMLINE2 parameter combination (bits x err_hom), compare IBD calls against PLINK RoH calls (segments >= 3500 kb)
DATASET=$1
BITS_RANGE=`seq 5 5 200`
ERRHOM_RANGE=`seq 1 2`

echo Pop hap/notHap bits err_hom err_het diff/total match/total > results/other/${DATASET}_GL_IBD_3500kb.txt

for BITS in ${BITS_RANGE[@]}
do
    for ERRHOM in ${ERRHOM_RANGE[@]}
    do
        grep -i $DATASET results/RoH_param_combs/${BITS}/IBD_${BITS}_${ERRHOM}_hap_segsJoined_QCed.ibd > results/RoH_param_combs/${BITS}/${DATASET}_IBD_${BITS}_${ERRHOM}_hap_segsJoined_QCed.ibd
        echo $DATASET hap $BITS $ERRHOM NA `Rscript scripts/compare_RoH.R results/calc_RoH/${DATASET}_RoH_segsJoined_QCed.hom results/RoH_param_combs/${BITS}/IBD_${BITS}_${ERRHOM}_hap_segsJoined_QCed.ibd 3500` >> results/other/${DATASET}_GL_IBD_3500kb.txt
    done
done
# For each parameter combination, compute the normalized RMSE between IBD-based kinship and plink --genome estimates
DATASET=$1
BITS_RANGE=`seq 5 5 200`
ERRHOM_RANGE=`seq 1 2`

echo 'BITS ERRHOM ERRHET NRMSE' > results/other/${DATASET}_NormRMSE.txt
GENOME_LENGTH=`Rscript scripts/calc_genome_length.R results/other/${DATASET}_HWE-MAFfiltered.bim`

for BITS in ${BITS_RANGE[@]}
do
    for ERRHOM in ${ERRHOM_RANGE[@]}
    do
        echo $BITS $ERRHOM NA `Rscript scripts/compare_IBD_kinship.R results/RoH_param_combs/${BITS}/IBD_${BITS}_${ERRHOM}_hap_segsJoined_QCed.ibd results/other/${DATASET}.genome $GENOME_LENGTH` >> results/other/${DATASET}_NormRMSE.txt
    done
done
# Rank parameter combinations by combined score and report the best GERMLINE2 parameters
IBD_3500=$1
NormRMSE=$2
P=$3

paste -d" " <(awk '{print $0}' $IBD_3500) <(awk '{print $4}' $NormRMSE) > temp
awk '{print $1, $2, $3, $4, $5, $6, $7, $8, (1-$6 + $7 + 1-$8)/3}' temp > temp2
sed -r '1s/\S+//9' temp2 > temp
sed -i '1s/$/score/' temp

# Look only at haploid
mv temp ${P}_GL_IBD_3500kb_onlyHap.txt; rm temp2

head -n 1 ${P}_GL_IBD_3500kb_onlyHap.txt > header.txt
sed '1d' ${P}_GL_IBD_3500kb_onlyHap.txt > data.txt
sort -nk9 data.txt > sorted_data.txt ; rm data.txt
cat header.txt sorted_data.txt > ${P}_GL_IBD_3500kb_onlyHap.txt ; rm header.txt sorted_data.txt

cat ${P}_GL_IBD_3500kb_onlyHap.txt

PARAMS=`tail -n 1 ${P}_GL_IBD_3500kb_onlyHap.txt | cut -f 2-9 -d" "`
PARAMS=(${PARAMS//,/ })
echo $PARAMS

TYPE=${PARAMS[0]}
BITS=${PARAMS[1]}
ERRHOM=${PARAMS[2]}
PERC_DIFF=${PARAMS[4]}
PERC_MATCH=${PARAMS[5]}
KIN_RESIDS=${PARAMS[6]}
SCORE=${PARAMS[7]}

echo -e $P GERMLINE2 parameters"\n"Type: ${TYPE}"\n"Bits: ${BITS}"\n"Err_hom: ${ERRHOM}"\n"Percent difference: ${PERC_DIFF}"\n"Percent match: ${PERC_MATCH}"\n"Kin residuals: ${KIN_RESIDS}"\n"Score: ${SCORE} > results/other/${P}_GL_parameters.txt

# awk -v Pop="$P" -v IGNORECASE=1 '$1 ~ Pop && $3 ~ Pop {print}' RoH_param_combs/${BITS}/IBD_${BITS}_${ERRHOM}_hap_segsJoined_QCed.ibd | cut -f1-7,9 > ${P}_only_QCed.ibd
cat results/RoH_param_combs/${BITS}/IBD_${BITS}_${ERRHOM}_hap_segsJoined_QCed.ibd > results/other/${P}_only_QCed.ibd
# Concatenate per-chromosome GERMLINE2 .match files for each parameter combination
BITS_RANGE=`seq 5 5 200` # GERMLINE2 IBD segment seed
ERRHOM_RANGE=`seq 1 2`
CHRS=`seq 1 22`
DATASET=$1

for BITS in ${BITS_RANGE[@]}
do
    for ERRHOM in ${ERRHOM_RANGE[@]}
    do
        for chr in ${CHRS[@]}
        do
            cat results/RoH_param_combs/${BITS}/${DATASET}.chr${chr}.phased_GERMLINE2_RoH_${BITS}_${ERRHOM}_hap.match >> results/RoH_param_combs/${BITS}/IBD_${BITS}_${ERRHOM}_hap.match
        done
    done
done
# Build the exclude-regions file for IBDNe, adding IBD-depth outlier regions when any were found
FILE=$1
DATASET=$2

BITS=$(echo $FILE | cut -d '_' -f5)
ERRHOM=$(echo $FILE | cut -d '_' -f6)

N=`wc -l $FILE | cut -f1 -d" "`
if [ $N -gt 0 ]
then
    Rscript scripts/get_outlier_regions.R results/RoH_param_combs/${BITS}/GERMLINE2_IBDdepth_${BITS}_${ERRHOM}_hap_IBDoutliers 3e6 2.5e5 > results/RoH_param_combs/${BITS}/GERMLINE2_IBDdepth_${BITS}_${ERRHOM}_hap_IBDoutliers_regions
    cat regions/exclude_regions_hg19.txt results/RoH_param_combs/${BITS}/GERMLINE2_IBDdepth_${BITS}_${ERRHOM}_hap_IBDoutliers_regions results/calc_RoH/${DATASET}_lowDensityRegions.txt > results/RoH_param_combs/${BITS}/exclude_regions_hg19_forIBDNe_${BITS}_${ERRHOM}_hap.txt
else
    cat regions/exclude_regions_hg19.txt results/calc_RoH/${DATASET}_lowDensityRegions.txt > results/RoH_param_combs/${BITS}/exclude_regions_hg19_forIBDNe_${BITS}_${ERRHOM}_hap.txt
fi
# Remove RoH/IBD segments that fall almost entirely inside excluded regions
args = commandArgs(trailingOnly=TRUE)

# Defaults
perc.overlap <- 0.95

# Error handling and updating default values
if (length(args)<2) {
  stop("At least two arguments must be supplied", call.=FALSE)
} else if (length(args)==3) {
  perc.overlap <- as.numeric(args[3])
} else if (length(args)>3) {
  stop("Too many arguments", call.=FALSE)
}

options(stringsAsFactors=F)

fileNameSplit <- strsplit(args[1], split=".", fixed=T)[[1]]
ext.ind <- which(fileNameSplit %in% c("match", "hom", "ibd"))
stem <- paste(fileNameSplit[-ext.ind], collapse=".")
ext <- fileNameSplit[ext.ind]

if (ext=="hom") {
  seg.file <- read.table(args[1], header=T)
  colnames(seg.file)[7:8] <- c("start", "end")
} else if (ext=="match") {
  seg.file <- read.table(args[1])
  if (ncol(seg.file)==13) {
    colnames(seg.file)[3:5] <- c("CHR", "start", "end")
  } else if (ncol(seg.file)==15) {
    colnames(seg.file)[5:7] <- c("CHR", "start", "end")
  }
} else if (ext=="ibd") {
  seg.file <- read.table(args[1])
  colnames(seg.file)[5:7] <- c("CHR", "start", "end")
}

excl.regs <- read.table(args[2])
colnames(excl.regs) <- c("CHR", "start", "end", "type")

suppressMessages(library("GenomicRanges"))

Excl.Regs <- makeGRangesFromDataFrame(excl.regs[,c("CHR", "start", "end")], seqnames.field="CHR")
Seg.File <- makeGRangesFromDataFrame(seg.file[,c("CHR", "start", "end")], seqnames.field="CHR")

Hits <- findOverlaps(Seg.File, Excl.Regs) # find all hits
Overlaps <- pintersect(Seg.File[queryHits(Hits)], Excl.Regs[subjectHits(Hits)])
percentOverlap <- width(Overlaps)/width(Seg.File[queryHits(Hits)])
HighOverlapHits <- Hits[percentOverlap > perc.overlap] # Which have high overlap? Some will be removed

# Guard against an empty hit set: a negative empty index would drop every row
if (length(HighOverlapHits) > 0) {
  new.seg.file <- seg.file[-queryHits(HighOverlapHits),]
} else {
  new.seg.file <- seg.file
}

if (ext=="hom") {
  COLNAMES <- c(colnames(new.seg.file)[1:6], "POS1", "POS2", colnames(new.seg.file)[9:13])
  write.table(new.seg.file, "", row.names=F, col.names=COLNAMES, quote=F, sep="\t")
} else {
  write.table(new.seg.file, "", row.names=F, col.names=F, quote=F, sep="\t")
}
# Identify low-SNP-density regions of the genome from a .bim file
args = commandArgs(trailingOnly=TRUE)

# Defaults
perc.thresh <- 0.05

# Error handling and updating default values
if (length(args)<3) {
  stop("At least three arguments must be supplied", call.=FALSE)
} else if (length(args)==4) {
  perc.thresh <- as.numeric(args[4])
} else if (length(args)>4) {
  stop("Too many arguments", call.=FALSE)
}

options(stringsAsFactors=F)

Bim <- read.table(args[1])
WinSize <- as.numeric(args[3])

fileNameSplit <- strsplit(args[1], split=".", fixed=T)[[1]]
ext.ind <- match("bim", fileNameSplit)
stem <- paste(fileNameSplit[-ext.ind], collapse=".")

genome.LDRs <- as.data.frame(matrix(ncol=3, nrow=0))

for (chr in unique(Bim$V1)) {
  # Calculate SNP density across the chromosome in windows
  chr.data <- Bim[Bim$V1==chr,]
  start <- chr.data[1, "V4"]
  end <- chr.data[nrow(chr.data), "V4"]
  Hist <- hist(chr.data$V4, breaks=seq(start, end+WinSize, WinSize), plot=F)
  Counts <- Hist$counts
  Breaks <- Hist$breaks

  # Store count information per region
  chrWinInfo <- as.data.frame(matrix(ncol=4, nrow=length(Counts)))
  colnames(chrWinInfo) <- c("chr", "start", "end", "nSNPs")
  chrWinInfo$chr <- chr
  chrWinInfo$start <- Breaks[1:length(Counts)]
  chrWinInfo$end <- Breaks[2:length(Breaks)]
  chrWinInfo$nSNPs <- Counts

  # Find low density regions
  min.SNPs <- max(sort(Counts)[length(Counts)*perc.thresh], 1/5e4*WinSize)
  low.dens.wins <- chrWinInfo[which(chrWinInfo$nSNPs < min.SNPs), 1:3]

  # If low density regions are found, join contiguous windows into regions
  if (nrow(low.dens.wins) > 0) {
    Low.Density.Regions <- low.dens.wins[1, ]
    if (nrow(low.dens.wins) > 1) {
      for (x in 2:nrow(low.dens.wins)) {
        if (low.dens.wins[x, "start"] == Low.Density.Regions[nrow(Low.Density.Regions), "end"]) {
          Low.Density.Regions[nrow(Low.Density.Regions), "end"] <- low.dens.wins[x, "end"]
        } else {
          Low.Density.Regions[nrow(Low.Density.Regions)+1, ] <- low.dens.wins[x, ]
        }
      }
    }
    genome.LDRs <- rbind(genome.LDRs, Low.Density.Regions)
  }
}

genome.LDRs <- cbind(genome.LDRs, "low_density_region")
out_pre <- args[2]
write.table(genome.LDRs, paste(out_pre, "_lowDensityRegions.txt", sep=""), quote=F, row.names=F, col.names=F, sep="\t")
# Join outlier SNPs into outlier regions (+/- buffer around isolated SNPs; merged runs otherwise)
args = commandArgs(trailingOnly=TRUE)

# Defaults
maxRange <- 3e6
buffer <- 2e5

# Error handling and updating default values
if (length(args)<1) {
  stop("At least one argument must be supplied", call.=FALSE)
} else if (length(args)>3) {
  stop("Too many arguments", call.=FALSE)
}
if (length(args)>=2) {
  maxRange <- as.numeric(args[2])
}
if (length(args)==3) {
  buffer <- as.numeric(args[3])
}

FILE <- read.table(args[1], stringsAsFactors=F)

find.IBDoutlier.range <- function(pileup.outliers, maxRange, buffer) {
  outlier.regions <- as.data.frame(matrix(ncol=4, nrow=0))
  colnames(outlier.regions) <- c("CHR", "START", "END", "TYPE")

  # Split into chromosomes and sort the outlier values (going along the chromosome)
  chrs <- unique(pileup.outliers[,1])
  outliers.by.chr <- list()
  for (i in 1:length(chrs)) {
    outliers.by.chr[[i]] <- pileup.outliers[pileup.outliers[,1]==chrs[i],]
    outliers.by.chr[[i]] <- outliers.by.chr[[i]][order(outliers.by.chr[[i]][,2]),2]
  }

  for (i in 1:length(chrs)) {
    # If there is only one outlier SNP, define the region as +/- the buffer around that SNP
    if (length(outliers.by.chr[[i]])==1) {
      outlier.range <- c(outliers.by.chr[[i]][1] - buffer, outliers.by.chr[[i]][1] + buffer)
      outlier.regions[nrow(outlier.regions)+1,] <- c(chrs[i], outlier.range, "depth_outlier")
    }
    # If there is more than one outlier SNP, initiate a new region that begins with the first SNP
    if (length(outliers.by.chr[[i]])>1) {
      outlier.range <- c()
      outlier.range[1] <- outliers.by.chr[[i]][1] - buffer
      # Find endpoints of the region
      endpoint.inds <- which(outliers.by.chr[[i]][1:((length(outliers.by.chr[[i]]))-1)] - outliers.by.chr[[i]][-1] < -maxRange)
      endpoint.inds <- c(endpoint.inds, length(outliers.by.chr[[i]]))
      for (n in 1:length(endpoint.inds)) {
        if (n==length(endpoint.inds)) {
          outlier.range[2] <- outliers.by.chr[[i]][endpoint.inds[n]] + buffer
          outlier.regions[nrow(outlier.regions)+1,] <- c(chrs[i], outlier.range, "depth_outlier")
        } else {
          outlier.range[2] <- outliers.by.chr[[i]][endpoint.inds[n]] + buffer
          outlier.regions[nrow(outlier.regions)+1,] <- c(chrs[i], outlier.range, "depth_outlier")
          outlier.range[1] <- outliers.by.chr[[i]][endpoint.inds[n]+1] - buffer
        }
      }
    }
  }
  return(outlier.regions)
}

results <- find.IBDoutlier.range(FILE, maxRange, buffer)
write.table(results, "", quote=F, col.names=F, row.names=F, sep=" ")
# Flag sites whose segment depth exceeds the mean by n standard deviations
args = commandArgs(trailingOnly=TRUE)

n.sds = 3
if (length(args)<1) {
  stop("One argument must be supplied", call.=FALSE)
} else if (length(args)>2) {
  stop("Too many arguments", call.=FALSE)
} else if (length(args)==2) {
  n.sds = as.numeric(args[2])
}

depth.file <- read.table(args[1], sep=" ", header=F)

upper <- mean(depth.file[,3]) + (n.sds * sd(depth.file[,3]))
outliers <- depth.file[which(depth.file[,3] >= upper),]

write.table(outliers, "", quote=F, col.names=F, row.names=F)
# Count, at each bim site, the number of overlapping segments (data.table implementation)
args = commandArgs(trailingOnly=TRUE)

# Check for correct number of arguments
if (length(args)<2) {
  stop("At least two arguments must be supplied", call.=FALSE)
} else if (length(args)>2) {
  stop("Too many arguments", call.=FALSE)
}

# Define functions
pkgTest <- function(x){
  if (!require(x, character.only = TRUE)){
    install.packages(x, dep=TRUE)
    if(!require(x, character.only = TRUE)) stop("Package not found")
  }
}

# Call packages
pkgTest("data.table")

# Read in data
options(stringsAsFactors=F)
fileName <- unlist(strsplit(args[1], split=".", fixed=TRUE))
ext <- fileName[length(fileName)]
if (ext == "hom") {
  seg.file <- read.table(args[1], header=T)
  colnames(seg.file)[4] <- "chr"
  colnames(seg.file)[7:8] <- c("segment.start", "segment.end")
} else if (ext == "ibd") {
  seg.file <- read.table(args[1], header=F)
  colnames(seg.file)[5:7] <- c("chr", "segment.start", "segment.end")
} else if (ext == "match") {
  seg.file <- read.table(args[1], header=F)
  if (ncol(seg.file)==13) {
    colnames(seg.file)[3:5] <- c("chr", "segment.start", "segment.end")
  } else if (ncol(seg.file)==15) {
    colnames(seg.file)[5:7] <- c("chr", "segment.start", "segment.end")
  }
}

bim <- read.table(args[2])
colnames(bim)[1:4] <- c("chr","rsid","dist","pos")

setDT(seg.file)
setDT(bim)

bim[, pos2 := pos]

setkey(seg.file, chr, segment.start, segment.end)
setkey(bim, chr, pos, pos2)

newdepthcounts <- foverlaps(bim, seg.file)[
  , .(count = sum(!is.na(segment.start))), by = .(chr, pos, pos2)
][, pos2 := NULL]

colnames(newdepthcounts) <- c("CHR","BP","COUNT")

write.table(newdepthcounts, "", quote=F, row.names=F, col.names=F, sep=" ")
# Count, at each bim site, the number of overlapping segments (GenomicRanges implementation)
args = commandArgs(trailingOnly=TRUE)

# Check for correct number of arguments
if (length(args)<2) {
  stop("At least two arguments must be supplied", call.=FALSE)
} else if (length(args)>2) {
  stop("Too many arguments", call.=FALSE)
}

# Define functions
pkgTest <- function(x){
  if (!require(x, character.only = TRUE, quietly = T)){
    install.packages(x, dep=TRUE)
    if(!require(x, character.only = TRUE, quietly = T)) stop("Package not found")
  }
}

# Call packages
pkgTest("data.table")
pkgTest("GenomicRanges")

# Read in data
options(stringsAsFactors=F)
fileName <- unlist(strsplit(args[1], split=".", fixed=TRUE))
ext <- fileName[length(fileName)]
if (ext == "hom") {
  seg.file <- fread(args[1], header=T)
  colnames(seg.file)[4] <- "chr"
  colnames(seg.file)[7:8] <- c("segment.start", "segment.end")
} else if (ext == "ibd") {
  seg.file <- fread(args[1], header=F)
  colnames(seg.file)[5:7] <- c("chr", "segment.start", "segment.end")
} else if (ext == "match") {
  seg.file <- fread(args[1], header=F)
  if (ncol(seg.file)==13) {
    colnames(seg.file)[3:5] <- c("chr", "segment.start", "segment.end")
  } else if (ncol(seg.file)==15) {
    colnames(seg.file)[5:7] <- c("chr", "segment.start", "segment.end")
  }
}

bim <- fread(args[2])
colnames(bim)[1:4] <- c("chr","rsid","dist","start")
bim[, end := start]

# set key columns for data.tables
setkey(seg.file, chr, segment.start, segment.end)
setkey(bim, chr, start, end)

# calculate depth counts
# assign data.tables to GRanges objects
GRseg <- makeGRangesFromDataFrame(seg.file)
GRbim <- makeGRangesFromDataFrame(bim)

# make new dataframe combining bim site info with counts
newdepthcounts <- as.data.frame(bim[,c("chr","start")])

# count number of segments that overlap sites in bim file, assign to new column
newdepthcounts$count <- countOverlaps(GRbim, GRseg, type = "any")
colnames(newdepthcounts) <- c("CHR","BP","COUNT") # assign final column names

write.table(newdepthcounts, "", quote=F, row.names=F, col.names=F, sep=" ") # write to stdout
#!/usr/bin/env Rscript
# Join RoH/GARLIC segments separated by gaps that fall in low-SNP-density regions

if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
if (!requireNamespace("GenomicRanges", quietly = TRUE))
  BiocManager::install("GenomicRanges")

args = commandArgs(trailingOnly=TRUE)

# Defaults
perc.thresh <- 0.05
perc.overlap <- 0.8

# Error handling and updating default values
if (length(args)<3) {
  stop("At least three arguments must be supplied", call.=FALSE)
} else if (length(args)==4) {
  perc.thresh <- as.numeric(args[4])
} else if (length(args)==5) {
  perc.thresh <- as.numeric(args[4])
  perc.overlap <- as.numeric(args[5])
} else if (length(args)>5) {
  stop("Too many arguments", call.=FALSE)
}

options(stringsAsFactors=F)

Bim <- read.table(args[1])
WinSize <- as.numeric(args[3])

# Determine type of input segment file
cat("Determining input file type... ")
fileNameSplit <- strsplit(args[2], split=".", fixed=T)[[1]]
ext.ind <- match("hom", fileNameSplit)
if (length(fileNameSplit) > 1 & !is.na(ext.ind)) {
  stem <- paste(fileNameSplit[-ext.ind], collapse=".")
  ext <- "hom"
} else {
  stem <- args[2]
  ext <- "garlic"
}

if (ext=="hom") {
  Seg <- read.table(args[2], header=T)
  if (ncol(Seg)==13) {
    fileType <- "PLINK"
    colnames(Seg)[9] <- "LENGTH"
  }
  cat(fileType, "file type detected\n")
} else {
  Seg <- read.table(args[2], skip=1)
  Seg <- cbind(Seg, rep("placeholder_ID", nrow(Seg)))
  if (ncol(Seg)==10) {
    fileType <- "GARLIC"
    cat("WARNING: GARLIC file type detected - header line dropped\n")
    colnames(Seg)[c(1,2,3,4,5,10)] <- c("CHR","POS1","POS2","CLASS","LENGTH","IID")
    for (i in 1:22) {
      Seg[which(Seg$CHR==paste("chr", i, sep="")), "CHR"] <- i
    }
    Seg$CHR <- as.numeric(Seg$CHR)
    class.ranges <- list(); length(class.ranges) <- 3; names(class.ranges) <- c("A","B","C")
    for (class in c("A","B","C")) {
      class.ranges[[class]] <- range(as.numeric(Seg[which(Seg$CLASS==class), "LENGTH"]))
    }
    # Set class boundaries at the midpoints between adjacent class length ranges
    mid.AB <- mean(c(class.ranges[["A"]][2], class.ranges[["B"]][1]))
    class.ranges[["A"]][2] <- mid.AB
    class.ranges[["B"]][1] <- mid.AB + 0.5
    mid.BC <- mean(c(class.ranges[["B"]][2], class.ranges[["C"]][1]))
    class.ranges[["B"]][2] <- mid.BC
    class.ranges[["C"]][1] <- mid.BC + 0.5
  }
}

# Create copy of input file - eventual output file
Seg.new <- Seg

genome.SNP.count <- c()
chr.hists <- list()
length(chr.hists) <- length(unique(Bim$V1))
names(chr.hists) <- unique(Bim$V1)
for (chr in unique(Bim$V1)) {
  chr.data <- Bim[Bim$V1==chr,]
  start <- chr.data[1, "V4"]
  end <- chr.data[nrow(chr.data), "V4"]
  chr.hists[[as.character(chr)]] <- hist(chr.data$V4, breaks=seq(start, end+WinSize, WinSize), plot=F)
  genome.SNP.count <- c(genome.SNP.count, chr.hists[[as.character(chr)]]$counts)
}
min.SNPs <- max(sort(genome.SNP.count)[length(genome.SNP.count)*perc.thresh], 1/5e4*WinSize)

suppressMessages(library("GenomicRanges"))

for (chr in sort(unique(Seg$CHR))) {
  cat("Starting chr", chr, "\n")

  # Calculate SNP density across the chromosome in windows
  Hist <- chr.hists[[as.character(chr)]]
  Counts <- Hist$counts
  Breaks <- Hist$breaks

  # Store count information per region
  if (fileType=="PLINK") {
    chrWinInfo <- as.data.frame(matrix(ncol=4, nrow=length(Counts)))
    colnames(chrWinInfo) <- c("chr", "start", "end", "nSNPs")
  } else if (fileType=="GARLIC") {
    chrWinInfo <- as.data.frame(matrix(ncol=3, nrow=length(Counts)))
    colnames(chrWinInfo) <- c("chr", "start", "end")
  }
  chrWinInfo$chr <- chr
  chrWinInfo$start <- Breaks[1:length(Counts)]
  chrWinInfo$end <- Breaks[2:length(Breaks)]
  chrWinInfo$nSNPs <- Counts

  # Find low density regions
  low.dens.wins <- chrWinInfo[which(chrWinInfo$nSNPs < min.SNPs), 1:3]

  # If low density regions are found, join contiguous windows into regions
  if (nrow(low.dens.wins) > 0) {
    Low.Density.Regions <- low.dens.wins[1, ]
    if (nrow(low.dens.wins) > 1) {
      for (x in 2:nrow(low.dens.wins)) {
        if (low.dens.wins[x, "start"] == Low.Density.Regions[nrow(Low.Density.Regions), "end"]) {
          Low.Density.Regions[nrow(Low.Density.Regions), "end"] <- low.dens.wins[x, "end"]
        } else {
          Low.Density.Regions[nrow(Low.Density.Regions)+1, ] <- low.dens.wins[x, ]
        }
      }
    }
    Low.Density.Ranges <- makeGRangesFromDataFrame(Low.Density.Regions)

    # Join segments in input file
    for (Ind in unique(Seg$IID)) {
      chrSegInd <- Seg[Seg$CHR==chr & Seg$IID==Ind,]
      chrSegInd <- chrSegInd[order(chrSegInd$POS1),]
      if (nrow(chrSegInd)>1) {
        RoH.gaps <- as.data.frame(matrix(ncol=3, nrow=0))
        colnames(RoH.gaps) <- c("chr","start","end")
        for (y in 2:max(2, nrow(chrSegInd))) {
          RoH.gaps[y-1, ] <- c(chr, chrSegInd[y-1, "POS2"], chrSegInd[y, "POS1"])
        }
        RoH.gaps <- RoH.gaps[which(RoH.gaps$end > RoH.gaps$start),]
        if (nrow(RoH.gaps) > 0) {
          RoH.gap.Ranges <- makeGRangesFromDataFrame(RoH.gaps)

          # Do any gaps overlap a low density region?
          Hits <- findOverlaps(RoH.gap.Ranges, Low.Density.Ranges) # find all hits
          Overlaps <- pintersect(Low.Density.Ranges[subjectHits(Hits)], RoH.gap.Ranges[queryHits(Hits)])
          percentOverlap <- width(Overlaps) / width(RoH.gap.Ranges[queryHits(Hits)]) # calculate overlap
          Hits <- Hits[percentOverlap > perc.overlap] # sufficient overlap?

          joinOverGaps <- RoH.gaps[queryHits(Hits),]
          if (nrow(joinOverGaps)>0) {
            for (z in 1:nrow(joinOverGaps)) {
              seg1 <- rownames(chrSegInd[which(chrSegInd$POS2 == joinOverGaps[z, "start"]),])[1]
              seg2 <- rownames(chrSegInd[which(chrSegInd$POS1 == joinOverGaps[z, "end"]),])[1]
              joinedSeg <- chrSegInd[as.character(seg1),]
              joinedSeg$POS2 <- chrSegInd[as.character(seg2),"POS2"]
              joinedSeg$LENGTH <- chrSegInd[as.character(seg1),"LENGTH"] + chrSegInd[as.character(seg2),"LENGTH"]
              if (fileType=="PLINK") {
                joinedSeg$SNP2 <- chrSegInd[as.character(seg2),"SNP2"]
                joinedSeg$NSNP <- chrSegInd[as.character(seg1),"NSNP"] + chrSegInd[as.character(seg2),"NSNP"]
              } else if (fileType=="GARLIC") {
                if (joinedSeg$LENGTH < class.ranges[["A"]][1]) {
                  joinedSeg$CLASS <- "A"
                } else if (joinedSeg$LENGTH >= class.ranges[["A"]][1] & joinedSeg$LENGTH <= class.ranges[["A"]][2]) {
                  joinedSeg$CLASS <- "A"
                } else if (joinedSeg$LENGTH >= class.ranges[["B"]][1] & joinedSeg$LENGTH <= class.ranges[["B"]][2]) {
                  joinedSeg$CLASS <- "B"
                } else if (joinedSeg$LENGTH >= class.ranges[["C"]][1] & joinedSeg$LENGTH <= class.ranges[["C"]][2]) {
                  joinedSeg$CLASS <- "C"
                } else if (joinedSeg$LENGTH > class.ranges[["C"]][2]) {
                  joinedSeg$CLASS <- "C"
                }
              }
              Seg.new <- rbind(Seg.new, joinedSeg)
              # Remove the two split segments from the Seg.new object
              Seg.new <- Seg.new[-match(as.character(seg1), rownames(Seg.new)),]
              Seg.new <- Seg.new[-match(as.character(seg2), rownames(Seg.new)),]
            }
          }
        }
      }
    }
  }
}

if (fileType=="PLINK") {
  colnames(Seg.new)[9] <- "KB"
  write.table(Seg.new, paste(stem, "_segsJoined.hom", sep=""), quote=F, row.names=F, col.names=T)
} else if (fileType=="GARLIC") {
  Seg.new <- Seg.new[,-ncol(Seg.new)]
  Seg.new$CHR <- paste("chr", Seg.new$CHR, sep="")
  write.table(Seg.new, paste(stem, "_segsJoined.garlic", sep=""), quote=F, row.names=F, col.names=F, sep="\t")
}
INPUT=$1
MAP=$2
LOG=$3
OUTPUT=$4
REF_HAP=$5
REF_LEG=$6
REF_SAMP=$7

# ShapeIT phasing and conversion to plink files
#----------------------------------------------

# Divide plink file by chromosome, trim SNPs that lie outside the boundaries of the genetic map
Rscript scripts/trim_bimfile.R ${INPUT} ${MAP}
plink --bfile ${INPUT} --exclude ${INPUT}_removeSNPs --make-bed --out ${INPUT}_endsTrimmed
#cat ${FILE}.chr${chr}_endsTrimmed.bim >> dat_QCed_endsTrimmed.bim

# Run SHAPEIT
shapeit -B ${INPUT}_endsTrimmed -M ${MAP} --input-ref ${REF_HAP} ${REF_LEG} ${REF_SAMP} --duohmm -W 5 -O ${OUTPUT} --output-log ${LOG}.log -T 4

# If SHAPEIT reports strand/alignment problems (.snp.strand file), repair and re-run until none remain
files=( ${LOG}'.snp.strand' )
if (( ${#files[@]} )); then
  N=`ls -l ${LOG}.snp.strand 2>/dev/null | wc -l`
  while [ $N -gt 0 ]
  do
    grep Missing ${LOG}.snp.strand | cut -f4 > ${INPUT}_missing_from_ref.txt
    grep Strand ${LOG}.snp.strand | awk '$11==1 {print $4}' > ${INPUT}_flip_in_dataset.txt
    grep Strand ${LOG}.snp.strand | awk '$11==0 {print $4}' > ${INPUT}_cannot_flip_in_dataset.txt
    cat ${INPUT}_missing_from_ref.txt ${INPUT}_cannot_flip_in_dataset.txt > ${INPUT}_remove_from_dataset.txt
    rm -f ${LOG}.snp.strand*
    rm -f ${INPUT}_missing_from_ref.txt ${INPUT}_cannot_flip_in_dataset.txt
    plink --bfile ${INPUT}_endsTrimmed --flip ${INPUT}_flip_in_dataset.txt --exclude ${INPUT}_remove_from_dataset.txt --make-bed --out ${INPUT}_endsTrimmed_misalignRepaired
    shapeit -B ${INPUT}_endsTrimmed_misalignRepaired -M ${MAP} --input-ref ${REF_HAP} ${REF_LEG} ${REF_SAMP} --duohmm -W 5 -O ${OUTPUT} --output-log ${LOG}.log -T 4
    N=`ls -l ${LOG}.snp.strand 2>/dev/null | wc -l`
  done
fi
from numpy import *
from scipy.interpolate import interp1d
from sys import argv

# use, e.g.:
# python shapeit_to_germline.py shapeitoutputprefix geneticmapfile
# python shapeit_to_germline.py Ethiopians_chr21_SHAPEITphased_duoHMM genetic_map_chr21_combined_b37.txt

prefix = argv[-2]
hapsfile = prefix + '.haps'
samplefile = prefix + '.sample'
hapmapfile = argv[-1]

famdata = [line.strip().split() for line in file(samplefile)][2:]

# read in haplotypes
print 'reading in haps data...'
hapsdata = [line.strip().split() for line in file(hapsfile)]
print 'splitting haps data'
mapinfo = array([line[:3] for line in hapsdata])
variants = array([line[3:5] for line in hapsdata])
alleles = array([line[5:] for line in hapsdata], dtype=int)
#hapsdata = array([line.strip().split()[:3] for line in file(hapsfile)])
bps = array(mapinfo[:,2], dtype=int)

# read in hapmap file
mapdata = array([line.strip().split() for line in file(hapmapfile)][2:], dtype=float)
happos = mapdata[:,0]
hapcms = mapdata[:,2]

# now create interpolation function
f = interp1d(happos, hapcms, fill_value='extrapolate')
bpcms = f(bps)

# make map file
outmap = file(prefix + '.map', 'w')
print 'writing mapfile to %s' % (outmap)
for i, line in enumerate(mapinfo):
    outmap.write('%s\t%s\t%s\t%s\n' % (line[0], line[1], bpcms[i], line[2]))

# make ped file. More annoying for sure!
# brute force for now, hopefully it runs okay (not huge files)
outped = file(prefix + '.ped', 'w')
print 'writing pedfile to %s' % (outped)
alleles_acgt = array([variants[i][alleles[i]] for i in range(len(variants))])
for i, line in enumerate(famdata):
    if i % 20 == 0:
        print 'on individual %s' % (i)
    # zip together haplotypes 2*i and 2*i+1 so they're on the same line
    plinkalleles = array(ndarray.flatten(array(zip(alleles_acgt[:,2*i], alleles_acgt[:,2*i+1]))), dtype='S1')
    outped.write('%s %s\n' % (' '.join(line[:2] + line[3:]), ' '.join(plinkalleles)))

outmap.close()
outped.close()
print 'done'
shell:
    """
    plink --bfile data/{wildcards.dataset} --chr {wildcards.chrnum} --make-bed --out {params}
    """
shell:
    """
    bash scripts/shapeit_iterate.sh {params.in_pre} {params.inputmap} {params.log_pre} {params.out_pre} {REF}_chr{wildcards.chrnum}.hap.gz {REF}_chr{wildcards.chrnum}.legend.gz {REF}.sample
    """
shell:
    """
    python scripts/shapeit_to_germline.py {params.prefix} {params.gmap}
    """
shell:
    """
    shapeit -convert --input-haps {params.in_pre} --output-vcf {output.vcf} --output-log {output.log}
    """
shell:
    """
    sed -n -e '/^#/ p' prepare/chroms_phased_vcf/{wildcards.dataset}.chr1.phased.vcf > {output} # initialize allchr.vcf with headers
    for i in {input}; do sed '/^#/d' $i >> {output} ; done
    """
shell:
    """
    plink --vcf {input} --recode --out {params}
    """
shell:
    """
    plink --file {params.in_pre} --cm-map {params.gmap} --recode --out {params.out_pre}
    """
shell:
    """
    plink --vcf {input} --make-just-bim --out {params.out_pre}
    """
shell:
    """
    Rscript scripts/find_lowDensityRegs.R {input} {params} 1e6 0.05
    """
shell:
    """
    gawk ' {{ t = $1; $1 = $2; $2 = t; print; }} ' {input} > {output}
    """
shell:
    """
    GERMLINE2 -mapfile {input.map} -pedfile {input.ped} -outfile {params.prefix} -err_hom {wildcards.err} -bits {wildcards.bits} -w_extend -haploid
    """
shell:
    """
    scripts/comb_match_files.sh {DATASET}
    """
shell:
    """
    awk '{{print $1 , "0" , $2 , "0" , $3 , $4 , $5, "20"}}' {input} > {output}
    """
shell:
    """
    cat {input.ibd} | java -jar progs/merge-ibd-segments.16May19.ad5.jar {input.vcf} {input.map} 0.6 1 | awk '{{print $1, $2, $3, $4, $5, $6, $7, $9}}' > {output}
    """
shell:
    """
    Rscript scripts/get_seg_depthv3.R {input.ibd} {input.bim} > {output}
    """
shell:
    """
    Rscript scripts/get_outlier_SNPs.R {input} 3 > {output}
    """
shell:
    """
    bash scripts/define_regions.sh {input} {DATASET}
    """
shell:
    """
    Rscript scripts/filter_RoH-IBD_segs.R {input.ibd} {input.txt} 0.85 > {output}
    """
shell:
    """
    plink --bfile {params.in_pre} --homozyg --homozyg-snp {N_SNPS} --homozyg-window-missing {MISSING} --homozyg-window-het {HET} --homozyg-kb {RoH_LEN} --out {params.out_pre}
    """
shell:
    """
    Rscript scripts/join_RoHsegs.R {input.bim} {input.hom} 1e6 0.05 0.8
    """
shell:
    """
    Rscript scripts/find_lowDensityRegs.R {input} {params.out_pre} 1e6 0.05
    cat regions/exclude_regions_hg19.txt {output.ldr} > {output.reg}
    """
shell:
    """
    Rscript scripts/get_seg_depthv2.R {input.hom} {input.bim} > {output}
    """
shell:
    """
    Rscript scripts/get_outlier_SNPs.R {input.roh} > {output.outliers}
    Rscript scripts/get_outlier_regions.R {output.outliers} > {output.regions}
    cat {input.txt} {output.regions} > {output.txt}
    """
shell:
    """
    Rscript scripts/filter_RoH-IBD_segs.R {input.hom} {input.txt} 0.85 > {output}
    """
shell:
    """
    bash scripts/bits_matching.sh {DATASET}
    """
shell:
    """
    plink --bfile {params.pre_one} --maf {MAF} --make-bed --out {params.pre_two}_HWE-MAFfiltered
    plink --bfile {params.pre_two}_HWE-MAFfiltered --genome --out {params.pre_two}
    """
shell:
    """
    bash scripts/calculate_concordance.sh {DATASET}
    """
shell:
    """
    bash scripts/choose_params.sh {input.ibd} {input.norm} {wildcards.dataset}
    """
shell:
    """
    cat {input.ibd} | java -jar progs/ibdne.04Sep15.e78.jar map={input.map} out=results/IBDne_output/{wildcards.dataset}_GERMLINE2_IBDNe_minibd-{cM} minibd={cM} nthreads={IBDne_THREADs} nboots=80 gmax={GMAX}
    """