Flashlite-Juicer: Juicer Implementation for FlashLite HPC


juicer

PBS version of https://github.com/aidenlab/juicer for Flashlite HPC.

If you find this useful, please acknowledge us, e.g.:

The authors acknowledge code provided by the Sydney Informatics Hub, a Core Research Facility of the University of Sydney.

Simple Instructions

To replicate the workflow from https://github.com/aidenlab/juicer/wiki/Running-Juicer-on-a-cluster, follow these steps:

#Change to a directory you want to install and run your juicer analyses, e.g.
cd /30days/natbutter
#Clone this repo to that folder
git clone https://github.com/natbutter/juicer.git
#Change into the repo directory
cd juicer
#Link the PBS scripts for ease (as is done on the official repo)
ln -s PBS/scripts scripts
#Make and get references and restriction_site files (put your own here as needed)
mkdir references; cd references
wget https://s3.amazonaws.com/juicerawsmirror/opt/juicer/references/Homo_sapiens_assembly19.fasta
wget https://s3.amazonaws.com/juicerawsmirror/opt/juicer/references/Homo_sapiens_assembly19.fasta.amb
wget https://s3.amazonaws.com/juicerawsmirror/opt/juicer/references/Homo_sapiens_assembly19.fasta.ann
wget https://s3.amazonaws.com/juicerawsmirror/opt/juicer/references/Homo_sapiens_assembly19.fasta.bwt
wget https://s3.amazonaws.com/juicerawsmirror/opt/juicer/references/Homo_sapiens_assembly19.fasta.pac
wget https://s3.amazonaws.com/juicerawsmirror/opt/juicer/references/Homo_sapiens_assembly19.fasta.sa
mkdir ../restriction_sites; cd ../restriction_sites
wget https://s3.amazonaws.com/juicerawsmirror/opt/juicer/restriction_sites/hg19_MboI.txt
#Get the juicer tools jar to run with your version (if you need a different one from the repo)
cd ../scripts
wget https://hicfiles.tc4ga.com/public/juicer/juicer_tools.1.9.9_jcuda.0.8.jar
ln -s juicer_tools.1.9.9_jcuda.0.8.jar juicer_tools.jar
#Make a directory where you will be running your analysis (and populating with output files)
mkdir ../HIC003; cd ../HIC003
mkdir fastq; cd fastq
wget http://juicerawsmirror.s3.amazonaws.com/opt/juicer/work/HIC003/fastq/HIC003_S2_L001_R1_001.fastq.gz
wget http://juicerawsmirror.s3.amazonaws.com/opt/juicer/work/HIC003/fastq/HIC003_S2_L001_R2_001.fastq.gz

Now edit the scripts/juicer.sh script to point to the correct project, and adjust the other configuration options. Note that you may need to adjust some of the hardcoded RAM/CPU resource requests throughout the workflow, in addition to those set in the config header lines.

juiceDir="/30days/natbutter/juicer"
project='NCMAS-xx99'
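
For example, the alignment jobs inside scripts/juicer.sh request fixed resources like the line below (quoted from the juicer.sh script further down this page); resize it to suit your data and the queue limits:

#Hardcoded request used by the read-alignment qsub blocks in juicer.sh
#PBS -l select=1:ncpus=8:mem=80gb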

Finally, run the script, which will create all the jobs and submit them to the queue:

cd /30days/natbutter/juicer/HIC003
../scripts/juicer.sh
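
You can keep an eye on the submitted jobs with a standard PBS query, e.g.:

#List your queued/running jobs; all of a run's job names end in that run's groupname
qstat -u $USER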

Directory structure

This is what my directory structure looked like after setting everything up and running (to two levels deep):

natbutter@flashlite1:/30days/natbutter/juicer> tree -L 2
.
|-- HIC003
| |-- aligned
| |-- fastq
| |-- logs
| `-- splits
|-- PBS
| `-- scripts
|-- README.md
|-- references
| |-- Homo_sapiens_assembly19.fasta
| |-- Homo_sapiens_assembly19.fasta.amb
| |-- Homo_sapiens_assembly19.fasta.ann
| |-- Homo_sapiens_assembly19.fasta.bwt
| |-- Homo_sapiens_assembly19.fasta.pac
| `-- Homo_sapiens_assembly19.fasta.sa
|-- restriction_sites
| `-- hg19_MboI.txt
`-- scripts -> PBS/scripts

Significant changes and updates to get Juicer running on the Flashlite HPC

https://rcc.uq.edu.au/flashlite

Changes:

  • Added a $project variable and specified it in every qsub with #PBS -A $project

  • Changed nodes=1:ppn=1 -> select=1:ncpus=1 (see the side-by-side example after this list)

  • Changed the qstat greps to cut job IDs at the correct column positions for Flashlite's PBS output

  • Hardcoded a few mem= and ncpus= values that may need to be adjusted for specific runs

  • Using local module versions: module load bwa/0.7.13 and module load Java/1.8.0_45

  • No GPUs on Flashlite, so the HiCCUPs step is not tested.
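
For illustration, here is the resource-request change side by side (both forms appear in the two versions of juicer.sh quoted below):

#Torque-style request in the original PBS version
#PBS -l nodes=1:ppn=1:AMD

#PBS Pro request used on Flashlite, plus the accounting line
#PBS -l select=1:ncpus=1:mem=4gb
#PBS -A $project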

Original README for the PBS version below

I'm very new to coding and this is my first "big" project modifying a script, so you're welcome to make it better!

This PBS version is modified mainly from the LSF version, but other versions were also used as reference. The main change in creating this version was building the job dependencies that guarantee the sequential steps of each job in the original juicer.sh script; the #PBS -W depend=afterok:jobID header line is used for this. The launch-stats and post-processing steps have been moved into two separate scripts to avoid variable-expansion problems in multiple levels of heredocs; these two scripts are called from the main juicer.sh at the appropriate step.

In summary, the following scripts have been modified: juicer.sh, split_rmdups.awk, mega.sh. The following two scripts have been added (extracted from juicer.sh): launch_stats.sh, postprocessing.sh
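
A minimal sketch of that dependency pattern, with illustrative job names (the real scripts wrap each step in a qsub heredoc):

#Submit step 1 with a known job Name, then look up its jobID in qstat
qsub step1.sh                                        # contains: #PBS -N Step1_${groupname}
jID_1=$(qstat | grep "Step1_${groupname}" | cut -c 1-7)
#Step 2 is held until step 1 finishes successfully
qsub -W depend=afterok:${jID_1} step2.sh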

All steps have been tested up to the ARROWHEAD step. The HiCCUPS step has not been tested in this version because of an unresolved error in loading the Jcuda native libraries.

One important notice for a successful run of this PBS version: the $groupname variable has to be at most 7 characters long (this limit may change according to your cluster setup; see below for the reason). The job dependencies in this PBS version of juicer are built from the jobID of each step, and the jobID of each job is obtained from the qstat output by searching for its specific job Name, e.g. jobID_stepx=$(qstat | grep "<job Name>" | cut -c 1-7). Unfortunately, PBS cannot use the job Name itself when building job dependencies.

The job Name column in qstat output has a maximum width of 16 characters by default. When a job Name exceeds this maximum, the beginning of the name is replaced by "..." followed by the last 13 characters (see the example below). If the job Name is not fully displayed, the jobID value may be null because the grep fails. Each job Name contains $groupname, which gives each run a specific timestamp and avoids disruption from other juicer runs by the same user, so a ${groupname} value that is too long is a likely cause of this problem. I have added comments in the juicer.sh script for setting this variable. Below is an example of job names being too long:

$ qstat
Job ID       Name              User     Time Use  S  Queue
2294690.pbs  STDIN             mzhibo   00:00:15  R  batch
2294778.pbs  make4Cbedgraph    mzhibo   0         R  batch
2294784.pbs  ...dgraph1111111  mzhibo   0         Q  batch
$
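
As a quick length check under these constraints (a sketch; the name patterns follow those used in juicer.sh): with a 7-character $groupname and a job string of at most 7 characters, the full job Name fits within qstat's 16-character column, so the grep succeeds:

groupname="C$(date "+%s"|cut -c 6-11)"   # "C" + 6 digits = 7 characters
#e.g. "Mrg_3" (5 chars) + groupname (7 chars) = 12 chars, fully visible in qstat
jID=$(qstat | grep "Mrg_3${groupname}" | cut -c 1-7)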

Zhibo

Code Snippets

res1=`cat ${splitdir}/*norm*res*`
check1=`cat ${splitdir}/*norm*res* | awk '{s2+=$2; s3+=$3; s4+=$4; s5+=$5; s6+=$6}END{if (s2 != s3+s4+s5+s6){print 0}else{print 1}}'`
if [ $check1 -eq 0 ] || [ -z "$res1" ]
then
    echo "***! Error! The statistics do not add up. Alignment likely failed to complete on one or more files. Run relaunch_prep.sh"
    echo "Stats don't add up.  Check ${outputdir} for results"
    exit 1
fi

# Check the sizes of merged_sort versus the dups/no dups files to be sure
# no reads were lost
total=1
total2=0
total=`ls -l ${outputdir}/merged_sort.txt | awk '{print $5}'`
total2=`ls -l ${outputdir}/merged_nodups.txt ${outputdir}/dups.txt ${outputdir}/opt_dups.txt | awk '{sum = sum + $5}END{print sum}'`

if [ -z $total ] || [ -z $total2 ] || [ $total -ne $total2 ]
then
    echo "***! Error! The sorted file and dups/no dups files do not add up, or were empty. Merge or dedupping likely failed, restart pipeline with -S merge or -S dedup"
    echo "Dups don't add up.  Check ${outputdir} for results"
    exit 1
fi

wctotal=`cat ${splitdir}/*_linecount.txt | awk '{sum+=$1}END{print sum/4}'`
check2=`cat ${splitdir}/*norm*res* | awk '{s2+=$2;}END{print s2}'`

if [ $wctotal -ne $check2 ]
then
    echo "***! Error! The number of reads in the fastqs (${wctotal}) is not the same as the number of reads reported in the stats (${check2}), likely due to a failure during alignment"
    echo "Reads don't add up.  Check ${outputdir} for results"
    exit 1
fi

if [ -n "$early" ]
then
    echo "(-: Pipeline successfully completed (-:";
    echo "Run cleanup.sh to remove the splits directory";
    echo "Check ${outputdir} for results"
elif [ -f ${outputdir}/inter.hic ] && [ -s ${outputdir}/inter.hic ] && [ -f ${outputdir}/inter_30.hic ] && [ -s ${outputdir}/inter_30.hic ]
then
    echo "(-: Pipeline successfully completed (-:";
    echo "Run cleanup.sh to remove the splits directory";
    echo "Check ${outputdir} for results"
else
    echo "***! Error! Either inter.hic or inter_30.hic were not created"
    echo "Either inter.hic or inter_30.hic were not created.  Check ${outputdir} for results"
    exit 1
fi
Shell From line 31 of scripts/check.sh
total=`ls -l aligned/merged_sort.txt | awk '{print $5}'`
total2=`ls -l aligned/merged_nodups.txt aligned/dups.txt aligned/opt_dups.txt | awk '{sum = sum + $5}END{print sum}'`
if [ $total -eq $total2 ] 
then 
    rm aligned/merged_sort.txt 
    rm -r splits 
    for i in fastq/*.fastq
    do
        gzip $i
    done
    gzip aligned/merged_nodups.txt
    gzip aligned/dups.txt
    gzip aligned/opt_dups.txt
    gzip aligned/abnormal.sam
    gzip aligned/collisions.txt
    gzip aligned/unmapped.sam
else 
    echo "Problem: The sum of merged_nodups and the dups files is not the same size as merged_sort.txt"
    echo "Did NOT clean up";
fi
Shell From line 28 of scripts/cleanup.sh
if [ "$usegzip" -eq 1 ]
then 
    num1=$(paste -d "" <(zcat ${name1}${ext}) <(zcat ${name2}${ext}) | grep -c $ligation)
    num2=$(zcat ${name1}${ext} | wc -l | awk '{print $1}')
else
    num1=$(paste -d "" ${name1}${ext} ${name2}${ext}| grep -c $ligation)
    num2=$(wc -l ${name1}${ext} | awk '{print $1}')
fi
echo -ne "$num1 " > ${name}${ext}_norm.txt.res.txt
echo "$num2" > ${name}${ext}_linecount.txt
use POSIX;

$site_file = "/opt/juicer/restriction_sites/hg19_DpnII.txt";
# Check arguments
if (scalar(@ARGV) == 2) {
  ($infile,$outfile) = @ARGV;
}
elsif (scalar(@ARGV) == 3) {
  ($infile,$outfile,$site_file) = @ARGV;
}
else {
  print "Usage: fragment_4dnpairs.pl <infile> <outfile> [site file]\n";
  print " <infile>: file in intermediate format to calculate statistics on\n";
  print " <outfile>: output, results of fragment search\n";  
  print " [site file]: list of restriction sites, one line per chromosome (default DpnII hg19)\n";
  exit;
}
# Global variables for calculating statistics
my %chromosomes;
my %hindIII;

# read in restriction site file and store as multidimensional array
open FILE, $site_file or die $!;

while (<FILE>) {
  my @locs = split;
  my $key = shift(@locs);
  my $ref = \@locs;
  $chromosomes{$key} = $ref;
	if ($key == "14") {
		$chromosomes{$key."m"} = $ref;  
		$chromosomes{$key."p"} = $ref;
	}
}
close(FILE);

# read in infile and calculate statistics
open INFILE, $infile or die $!;
open OUTFILE,">", $outfile or die $!;
$"="\t";
while (<INFILE>) {
  chomp;
  if(/^#columns:\s*/) { 
      if(/frag1/ || /frag2/) { die "frag columns already exist. Aborting..\n"; }
      print OUTFILE "$_ frag1 frag2\n"; next; 
  }
  elsif(/^#/) { print OUTFILE "$_\n"; next; }
  my @original_record = split;
  my @record = @original_record[5,1,2,6,3,4];

  # find upper index of position in sites array via binary search
  my $index1 = &bsearch($record[2],$chromosomes{$record[1]});
  my $index2 = &bsearch($record[5],$chromosomes{$record[4]});
  print OUTFILE "@original_record\t$index1\t$index2\n";
}
close(INFILE);
close(OUTFILE);

# Binary search, array passed by reference
# search array of integers a for given integer x
# return index where found or upper index if not found
sub bsearch {
  my ($x, $a) = @_;            # search for x in array a
  my ($l, $u) = (0, @$a - 1);  # lower, upper end of search interval
  my $i;                       # index of probe
  while ($l <= $u) {
    $i = int(($l + $u)/2);
    if ($a->[$i] < $x) {
      $l = $i+1;
    }
    elsif ($a->[$i] > $x) {
      $u = $i-1;
    } 
    else {
      return $i+1; # found
    }
  }
  return $l;         
}
use POSIX;

$site_file = "/opt/juicer/restriction_sites/hg19_DpnII.txt";
# Check arguments
if (scalar(@ARGV) == 2) {
  ($infile,$outfile) = @ARGV;
}
elsif (scalar(@ARGV) == 3) {
  ($infile,$outfile,$site_file) = @ARGV;
}
else {
  print "Usage: fragment.pl <infile> <outfile> [site file]\n";
  print " <infile>: file in intermediate format to calculate statistics on\n";
  print " <outfile>: output, results of fragment search\n";  
  print " [site file]: list of restriction sites, one line per chromosome (default DpnII hg19)\n";
  exit;
}
# Global variables for calculating statistics
my %chromosomes;
my %hindIII;

# read in restriction site file and store as multidimensional array
open FILE, $site_file or die $!;

while (<FILE>) {
  my @locs = split;
  my $key = shift(@locs);
  my $ref = \@locs;
  $chromosomes{$key} = $ref;
	if ($key == "14") {
		$chromosomes{$key."m"} = $ref;  
		$chromosomes{$key."p"} = $ref;
	}
}
close(FILE);

# read in infile and calculate statistics
open INFILE, $infile or die $!;
open OUTFILE,">", $outfile or die $!;

while (<INFILE>) {
  my @record = split;

  # find upper index of position in sites array via binary search
  my $index1 = &bsearch($record[2],$chromosomes{$record[1]});
  my $index2 = &bsearch($record[5],$chromosomes{$record[4]});
  print OUTFILE $record[0] . " " . $record[1] . " " . $record[2] . " " . $index1 . " " . $record[3] . " " . $record[4] . " " . $record[5] . " " . $index2 . " ";
	for (my $i=6; $i < scalar(@record); $i++){
		print OUTFILE $record[$i] . " ";
	}
	print OUTFILE "\n";
}
close(INFILE);
close(OUTFILE);

# Binary search, array passed by reference
# search array of integers a for given integer x
# return index where found or upper index if not found
sub bsearch {
  my ($x, $a) = @_;            # search for x in array a
  my ($l, $u) = (0, @$a - 1);  # lower, upper end of search interval
  my $i;                       # index of probe
  while ($l <= $u) {
    $i = int(($l + $u)/2);
    if ($a->[$i] < $x) {
      $l = $i+1;
    }
    elsif ($a->[$i] > $x) {
      $u = $i-1;
    } 
    else {
      return $i+1; # found
    }
  }
  return $l;         
}
Perl From line 37 of scripts/fragment.pl
usageHelp="Usage: ${0} [-h] -j <juicer_tools_file_path> -i <hic_file_path>"

printHelpAndExit() {
    echo "$usageHelp"
    exit $1
}

#set defaults
genomeID="hg19"
hic_file_path="$(pwd)/aligned/inter_30.hic"
juicer_tools_path="/opt/juicer/scripts/juicer_tools"

while getopts "h:j:i:" opt; do
    case $opt in
	h) printHelpAndExit 0;;
	j) juicer_tools_path=$OPTARG ;;
	i) hic_file_path=$OPTARG ;;
	[?]) printHelpAndExit 1;;
    esac
done

## Check that juicer tools exists 
if [ ! -e "${juicer_tools_path}" ]; then
  echo "***! Can't find juicer tools in ${juicer_tools_path}";
  exit 100;
fi

## Check that hic file exists    
if [ ! -e "${hic_file_path}" ]; then
  echo "***! Can't find inter.hic in ${hic_file_path}";
  exit 100;
fi

echo -e "${juicer_tools_path} is post-processing Hi-C for ${genomeID}\nData read from ${hic_file_path}.\nMotifs read from ${bed_file_dir}\n"
echo -e "ARROWHEAD:\n"
${juicer_tools_path} arrowhead ${hic_file_path} ${hic_file_path%.*}"_contact_domains.txt"
if [ $? -ne 0 ]; then
    echo "***! Problem while running Arrowhead";
    exit 100
else
    echo -e "\n(-: Arrowhead Postprocessing successfully completed (-:"
fi
usageHelp="Usage: ${0} [-h] -j <juicer_tools_file_path> -i <hic_file_path> -m <bed_file_dir> -g <genome ID>"

printHelpAndExit() {
    echo "$usageHelp"
    exit $1
}

#set defaults
genomeID="hg19"
hic_file_path="$(pwd)/aligned/inter_30.hic"
juicer_tools_path="/opt/juicer/scripts/juicer_tools"
bed_file_dir="/opt/juicer/references/motif"

while getopts "h:g:j:i:m:" opt; do
    case $opt in
	h) printHelpAndExit 0;;
	j) juicer_tools_path=$OPTARG ;;
	i) hic_file_path=$OPTARG ;;
	m) bed_file_dir=$OPTARG ;; 
	g) genomeID=$OPTARG ;;
	[?]) printHelpAndExit 1;;
    esac
done

## Check that juicer_tools exists 
if [ ! -e "${juicer_tools_path}" ]; then
  echo "***! Can't find juicer tools in ${juicer_tools_path}";
  exit 100;
fi

## Check that hic file exists    
if [ ! -e "${hic_file_path}" ]; then
  echo "***! Can't find inter.hic in ${hic_file_path}";
  exit 100;
fi

## Check that bed folder exists    
if [ ! -e "${bed_file_dir}" ]; then
  echo "***! Can't find folder ${bed_file_dir}";
  exit 100;
fi

echo -e "\nHiCCUPS:\n"
${juicer_tools_path} hiccups ${hic_file_path} ${hic_file_path%.*}"_loops.txt"
if [ $? -ne 0 ]; then
    echo "***! Problem while running HiCCUPS";
    exit 100
fi

if [ -f ${hic_file_path%.*}"_loops.txt" ]
then
    echo -e "\nAPA:\n"
    ${juicer_tools_path} apa ${hic_file_path} ${hic_file_path%.*}"_loops.txt" "apa_results"
    echo -e "\nMOTIF FINDER:\n"
    ${juicer_tools_path} motifs ${genomeID} ${bed_file_dir} ${hic_file_path%.*}"_loops.txt"
    echo -e "\n(-: Feature annotation successfully completed (-:"
else
    # if loop lists do not exist but Juicer tools didn't return an error, likely 
    # too sparse
    echo -e "\n(-: Postprocessing successfully completed, maps too sparse to annotate (-:"
fi
juiceDir="/lustre1/mzhibo/hic/apps/juicer"
# default queue, can also be set in options via -q
queue="batch"
# default queue time, can also be set in options via -Q
walltime="walltime=24:00:00"
# default long queue, can also be set in options via -l
long_queue="batch"
# default long queue time, can also be set in options via -L
long_walltime="walltime=120:00:00"
read1str="_R1"
read2str="_R2"
groupname="C$(date "+%s"|cut -c 6-11)"
## Default options, overridden by command line arguments
groupname=$(date +"%s" |cut -c 6-11)
# top level directory, can also be set in options
topDir=$(pwd)
# restriction enzyme, can also be set in options
site="DpnII"
# genome ID, default to human, can also be set in options
genomeID="dm6_clean"
# normally both read ends are aligned with long read aligner; 
# if one end is short, this is set                 
shortreadend=0
# description, default empty
about=""
nofrag=0

# Directories to be created and regex strings for listing files
splitdir=${topDir}"/splits"
donesplitdir=$topDir"/done_splits"
fastqdir=${topDir}"/fastq/*_R*.fastq"
outputdir=${topDir}"/aligned"
tmpdir=${topDir}"/HIC_tmp"
logdir=${topDir}"/logs"
read1=${splitdir}"/*${read1str}*.fastq"
    timestamp=$(date +"%s" | cut -c 4-10)
    qsub <<ALIGNWRAP
    #PBS -S /bin/bash
    #PBS -q $queue
    #PBS -l $walltime
    #PBS -l nodes=1:ppn=1:AMD
    #PBS -l mem=2gb
    #PBS -o ${logdir}/${timestamp}_alnwrap_${groupname}.log
    #PBS -j oe
    #PBS -N AlnWrap_${groupname}
    #PBS -M mzhibo@uga.edu
    #PBS -m a
    for i in ${read1}
    do

        countjobs=\$(( countjobs + 1 ))
        ext=\${i#*$read1str}
        name=\${i%$read1str*} 
        # these names have to be right or it'll break
        name1=\${name}${read1str}
        name2=\${name}${read2str}    
        jname=\$(basename "\$name")\${ext}
        usegzip=0
        if [ "\${ext}" == ".gz" ]
        then
            usegzip=1
        fi
       qsub <<- MRGALL
        #PBS -S /bin/bash
        #PBS -q $queue
        #PBS -l $long_walltime
        #PBS -l nodes=1:ppn=1:AMD
        #PBS -l mem=24gb
        #PBS -M mzhibo@uga.edu
        #PBS -m a
        #PBS -o ${logdir}/\${timestamp}_\${jname}_merge_\${countjobs}_${groupname}.log
        #PBS -j oe
        #PBS -N Mrg_\${countjobs}${groupname}
        #PBS -v name=\${name}
        #PBS -v name1=\${name1}
        #PBS -v name2=\${name2}
        #PBS -v ext=\${ext}
        #PBS -v countjobs=\${countjobs}

        date +"%Y-%m-%d %H:%M:%S"
        export LC_ALL=C

        # sort read 1 aligned file by readname 
        # remove header, add read end indicator to read name
        awk -v name1=\${name1} -v ext=\${ext} -v name=\${name} -f ${juiceDir}/scripts/read1sort.awk \${name1}\${ext}_sort.sam > \${name1}\${ext}_sort1.sam

        #echo awk 'NF >= 11{ \$1 = \$1"/1";print}' \${name1}\${ext}_sort.sam > \${name1}\${ext}_sort1.sam'

        awk -v name1=\${name2} -v ext=\${ext} -v name=\${name} -f ${juiceDir}/scripts/read2sort.awk \${name2}\${ext}_sort.sam > \${name2}\${ext}_sort1.sam

        #awk '{if (NF>=11) {\$1=\$1"read2"; print $0}}' \${name1}\${ext}_sort.sam > \${name1}\${ext}_sort1.sam

        #awk 'NF >= 11{ \$1 = \$1"/2";print}' \${name2}\${ext}_sort.sam > \${name2}\${ext}_sort1.sam    
        # merge the two sorted read end files
        sort -T $tmpdir -k1,1 -m \${name1}\${ext}_sort1.sam \${name2}\${ext}_sort1.sam > \${name}\${ext}.sam
        if [ \$? -ne 0 ]
        then
            echo "***! Failure during merge of read files"
            echo "Merge of \${name}\${ext}.sam failed"
            exit 1
        else
            # rm \${name1}\${ext}.sa* \${name2}\${ext}.sa* \${name1}\${ext}_sort*.sam \${name2}\${ext}_sort*.sam
            echo "\${name}\$next.sam created successfully."
        fi 
MRGALL
    wait

    jID_3=\$(qstat | grep "Mrg_\${countjobs}${groupname}" | cut -c 1-7 )
    echo "merging align1 and align2 \${coutjobs} id is \${jID_3}"
    echo "starting chimeric step after alignment"
    timestamp=\$(date +"%s" | cut -c 4-10)
    qsub <<- CHIMERIC
    #PBS -S /bin/bash
    #PBS -q $queue
    #PBS -l $walltime
    #PBS -l nodes=1:ppn=1:AMD
    #PBS -l mem=24gb
    #PBS -M mzhibo@uga.edu
    #PBS -m a
    #PBS -o ${logdir}/\${timestamp}_\${jname}_chimeric_\${countjobs}_${groupname}.log
    #PBS -j oe
    #PBS -N Chmr_\${countjobs}${groupname}
    #PBS -W depend=afterok:\${jID_3}
    #PBS -v name=\${name}
    #PBS -v ext=\${ext}

    date +"%Y-%m-%d %H:%M:%S"
    export LC_ALL=C    
    # call chimeric_blacklist.awk to deal with chimeric reads; sorted file is sorted by read name at this point
    awk -v "fname1"=\${name}\${ext}_norm.txt -v "fname2"=\${name}\${ext}_abnorm.sam -v "fname3"=\${name}\${ext}_unmapped.sam -f ${juiceDir}/scripts/chimeric_blacklist.awk \${name}\${ext}.sam

    if [ \$? -ne 0 ]                                              
    then                                       
        echo "***! Failure during chimera handling of \${name}\${ext}"
        echo "Chimera handling of \${name}\${ext}.sam failed."
        exit 1                                                     
    fi  
    # if any normal reads were written, find what fragment they correspond to and store that
    if [ -e "\${name}\${ext}_norm.txt" ] && [ "$site" != "none" ] 
    then  
        ${juiceDir}/scripts/fragment.pl \${name}\${ext}_norm.txt \${name}\${ext}.frag.txt $site_file
    elif [ "$site" == "none" ]
    then
        awk '{printf("%s %s %s %d %s %s %s %d", \$1, \$2, \$3, 0, \$4, \$5, \$6, 1); for (i=7; i<=NF; i++) {printf(" %s",\$i);}printf("\n");}' \${name}\${ext}_norm.txt > \${name}\${ext}.frag.txt 
    else 
        echo "***! No \${name}\${ext}_norm.txt file created"
        echo "Creation of \${name}\${ext}_norm.txt failed. for results"
        exit 1
    fi
    if [ \$? -ne 0 ] 
    then 
        echo "***! Failure during fragment assignment of \${name}\${ext}"
        echo "Fragment assignment of \${name}\${ext}.sam failed."
        exit 1 
    fi
    # sort by chromosome, fragment, strand, and position
    sort -T $tmpdir -k2,2d -k6,6d -k4,4n -k8,8n -k1,1n -k5,5n -k3,3n \${name}\${ext}.frag.txt > \${name}\${ext}.sort.txt
    if [ \$? -ne 0 ]
    then  
        echo "***! Failure during sort of \${name}\${ext}"
        echo "Sort of \${name}\${ext}.frag.txt failed."
        exit 1                                                       
    else
        rm \${name}\${ext}_norm.txt \${name}\${ext}.frag.txt
    fi    
CHIMERIC
    wait

    jID_4=\$(qstat | grep "Chmr_\${countjobs}${groupname}" | cut -c 1-7)
    echo "chimeric \$countjobs id is \$jID_4"
    jobIDstring="\${jobIDstring}:\${jID_4}"
    echo "jobIDstring \$countjobs is \${jobIDstring}"

    # done looping over all fastq split files
    done


    # if an error occurred, we will kill the remaining jobs
    # output an error message on error detection and delete the remaining jobs
    timestamp=\$(date +"%s" | cut -c 4-10)
    qsub <<- CKALIGNFAIL
    #PBS -S /bin/bash
    #PBS -q $queue  
    #PBS -l nodes=1:ppn=1:AMD
    #PBS -l mem=2gb
    #PBS -l $walltime
    #PBS -M mzhibo@uga.edu
    #PBS -m a
    #PBS -o ${logdir}/\${timestamp}_check_alnOK_${groupname}.log
    #PBS -j oe
    #PBS -W depend=afterok\${jobIDstring}
    #PBS -N AlnOK_${groupname}

    date +"%Y-%m-%d %H:%M:%S"
    echo "Sucess: All alignment jobs were successfully finished without failure!"
CKALIGNFAIL

    timestamp=\$(date +"%s" | cut -c 4-10)
    qsub <<- CKALIGNFAILCLN
    #PBS -S /bin/bash
    #PBS -q $queue  
    #PBS -l select=1:ncpus=1:mem=4gb
    #PBS -l $walltime
    #PBS -o ${logdir}/\${timestamp}_alignfailclean_${groupname}.log
    #PBS -j oe
    #PBS -W depend=afternotok\${jobIDstring}
    #PBS -N Alncln_${groupname}

    date +"%Y-%m-%d %H:%M:%S"
    echo "Error with alignment jobs, deleting all remaining jobs of this pipeline."
    remrem=\$(qstat |grep "$groupname" |grep " Q \| H \| R " | awk 'BEGIN{FS=" "}{print \$1}')
    echo \$remrem
    RemJob=\$(qstat |grep "$groupname" |grep " Q \| H \| R " | awk 'BEGIN{FS=" "}{print \$1}'| cut -d '.' -f 1)
    echo \${RemJob}
    RemJob=\$(qstat |grep ${groupname} |grep " Q \| H \| R " | awk 'BEGIN{FS=" "}{print \$1}'| cut -c 1-7)
    qdel \$RemJob

CKALIGNFAILCLN

ALIGNWRAP
    #done fastq alignment && alignment jobs failure checking.
usageHelp="Usage: ${0} [-h] -j <juicer_tools_file_path> -i <hic_file_path> -m <bed_file_dir> -g <genome ID>"

printHelpAndExit() {
    echo "$usageHelp"
    exit $1
}

#set defaults
genomeID="9"
hic_file_path="$(pwd)/aligned/inter_30.hic"
juicer_tools_path="/opt/juicer/scripts/juicer_tools"
bed_file_dir="/opt/juicer/references/motif"

while getopts "h:g:j:i:m:" opt; do
    case $opt in
	h) printHelpAndExit 0;;
	j) juicer_tools_path=$OPTARG ;;
	i) hic_file_path=$OPTARG ;;
	m) bed_file_dir=$OPTARG ;; 
	g) genomeID=$OPTARG ;;
	[?]) printHelpAndExit 1;;
    esac
done

## Check that juicer_tools exists 
if [ ! -e "${juicer_tools_path}" ]; then
    echo "***! Can't find juicer tools in ${juicer_tools_path}";
    exit 1
fi

## Check that hic file exists    
if [ ! -e "${hic_file_path}" ]; then
    echo "***! Can't find inter.hic in ${hic_file_path}";
    exit 1
fi

echo -e "${juicer_tools_path} is post-processing Hi-C for ${genomeID}\nData read from ${hic_file_path}.\nMotifs read from ${bed_file_dir}\n"
echo -e "ARROWHEAD:\n"
${juicer_tools_path} arrowhead ${hic_file_path} ${hic_file_path%.*}"_contact_domains.txt"
if [ $? -ne 0 ]; then
    echo "***! Problem while running Arrowhead";
    exit 1
fi
echo -e "\nHiCCUPS:\n"
if hash nvcc 2>/dev/null 
then 
    ${juicer_tools_path} hiccups ${hic_file_path} ${hic_file_path%.*}"_loops.txt"
    if [ $? -ne 0 ]; then
	echo "***! Problem while running HiCCUPS";
	exit 1
    fi
else 
    echo "GPUs are not installed so HiCCUPs cannot be run";
fi

if [ -f ${hic_file_path%.*}"_loops.txt" ]
then
    echo -e "\nAPA:\n"
    ${juicer_tools_path} apa ${hic_file_path} ${hic_file_path%.*}"_loops.txt" "apa_results"
    ## Check that bed folder exists    
    if [ ! -e "${bed_file_dir}" ]; then
	echo "***! Can't find folder ${bed_file_dir}";
	echo "***! Not running motif finder";
    else
	echo -e "\nMOTIF FINDER:\n"
	${juicer_tools_path} motifs ${genomeID} ${bed_file_dir} ${hic_file_path%.*}"_loops.txt"
    fi
    echo -e "\n(-: Feature annotation successfully completed (-:"
else
  # if loop lists do not exist but Juicer Tools didn't return an error, likely 
  # too sparse
    echo -e "\n(-: Postprocessing successfully completed, maps too sparse to annotate or GPUs unavailable (-:"
fi
shopt -s extglob
juicer_version="1.5.6"

# timestamp the pipeline and report the juicer app version, and shell
date +"%Y-%m-%d %H:%M:%S"
echo "Juicer version:$juicer_version"
echo "$0 $@"

## Set the following variables to work with your system
# path additionals, make sure paths are correct for your system
#myPath=/bin:$PATH
# set global tmpdir so no problems with /var/tmp
## use cluster load commands:
#usePath=""
load_bwa="module load bwa/0.7.13"
load_java='module load Java/1.8.0_45'
#load_cluster=""
#load_coreutils=""
#load_cuda='module load cuda/7.5.18/gcc/4.4.7'

# Juicer directory, contains scripts/, references/, and restriction_sites/
# can also be set in options via -D
juiceDir="/30days/nathanielpeterbutterworth/juice/"
# default queue, can also be set in options via -q
queue="workq"
# default queue time, can also be set in options via -Q
walltime="walltime=24:00:00"
# default long queue, can also be set in options via -l
long_queue="workq"
# default long queue time, can also be set in options via -L
long_walltime="walltime=120:00:00"
# size to split fastqs. adjust to match your needs. 4000000=1M reads per split
# can also be changed via the -C flag
# give your email address to be used in #PBS -M to receive notifications when a job error occurs.
# Must either be set with an email address or left empty.
# This email is not included in the launch stats and postprocessing steps; add it manually if needed.
EMAIL='' #'#PBS -M nathaniel.butterworth@sydney.edu.au'
project='NCMAS-ch81' 
splitsize=90000000
# fastq files should look like filename_R1.fastq and filename_R2.fastq
# if your fastq files look different, change this value
read1str="_R1"
read2str="_R2"


# unique groupname for jobs submitted in each run: lab initial with a timestamp
# Length of $groupname in this PBS version needs to be no longer than 8 characters.
# The groupname length limitation is critical, because the jobID will be obtained from the qstat output using the jobName.
# The string being searched for in the qstat output is "jobstr_${groupname}", where "jobstr" is
# the job-specific name string in the qsub -N option. The max length of "jobstr" in this PBS version of the script is 7.
# Our PBS cluster has a maximum jobName column width of 16.
# Change the $groupname max length accordingly based on your PBS cluster's qstat output setup.
groupname="C$(date "+%s"|cut -c 6-7)"
## Default options, overridden by command line arguments

# top level directory, can also be set in options
topDir=$(pwd)
# restriction enzyme, can also be set in options
site="MboI"
# genome ID, default to human, can also be set in options
genomeID="hg19"
# normally both read ends are aligned with long read aligner;
# if one end is short, this is set
shortreadend=0
# description, default empty
about=""
nofrag=0

## Read arguments
usageHelp="Usage: ${0##*/} [-W group_list=genomeID] [-d topDir] [-q queue] [-l long queue] [-s site]\n                 [-a about] [-R end] [-S stage] [-p chrom.sizes path]\n                 [-y restriction site file] [-z reference genome file]\n                 [-C chunk size] [-D Juicer scripts directory]\n                 [-Q queue time limit] [-L long queue time limit] [-r] [-h] [-x]"
genomeHelp="* [genomeID] must be defined in the script, e.g. \"hg19\" or \"mm10\" (default \n  \"$genomeID\"); alternatively, it can be defined using the -z command"
dirHelp="* [topDir] is the top level directory (default\n  \"$topDir\")\n     [topDir]/fastq must contain the fastq files\n     [topDir]/splits will be created to contain the temporary split files\n     [topDir]/aligned will be created for the final alignment"
queueHelp="* [queue] is the queue for running alignments (default \"$queue\")"
longQueueHelp="* [long queue] is the queue for running longer jobs such as the hic file\n  creation (default \"$long_queue\")"
siteHelp="* [site] must be defined in the script, e.g.  \"HindIII\" or \"MboI\" \n  (default \"$site\")"
aboutHelp="* [about]: enter description of experiment, enclosed in single quotes"
shortHelp="* -r: use the short read version of the aligner, bwa aln\n  (default: long read, bwa mem)"
shortHelp2="* [end]: use the short read aligner on read end, must be one of 1 or 2 "
stageHelp="* [stage]: must be one of \"merge\", \"dedup\", \"final\", \"postproc\", or \"early\".\n    -Use \"merge\" when alignment has finished but the merged_sort file has not\n     yet been created.\n    -Use \"dedup\" when the files have been merged into merged_sort but\n     merged_nodups has not yet been created.\n    -Use \"final\" when the reads have been deduped into merged_nodups but the\n     final stats and hic files have not yet been created.\n    -Use \"postproc\" when the hic files have been created and only\n     postprocessing feature annotation remains to be completed.\n    -Use \"early\" for an early exit, before the final creation of the stats and\n     hic files"
pathHelp="* [chrom.sizes path]: enter path for chrom.sizes file"
siteFileHelp="* [restriction site file]: enter path for restriction site file (locations of\n  restriction sites in genome; can be generated with the script\n  misc/generate_site_positions.py)"
chunkHelp="* [chunk size]: number of lines in split files, must be multiple of 4\n  (default ${splitsize}, which equals $(awk -v ss=${splitsize} 'BEGIN{print ss/4000000}') million reads)"
scriptDirHelp="* [Juicer scripts directory]: set the Juicer directory,\n  which should have scripts/ references/ and restriction_sites/ underneath it\n  (default ${juiceDir})"
refSeqHelp="* [reference genome file]: enter path for reference sequence file, BWA index\n  files must be in same directory"
queueTimeHelp="* [queue time limit]: time limit for queue, i.e. -l 12:00 is 12 hours\n  (default ${walltime})"
longQueueTimeHelp="* [long queue time limit]: time limit for long queue, i.e. -l 168:00 is one week\n  (default ${long_walltime})"
excludeHelp="* -x: exclude fragment-delimited maps from hic file creation"
helpHelp="* -h: print this help and exit"

printHelpAndExit() {
    echo -e "$usageHelp"
    echo -e "$genomeHelp"
    echo -e "$dirHelp"
    echo -e "$queueHelp"
    echo -e "$longQueueHelp"
    echo -e "$siteHelp"
    echo -e "$aboutHelp"
    echo -e "$shortHelp"
    echo -e "$shortHelp2"
    echo -e "$stageHelp"
    echo -e "$pathHelp"
    echo -e "$siteFileHelp"
    echo -e "$refSeqHelp"
    echo -e "$chunkHelp"
    echo -e "$scriptDirHelp"
    echo -e "$queueTimeHelp"
    echo -e "$longQueueTimeHelp"
    echo "$excludeHelp"
    echo "$helpHelp"
    exit "$1"
}

while getopts "d:g:R:k:a:hrq:s:p:l:y:z:S:C:D:Q:L:x" opt; do
    case $opt in
    g) genomeID=$OPTARG ;;
    h) printHelpAndExit 0;;
    d) topDir=$OPTARG ;;
    l) long_queue=$OPTARG ;;
    q) queue=$OPTARG ;;
    s) site=$OPTARG ;;
    R) shortreadend=$OPTARG ;;
    r) shortread=1 ;;  #use short read aligner
    a) about=$OPTARG ;;
    p) genomePath=$OPTARG ;;
    y) site_file=$OPTARG ;;
    z) refSeq=$OPTARG ;;
    S) stage=$OPTARG ;;
    C) splitsize=$OPTARG ;;
    D) juiceDir=$OPTARG ;;
    Q) walltime=$OPTARG ;;
    L) long_walltime=$OPTARG ;;
    x) nofrag=1 ;; #no fragment maps
    [?]) printHelpAndExit 1;;
    esac
done

if [ ! -z "$stage" ]
then
    case $stage in
        merge) merge=1 ;;
        dedup) dedup=1 ;;
        early) earlyexit=1 ;;
        final) final=1 ;;
        postproc) postproc=1 ;;
        *)  echo "$usageHelp"
        echo "$stageHelp"
        exit 1
    esac
fi

## Set reference sequence based on genome ID
if [ -z "$refSeq" ]
then
    case $genomeID in
    mm9) refSeq="${juiceDir}/references/Mus_musculus_assembly9_norandom.fasta";;
    mm10) refSeq="${juiceDir}/references/Mus_musculus_assembly10.fasta";;
    hg38) refSeq="${juiceDir}/references/hg38.fa";;
    hg19) refSeq="${juiceDir}/references/Homo_sapiens_assembly19.fasta";;
    *)  echo "$usageHelp"
        echo "$genomeHelp"
        exit 1
    esac
else
    # Reference sequence passed in, so genomePath must be set for the .hic file
    # to be properly created
    if [ -z "$genomePath" ]
        then
        echo "***! You must define a chrom.sizes file via the \"-p\" flag that delineates the lengths of the chromosomes in the genome at $refSeq";
        exit 1;
    fi
fi

## Check that refSeq exists
if [ ! -e "$refSeq" ]; then
    echo "***! Reference sequence $refSeq does not exist";
    exit 1;
fi

## Check that index for refSeq exists
if [ ! -e "${refSeq}.bwt" ]; then
    echo "***! Reference sequence $refSeq does not appear to have been indexed. Please run bwa index on this file before running juicer.";
    exit 1;
fi

## Set ligation junction based on restriction enzyme
case $site in
    HindIII) ligation="AAGCTAGCTT";;
    DpnII) ligation="GATCGATC";;
    MboI) ligation="GATCGATC";;
    NcoI) ligation="CCATGCATGG";;
    none) ligation="XXXX";;
    *)  ligation="XXXX"
    echo "$site not listed as recognized enzyme. Using $site_file as site file"
    echo "Ligation junction is undefined"
    exit 1
esac

## If DNAse-type experiment, no fragment maps
if [ "$site" == "none" ]
then
    nofrag=1;
fi

## If short read end is set, make sure it is 1 or 2
case $shortreadend in
    0) ;;
    1) ;;
    2) ;;
    *)    echo "$usageHelp"
    echo "$shortHelp2"
    exit 1
esac

if [ -z "$site_file" ]
then
    site_file="${juiceDir}/restriction_sites/${genomeID}_${site}.txt"
fi

## Check that site file exists, needed for fragment number for merged_nodups
if [ ! -e "$site_file" ] && [ "$nofrag" -ne 1 ]
then
    echo "***! $site_file does not exist. It must be created before running this script."
    exit 1
fi

## Directories to be created and regex strings for listing files
splitdir=${topDir}"/splits"
donesplitdir=$topDir"/done_splits"
fastqdir=${topDir}"/fastq/*_R*.fastq*"
outputdir=${topDir}"/aligned"
tmpdir=${topDir}"/HIC_tmp"
logdir=${topDir}"/logs"

echo "dirs:"$fastqdir

## Check that fastq directory exists and has proper fastq files
if [ ! -d "${topDir}/fastq" ]; then
    echo "Directory \"${topDir}/fastq\" does not exist."
    echo "Create \"$topDir/fastq\" and put fastq files to be aligned there."
    echo "Type \"juicer.sh -h\" for help"
    exit 1
else
    if stat -t ${fastqdir} >/dev/null 2>&1
    then
    echo "(-: Looking for fastq files...fastq files exist"
    else
    if [ ! -d "$splitdir" ]; then
        echo "***! Failed to find any files matching ${fastqdir}"
        echo "***! Type \"juicer.sh -h \" for help"
        exit 1
    fi
    fi
fi

## Create output directory, only if not in merge, dedup, final, or postproc stages
if [[ -d "$outputdir" && -z "$final" && -z "$merge" && -z "$dedup" && -z "$postproc" ]]
then
    echo "***! Move or remove directory \"$outputdir\" before proceeding."
    echo "***! Type \"juicer.sh -h \" for help"
    exit 1
else
    if [[ -z "$final" && -z "$dedup" && -z "$merge" && -z "$postproc" ]]; then
        mkdir "$outputdir" || { echo "***! Unable to create ${outputdir}, check permissions." ; exit 1; }
    fi
fi

## Create split directory
if [ -d "$splitdir" ]; then
    splitdirexists=1
else
    mkdir "$splitdir" || { echo "***! Unable to create ${splitdir}, check permissions." ; exit 1; }
fi

## Create temporary directory, used for sort later
if [ ! -d "$tmpdir" ] && [ -z "$final" ] && [ -z "$dedup" ] && [ -z "$postproc" ]; then
    mkdir "$tmpdir"
    chmod 777 "$tmpdir"
fi

## Create a log directory, used to store log files (standout and standerr of each submitted jobs)
if [ ! -d "$logdir" ] && [ -z "$final" ] && [ -z "$dedup" ] && [ -z "$postproc" ]; then
    mkdir "$logdir"
    chmod 777 "$tmpdir"
fi

## Arguments have been checked and directories created. Now begins the real work of the pipeline
#source $usePath
#$load_cluster

## jobIDstring holds the jobIDs as they are submitted
countjobs=0
jobIDstring=""
#check the total fastq input files size, determine the threads number
#threads value could be customized according to user's job and cluster resources
fastqsize=$(ls -lL ${fastqdir} | awk '{sum+=$5}END{print sum}')
threads=2
if [ "$fastqsize" -gt "2592410750" ]
then
    threads=8
fi

testname=$(ls -l ${fastqdir} | awk 'NR==1{print $9}')
if [ "${testname: -3}" == ".gz" ]
then
    skipsplit=1
    read1=${splitdir}"/*${read1str}*.fastq.gz"
else
    read1=${splitdir}"/*${read1str}*.fastq"
fi



## Not in merge, dedup, final, or postproc stage, i.e. need to split and align files.
if [[ -z "$final" && -z "$merge" && -z "$dedup" && -z "$postproc" ]]
then
    echo -e "(-: Aligning files matching $fastqdir\n in queue $queue to genome $genomeID with site file $site_file"

    ## Split fastq files into smaller portions for parallelizing alignment
    ## Do this by creating a text script file for the job on STDIN and then
    ## sending it to the cluster
    if [ ! $splitdirexists ]
    then
        echo "(-: Created $splitdir and $outputdir."
        if [ -n "$threads" ] && [ -z "$skipsplit" ]
        then
            echo " Splitting files"
            jIDs_spit=""
            for i in ${fastqdir}
            do
                filename=$(basename "$i")
                filename=${filename%.*}

                #wait till the previous qsub job has finished successfully;
                #the submitted job might get delayed due to time in the queue.
                timestamp=$(date +"%s" | cut -c 4-10)
                jID_split=$(qsub <<SPLITEND
                #PBS -S /bin/bash
                #PBS -q $queue
                #PBS -l $walltime
                #PBS -l select=1:ncpus=1:mem=20gb
                #PBS -A $project
                #PBS -m a
                #PBS -o ${logdir}/${timestamp}_split_${filename}_${groupname}.log
                #PBS -j oe
                #PBS -N split_${filename}_${groupname}

                date +"%Y-%m-%d %H:%M:%S"
                echo "Split file: $filename"
                #Following line is used in coreutils >= 8.16, if not, simpler version is used below.
                #split -a 3 -l $splitsize -d --additional-suffix=.fastq ${i} $splitdir/$filename
                split -a 3 -l $splitsize -d ${i} ${splitdir}/${filename}
SPLITEND
)
                jIDs_split="${jIDs_split}:${jID_split}"
           done

            ## wait till the fastq splitting job has successfully finished
            ## Once splitting succeeds, rename the splits as fastq files
            ## PBS users change queue below to $queue
            timestamp=$(date +"%s" | cut -c 4-10)
            jID_splitmv=$(qsub << SPLITMV
            #PBS -S /bin/bash
            #PBS -q $queue
            #PBS -l $walltime
            #PBS -l select=1:ncpus=1:mem=20gb    
            #PBS -A $project
            #PBS -m a
            #PBS -o ${logdir}/${timestamp}_move_${groupname}.log
            #PBS -j oe
            #PBS -N move_${groupname}
            #PBS -W depend=afterok${jIDs_split}
            date +"%Y-%m-%d %H:%M:%S"
            for j in ${splitdir}/*;do mv \${j} \${j}.fastq;done
SPLITMV
)

        else # we're not splitting but just copying
            cp -rs ${fastqdir} ${splitdir}
            wait
        fi
    else # split dir already exists, no need to re-split fastqs
        echo -e "---  Using already created files in $splitdir\n"
    fi

    ## Launch job. Once split/move is done, set the parameters for the launch.

    echo "(-: Starting job to launch other jobs once splitting is complete"

    # Loop over all read1 fastq files and create jobs for aligning read1,
    # aligning read2, and merging the two. Keep track of merge names for final
    # merge. When merge jobs successfully finish, can launch final merge job.

    if [ -n "$threads" ] && [ -z "$skipsplit" ]
    then
        waitstring_alnwrp="#PBS -W depend=afterok:${jID_splitmv}"
    fi
    echo "waitstring_alnwrp is ${waitstring_alnwrp}"

    timestamp=$(date +"%s" | cut -c 4-10)
    qsub <<ALIGNWRAP
    #PBS -S /bin/bash
    #PBS -q $queue
    #PBS -l $walltime
    #PBS -l select=1:ncpus=1:mem=6gb
    #PBS -o ${logdir}/${timestamp}_alnwrap_${groupname}.log
    #PBS -j oe
    #PBS -N AlnWrp${groupname}
    #PBS -A $project
    $waitstring_alnwrp
    #PBS -m a
    for i in ${read1}
    do

        countjobs=\$(( countjobs + 1 ))
        ext=\${i#*$read1str}
        name=\${i%$read1str*}
        # these names have to be right or it'll break
        name1=\${name}${read1str}
        name2=\${name}${read2str}
        jname=\$(basename "\$name")\${ext}
        usegzip=0
        if [ "\${ext: -3}" == ".gz" ]
        then
            usegzip=1
        fi

        ## count ligations
        timestamp=\$(date +"%s" | cut -c 4-10)
        qsub <<-CNTLIG
        #PBS -S /bin/bash
        #PBS -q $queue
        #PBS -l $walltime
        #PBS -l select=1:ncpus=1:mem=4gb
        #PBS -A $project
        #PBS -m a
        #PBS -o ${logdir}/\${timestamp}_\${jname}_CntLig_\${countjobs}_${groupname}.log
        #PBS -j oe
        #PBS -N CtLig\${countjobs}${groupname}
        #PBS -v name=\${name}
        #PBS -v name1=\${name1}
        #PBS -v name2=\${name2}
        #PBS -v ext=\${ext}
        #PBS -v ligation=${ligation}
        #PBS -v usegzip=\${usegzip}

        date +"%Y-%m-%d %H:%M:%S"
        export usegzip=\${usegzip}
        export name=\${name}
        export name1=\${name1}
        export name2=\${name2}
        export ext=\${ext}
        export ligation=${ligation}
        ${juiceDir}/scripts/countligations.sh
CNTLIG
        jID_cntlig=\$(qstat | grep "CtLig\${countjobs}${groupname}" | cut -d '.' -f 1 )
        echo "jID_cntlig \${countjobs} id is \${jID_cntlig}"
        ## Align read1
        # align read1 fastq
        # allocate memory for alignment according to threads number
        # set-up the max memory according to the cluster set-up
        # this local cluster uses shared mem per node when in resource quest
        alloc_mem_value=\$(( threads * 5))
        echo "allocmem:"\$alloc_mem_value
        alloc_mem_value=100
        alloc_mem=\${alloc_mem_value}"gb"
        #set the max mem cap accordingly to ensure the job resource request meet the cluster-setup
        if [ \$alloc_mem_value -gt 240 ]
        then
            alloc_mem=240gb
        fi

        echo "starting read1 alignemnt"
	echo "threads:"\$threads "mem:"\$allocmem
        timestamp=\$(date +"%s" | cut -c 4-10)
        qsub <<ALGNR1
        #PBS -S /bin/bash
        #PBS -q $queue
        #PBS -l $walltime
        # #PBS -l select=1:ncpus=\${threads}:mem=\${alloc_mem}
        #PBS -l select=1:ncpus=8:mem=80gb
        #PBS -A $project
        #PBS -m a
        #PBS -o ${logdir}/\${timestamp}_\${jname}_align1_\${countjobs}_${groupname}.log
        #PBS -j oe
        #PBS -N ALN1\${countjobs}${groupname}
        #PBS -W depend=afterok:\$jID_cntlig
        #PBS -v name=\${name}
        #PBS -v name1=\${name1}
        #PBS -v ext=\${ext}

        date +"%Y-%m-%d %H:%M:%S"
        $load_bwa
        if [ -n "$shortread" ] || [ "$shortreadend" -eq 1 ]
        then
            echo 'Running command bwa aln -q 15 $refSeq \${name1}\${ext} > \${name1}\${ext}.sai && bwa samse $refSeq \${name1}\${ext}.sai \${name1}\${ext} > \${name1}\${ext}.sam'
            bwa aln -q 15 $refSeq \${name1}\${ext} > \${name1}\${ext}.sai && bwa samse $refSeq \${name1}\${ext}.sai \${name1}\${ext} > \${name1}\${ext}.sam
            if [ \$? -ne 0 ]
            then
                echo "Alignment of \${name1}\${ext} failed. Check ${topDir}/pbs.out for results"
                exit 1
            else
                echo " Short align of \${name1}\${ext}.sam done successfully"
            fi
        else
            echo 'Running command bwa mem \$threads $refSeq \${name1}\${ext} > \${name1}\${ext}.sam '
            bwa mem -t 8 $refSeq \${name1}\${ext} > \${name1}\${ext}.sam
            if [ \$? -ne 0 ]
            then
                echo "***!Error, failed to align \${name1}\${ext}"
                exit 1
            else
                echo "Mem align of \${name1}\${ext}.sam done successfully"
            fi
        fi
        echo "below is the number of lines in align1 .sam file"
        cat \${name1}\${ext}.sam |wc -l
ALGNR1
        wait
        # Get the jobID from qstat output by searching for the job-specific string "ALN1\${countjobs}${groupname}" in the job Name.
        jID_1=\$( qstat | grep "ALN1\${countjobs}${groupname}" |cut -d '.' -f 1 )
        echo "align1 \$countjobs id is: \${jID_1}"
        # Align read2
        echo "starting read2 alignment"
        # align read2 fastq
        timestamp=\$(date +"%s" | cut -c 4-10)
        qsub <<ALGNR2
        #PBS -S /bin/bash
        #PBS -q $queue
        #PBS -l $walltime
        # #PBS -l select=1:ncpus=${threads}:mem=\$alloc_mem
        #PBS -l select=1:ncpus=8:mem=80gb
        #PBS -A $project 
        #PBS -m a
        #PBS -o ${logdir}/\${timestamp}_\${jname}_align2_\${countjobs}_${groupname}.log
        #PBS -j oe
        #PBS -N ALN2\${countjobs}${groupname}
        #PBS -W depend=afterok:\$jID_cntlig
        #PBS -v name=\${name}
        #PBS -v name2=\${name2}
        #PBS -v ext=\${ext}

        date +"%Y-%m-%d %H:%M:%S"
        $load_bwa
        if [ -n "$shortread" ] || [ "$shortreadend" -eq 2 ]
        then
            echo 'Running command bwa aln -q 15 $refSeq \${name2}\${ext} > \${name2}\${ext}.sai && bwa samse $refSeq \${name2}\${ext}.sai \${name2}\${ext} > \${name2}\${ext}.sam '
            bwa aln -q 15 $refSeq \${name2}\${ext} > \${name2}\${ext}.sai && bwa samse $refSeq \${name2}\${ext}.sai \${name2}\${ext} > \${name2}\${ext}.sam
            if [ \$? -ne 0 ]
            then
                echo "Alignment of \${name2}\${ext} failed. Check ${topDir}/pbs.out for results"
                exit 1
            else
                echo "Short align of \${name2}\${ext}.sam done successfully"
            fi

        else
            echo 'Running command bwa mem -t 8 $refSeq \${name2}\${ext} > \${name2}\${ext}.sam'
            bwa mem -t 8 $refSeq \${name2}\${ext} > \${name2}\${ext}.sam
            if [ \$? -ne 0 ]
            then
                exit 1
            else
                echo "Mem align of \${name2}\${ext}.sam done successfully"
            fi
        fi
        echo "below is the number of lines in aligned .sam file before sorting"
        cat \${name2}\${ext}.sam |wc -l
ALGNR2
        wait

        # Get the jobID of the job above from the qstat output by matching the job name "ALN2\${countjobs}${groupname}".
        jID_2=\$(qstat | grep "ALN2\${countjobs}${groupname}" |cut -d '.' -f 1 )
        echo "align2 \${countjobs} id is: \${jID_2}"
        echo "starting merging from read1 and read2"
        # wait for the align1 and align2 jobs to finish, then merge
        timestamp=\$(date +"%s" | cut -c 4-10)
        qsub <<- MRGALL
        #PBS -S /bin/bash
        #PBS -q $queue
        #PBS -l $long_walltime
        #PBS -l select=1:ncpus=1:mem=24gb
        #PBS -A $project
        #PBS -m a
        #PBS -o ${logdir}/\${timestamp}_\${jname}_merge_\${countjobs}_${groupname}.log
        #PBS -j oe
        #PBS -N Mrg\${countjobs}${groupname}
        #PBS -W depend=afterok:\${jID_1}:\${jID_2}
        #PBS -v name=\${name}
        #PBS -v name1=\${name1}
        #PBS -v name2=\${name2}
        #PBS -v ext=\${ext}
        #PBS -v countjobs=\${countjobs}

        date +"%Y-%m-%d %H:%M:%S"
        export LC_ALL=C

        # sort read 1 aligned file by readname
        sort -T $tmpdir -k1,1f \${name1}\${ext}.sam > \${name1}\${ext}_sort.sam
        if [ \$? -ne 0 ]
        then
            echo "***! Error while sorting \${name1}\${ext}.sam"
            echo "Sort of \${name1}\${ext}.sam failed."
            exit 1
        else
            echo "Sort read 1 aligned file by readname completed."
        fi
        echo "below is the number of lines in sorted read1 .sam files"
        cat \${name1}\${ext}_sort.sam | wc -l

        # sort read 2 aligned file by readname
        sort -T $tmpdir -k1,1f \${name2}\${ext}.sam > \${name2}\${ext}_sort.sam
        if [ \$? -ne 0 ]
        then
            echo "***! Error while sorting \${name2}\${ext}.sam"
            echo "Sort of \${name2}\${ext}.sam failed."
            exit 1
        else
            echo "Sort read 2 aligned file by readname completed."
        fi
        echo "below is the number of lines in sorted read2 .sam files"
        cat \${name2}\${ext}_sort.sam | wc -l

        # remove header, add read end indicator to read name
        awk -f ${juiceDir}/scripts/read1_sortproc.awk \${name1}\${ext}_sort.sam > \${name1}\${ext}_sort1.sam
        awk -f ${juiceDir}/scripts/read2_sortproc.awk \${name2}\${ext}_sort.sam > \${name2}\${ext}_sort1.sam
        echo "below is the number of lines in \${name1}\${ext}_sort1.sam"
        cat \${name1}\${ext}_sort1.sam | wc -l
        echo "below is the number of lines in \${name2}\${ext}_sort1.sam"
        cat \${name2}\${ext}_sort1.sam | wc -l

        # merge the two sorted read end files
        sort -T $tmpdir -k1,1f -m \${name1}\${ext}_sort1.sam \${name2}\${ext}_sort1.sam > \${name}\${ext}.sam
        if [ \$? -ne 0 ]
        then
            echo "***! Failure during merge of read files"
            echo "Merge of \${name}\${ext}.sam failed"
            exit 1
        else
            rm \${name1}\${ext}.sa* \${name2}\${ext}.sa* \${name1}\${ext}_sort*.sam \${name2}\${ext}_sort*.sam
            echo "\${name}\$\{ext}.sam created successfully."
        fi
        echo "below is the number of lines in \${name}\${ext}.sam after merging sorted aln1 and aln2"
        cat \${name}\${ext}.sam | wc -l

MRGALL
    wait

    jID_3=\$(qstat | grep "Mrg\${countjobs}${groupname}" | cut -d '.' -f 1 )
    echo "merging align1 and align2 \${coutjobs} id is \${jID_3}"
    echo "starting chimeric step after alignment"
    timestamp=\$(date +"%s" | cut -c 4-10)
    qsub <<- CHIMERIC
    #PBS -S /bin/bash
    #PBS -q $queue
    #PBS -l $walltime
    #PBS -l select=1:ncpus=1:mem=24gb
    #PBS -A $project
    #PBS -m a
    #PBS -o ${logdir}/\${timestamp}_\${jname}_chimeric_\${countjobs}_${groupname}.log
    #PBS -j oe
    #PBS -N Chmr\${countjobs}${groupname}
    #PBS -W depend=afterok:\${jID_3}
    #PBS -v name=\${name}
    #PBS -v ext=\${ext}

    date +"%Y-%m-%d %H:%M:%S"
    export LC_ALL=C
    # call chimeric_blacklist.awk to deal with chimeric reads; sorted file is sorted by read name at this point
    awk -v fname1="\${name}\${ext}_norm.txt" -v fname2="\${name}\${ext}_abnorm.sam" -v fname3="\${name}\${ext}_unmapped.sam" -f ${juiceDir}/scripts/chimeric_blacklist.awk \${name}\${ext}.sam

    if [ \$? -ne 0 ]
    then
        echo "***! Failure during chimera handling of \${name}\${ext}"
        echo "Chimera handling of \${name}\${ext}.sam failed."
        exit 1
    fi
    # if any normal reads were written, find what fragment they correspond to and store that
    if [ -e "\${name}\${ext}_norm.txt" ] && [ "$site" != "none" ]
    then
        ${juiceDir}/scripts/fragment.pl \${name}\${ext}_norm.txt \${name}\${ext}.frag.txt $site_file
    elif [ "$site" == "none" ]
    then
        awk -f ${juiceDir}/scripts/chimeric_nonsites.awk \${name}\${ext}_norm.txt > \${name}\${ext}.frag.txt
    else
        echo "***! No \${name}\${ext}_norm.txt file created"
        echo "Creation of \${name}\${ext}_norm.txt failed."
        exit 1
    fi
    if [ \$? -ne 0 ]
    then
        echo "***! Failure during fragment assignment of \${name}\${ext}"
        echo "Fragment assignment of \${name}\${ext}.sam failed."
        exit 1
    fi
    # sort by chromosome, fragment, strand, and position
    echo "Sorting..."
    echo \${name}\${ext}
    sort -T $tmpdir -k2,2d -k6,6d -k4,4n -k8,8n -k1,1n -k5,5n -k3,3n \${name}\${ext}.frag.txt > \${name}\${ext}.sort.txt
    if [ \$? -ne 0 ]
    then
        echo "***! Failure during sort of \${name}\${ext}"
        echo "Sort of \${name}\${ext}.frag.txt failed."
        exit 1
    else
        echo "removing temperary files \${name}\${ext}_norm.txt \${name}\${ext}.frag.txt"
        rm \${name}\${ext}_norm.txt \${name}\${ext}.frag.txt
    fi
CHIMERIC
    wait

    jID_4=\$(qstat | grep "Chmr\${countjobs}${groupname}" | cut -d '.' -f 1)
    echo "chimeric \$countjobs id is \$jID_4"
    exitstatus=\$(qstat -f \${jID_4} |grep "exit_status" )
    echo "the exit status of \${jID_4} is \${exitstatus}"
    jobIDstring="\${jobIDstring}:\${jID_4}"
    echo "jobIDstring \$countjobs is \${jobIDstring}"

    # done looping over all fastq split files
    done


    # if any alignment job failed, report the error and kill the remaining jobs
    timestamp=\$(date +"%s" | cut -c 4-10)
    qsub <<- CKALIGNFAIL
    #PBS -S /bin/bash
    #PBS -q $queue  
    #PBS -l select=1:ncpus=1:mem=2gb   
    #PBS -l $walltime
    #PBS -A $project
    #PBS -m a
    #PBS -o ${logdir}/\${timestamp}_check_alnOK_${groupname}.log
    #PBS -j oe
    #PBS -W depend=afterok\${jobIDstring}
    #PBS -N AlnOK_${groupname}

    date +"%Y-%m-%d %H:%M:%S"
    echo "Sucess: All alignment jobs were successfully finished without failure!"
CKALIGNFAIL

    timestamp=\$(date +"%s" | cut -c 4-10)
    qsub <<- CKALIGNFAILCLN
    #PBS -S /bin/bash
    #PBS -q $queue  
    #PBS -l select=1:ncpus=1:mem=4gb    
    #PBS -l $walltime
    #PBS -A $project
    #PBS -o ${logdir}/\${timestamp}_alignfailclean_${groupname}.log
    #PBS -j oe
    #PBS -W depend=afternotok\${jobIDstring}
    #PBS -N Alncln${groupname}

    date +"%Y-%m-%d %H:%M:%S"
    echo "Error with alignment jobs, deleting all remaining jobs of this pipeline."
    echo "next line.."
    RemJob=\$(qstat |grep "$groupname" |grep " Q \| H \| R " | awk 'BEGIN{FS=" "}{print $1}'| cut -d '.' -f 1)    
    echo \${RemJob}
    echo "trying qdel with no slash"
    qdel ${RemJob}
    echo "trying qdel with slash"
    qdel \${RemJob}

CKALIGNFAILCLN

ALIGNWRAP
    # done fastq alignment and alignment-failure checking

fi
jID_alignwrap=$( qstat | grep AlnWrp${groupname} | cut -d '.' -f 1 )
if [ -z $merge ]
then
    waitstring_mrgsrtwrp="#PBS -W depend=afterok:${jID_alignwrap}"
fi
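# (when restarting at the merge stage, $merge is set and the dependency on the
# alignment wrapper is deliberately left unset)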
echo "below is the jID_alignwrap jobid"
echo ${jID_alignwrap}
echo ${waitstring_mrgsrtwrp}
if [ -z $final ] && [ -z $dedup ] && [ -z $postproc ]
then
    ## merge the sorted files into one giant file that is also sorted.
    ## change queue below to $long_queue
    timestamp=$(date +"%s" | cut -c 4-10)
    qsub <<MRGSRTWRAP
    #PBS -S /bin/bash
    #PBS -q $queue  
    #PBS -l select=1:ncpus=1:mem=24gb
    #PBS -l $walltime
    #PBS -A $project
    #PBS -m a
    #PBS -o ${logdir}/${timestamp}_mergesortwrap_${groupname}.log
    #PBS -j oe
    #PBS -N MStWrp${groupname}
    ${waitstring_mrgsrtwrp}
    date +"%Y-%m-%d %H:%M:%S"
    echo "all alignment done, all splitting and alignment jobs succeeded!" 
    jID_alnOK=\$( qstat | grep AlnOK_${groupname} | cut -d '.' -f 1 )
    echo "jID_aln-OK job id is \$jID_alnOK "
    timestamp=\$(date +"%s" | cut -c 4-10)
    if [ -z $merge ]
    then
        waitstring_alnOK="#PBS -W depend=afterok:\${jID_alnOK}"
    fi
    echo "waitstring_anlOK is \${waitstring_alnOK}"
    echo "below without backslash"
    echo ${waitstring_alnOK}
    echo "below with backslash"
    echo \${waitstring_alnOK}
    qsub <<MRGSRT
        #PBS -S /bin/bash
        #PBS -q $queue  
        #PBS -l select=1:ncpus=1:mem=24gb
        #PBS -l $walltime
        #PBS -A $project
        #PBS -m a
        #PBS -o ${logdir}/\${timestamp}_fragmerge_${groupname}.log
        #PBS -j oe
        #PBS -N frgmrg${groupname}
        \${waitstring_alnOK}
        date +"%Y-%m-%d %H:%M:%S"
        export LC_ALL=C
        if [ -d $donesplitdir ]
        then
            mv $donesplitdir/* $splitdir/.
        fi
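        # "sort -m" merge-sorts the per-split files, which the chimeric step
        # already sorted on these same chr/frag/strand/position keys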
        if ! sort -T $tmpdir -m -k2,2d -k6,6d -k4,4n -k8,8n -k1,1n -k5,5n -k3,3n $splitdir/*.sort.txt  > $outputdir/merged_sort.txt
        then
            echo "***! Some problems occurred somewhere in creating  sorted align files."
        else
            echo "Finished sorting all sorted files into a single merge."
            rm -r ${tmpdir}
        fi
MRGSRT
        jID_mrgsrt=\$( qstat | grep frgmrg${groupname} | cut -d '.' -f 1 )

        ##kill all remaining jobs if previous mergesort step exited with error
        timestamp=\$(date +"%s" | cut -c 4-10)
        qsub <<MRGSRTFAILCK
        #PBS -S /bin/bash
        #PBS -q $queue  
        #PBS -l select=1:ncpus=1:mem=2gb
        #PBS -l $walltime
        #PBS -A $project
        #PBS -m a
        #PBS -o ${logdir}/\${timestamp}_clean1_${groupname}.log
        #PBS -j oe
        #PBS -N clean1${groupname}
        #PBS -W depend=afternotok:\${jID_mrgsrt}

        date +"%Y-%m-%d %H:%M:%S"
        echo "Error with merging sorted files job, ${jID_mrgsort}, deleting all remaining jobs of this pipeline."
        RemJob1=\$(qstat |grep "$groupname" |grep " Q \| H \| R " | awk 'BEGIN{FS=" "}{print $1}'| cut -d '.' -f 1)
        qdel \${RemJob1}
MRGSRTFAILCK
MRGSRTWRAP
fi
wait
jID_mrgsrtwrap=$( qstat| grep MStWrp${groupname} | cut -d '.' -f 1 )
if [ -z $dedup ]
then
    waitstring_RDpWrp="#PBS -W depend=afterok:${jID_mrgsrtwrap}"
fi
echo "waitstring_RDpWrp is below:"
echo ${waitstring_RDpWrp}
wait
if [ -z $final ] && [ -z $postproc ]
then
    ##remove duplicates from the big sorted file if merge sorted job exited successfully
    timestamp=$(date +"%s" | cut -c 4-10)
    qsub <<RMDUPWRAP
    #PBS -S /bin/bash
    #PBS -q $queue  
    #PBS -l select=1:ncpus=1:mem=4gb
    #PBS -l $walltime
    #PBS -A $project
    #PBS -m a
    #PBS -o ${logdir}/${timestamp}_rmdupwrap_${groupname}.log
    #PBS -j oe
    #PBS -N RDpWrp${groupname}
    ${waitstring_RDpWrp}
    date +"%Y-%m-%d %H:%M:%S"
    echo ${waitstring_RDpWrp}
    jID_mrgsrt=\$( qstat | grep frgmrg${groupname} | cut -d '.' -f 1 )
    if [ -z $dedup ]
    then
        waitstring_osplit="#PBS -W depend=afterok:\${jID_mrgsrt}"
    fi
    echo "jID_mrgsrt jobid is \$jID_mrgsrt "
    echo "waitstring_osplit is:\${waitstring_osplit}"
    timestamp=\$(date +"%s" | cut -c 4-10)
    qsub <<RMDUPLICATE
        #PBS -S /bin/bash
        #PBS -q $queue  
        #PBS -l select=1:ncpus=1:mem=4gb
        #PBS -l $walltime
        #PBS -A $project
        #PBS -m a
        #PBS -o ${logdir}/\${timestamp}_osplit_${groupname}.log
        #PBS -j oe
        #PBS -N osplit${groupname}
        #PBS -v timestamp=\${timestamp}
        \${waitstring_osplit}
        date +"%Y-%m-%d %H:%M:%S"
        echo "Sucess: All mergefragments jobs were successfully finished!"
        echo "now starts to remove duplicates from the big sorted file"
        awk -v project=${project} -v queue=${long_queue} -v outfile=${logdir}/\${timestamp}_awksplit_rmdups -v juicedir=${juiceDir} -v dir=$outputdir -v groupname=$groupname -v walltime=$long_walltime -f ${juiceDir}/scripts/split_rmdups.awk $outputdir/merged_sort.txt
RMDUPLICATE

RMDUPWRAP
fi

jID_rmdupwrap=$( qstat | grep RDpWrp${groupname} | cut -d '.' -f 1 )
echo "jID_rmdupwrap ID: $jID_rmdupwrap"
wait
if [ -z "$genomePath" ]
then
    #If no path to genome is given, use genome ID as default.
    genomePath=$genomeID
fi

# if early exit, we stop here, once the merged_nodups.txt file is created.
if [ -z "$earlyexit" ]
then
    waitstring0="#PBS -W depend=afterok:${jID_rmdupwrap}"

    #Skip if post-processing only is required
    if [ -z $postproc ]
    then
        if [ -z $final ]
        then
            echo "final not set, superwrap1 job depend=afterok:jID_rmdupwrap"
        else
            waitstring0=""
        fi
        echo "waitstring0 is: $waitstring0"
        timestamp=$(date +"%s" | cut -c 4-10)
		qsub <<SUPERWRAP1
        #PBS -S /bin/bash
        #PBS -q $queue
        #PBS -l select=1:ncpus=1:mem=1gb
        #PBS -l $walltime
        #PBS -A $project
        #PBS -m a
        #PBS -o ${logdir}/${timestamp}_superwrap1_${groupname}.log
        #PBS -j oe
        #PBS -N SpWrp1${groupname}
        ${waitstring0}
        timestamp=$(date +"%s" | cut -c 4-10)
        echo "start submitting the lauch job!"
        export groupname=$groupname
        export juiceDir=$juiceDir
        export load_java="${load_java}"
        export about=$about
        export site_file=$site_file
        export ligation=$ligation
        export logdir=${logdir}
        export outputdir=$outputdir
        export splitdir=$splitdir
        export nofrag=$nofrag
        export genomePath=$genomePath
        export final=$final
        export queue=$queue
        export walltime=$walltime
        export long_walltime=$long_walltime
        export nofrag=$nofrag
        export project=$project
        ${juiceDir}/scripts/launch_stats.sh
SUPERWRAP1
    fi
    jID_superwrap1="$( qstat | grep SpWrp1${groupname} | cut -d '.' -f 1 )"
    wait
    if [ -z $postproc ]
    then
        waitstring6="#PBS -W depend=afterany:${jID_superwrap1}"
    fi
    wait
    echo "waitstring6 is : ${waitstring6}"

    timestamp=$(date +"%s" | cut -c 4-10)
    qsub <<SUPERWRAP2
    #PBS -S /bin/bash
    #PBS -q $queue
    #PBS -l $walltime
    #PBS -l select=1:ncpus=1:mem=4gb   
    #PBS -o ${logdir}/${timestamp}_super_wrap2_${groupname}.log
    #PBS -j oe
    #PBS -N SpWrp2${groupname}
    #PBS -A $project
    #PBS -m a
    ${waitstring6}
    wait
    export groupname=$groupname
    export load_java="${load_java}"
    export load_cuda="${load_cuda}"
    export juiceDir=$juiceDir
    export genomeID=$genomeID
    export outputdir=$outputdir
    export logdir=${logdir}
    export splitdir=$splitdir
    export queue=$queue
    export walltime=$walltime
    export long_walltime=$long_walltime
    export postproc=$postproc
    export project=$project
    jID_launch=\$(qstat | grep Lnch_${groupname} | cut -d '.' -f 1)
    echo \$jID_launch
    echo "waitstring3 is :\${waitstring3} : probably blank here, executing postprocessing.sh"
    ${juiceDir}/scripts/postprocessing.sh
    echo "done postprocessing.sh"
SUPERWRAP2



## After removing duplicates,if early exit is set,we directly go to the final check step.
else
    echo "earlyexit is set, stat,hic, and postprocess were not done."
    timestamp=$(date +"%s" | cut -c 4-10)
    qsub <<FINCK2
    #PBS -S /bin/bash
    #PBS -q $queue
    #PBS -l $walltime
    #PBS -l select=1:ncpus=1:mem=1gb
    #PBS -o ${logdir}/${timestamp}_prep_done_${groupname}.out
    #PBS -j oe
    #PBS -A $project
    #PBS -m a
    #PBS -N prepd_${groupname}
    #PBS -W depend=afterok:${jID_rmdupwrap}
    date +"%Y-%m-%d %H:%M:%S"

    jID_osplit=\$( qstat | grep osplit${groupname} | cut -d '.' -f 1 )
    jID_rmsplit=\$( qstat | grep RmSplt${groupname} | cut -d '.' -f 1)        
    wait
    timestamp=\$(date +"%s" | cut -c 4-10)
    qsub <<PREPDONE
        #PBS -S /bin/bash
        #PBS -q $queue
        #PBS -l $walltime
        #PBS -l select=1:ncpus=1:mem=1gb
        #PBS -o ${logdir}/\${timestamp}_done_${groupname}.log
        #PBS -j oe
        #PBS -N ${groupname}_done
        #PBS -A $project
        #PBS -m a
        #PBS -W depend=afterok:\${jID_osplit}:\${jID_rmsplit}

        date +"%Y-%m-%d %H:%M:%S"
        export splitdir=${splitdir}
        export outputdir=${outputdir}
        ${juiceDir}/scripts/check.sh
PREPDONE
FINCK2

fi
echo "Finished adding all jobs... please wait while processing."
qstat
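A note on the escaping above: the whole alignment section is itself the body of a qsub heredoc (closed by ALIGNWRAP), so every variable sits at one of two escape levels. Unescaped $vars are filled in by juicer.sh when it submits the wrapper; \$vars survive that first expansion and are filled in when the wrapper job itself runs and submits the inner jobs. A minimal sketch of the same two-level pattern, with hypothetical names:

stage="demo"                  # expanded by the submitting shell
qsub <<WRAP
#PBS -N wrap_${stage}
inner_input=\$(mktemp)        # expanded when wrap_demo runs
qsub <<INNER
#PBS -N inner_${stage}
echo "reading \${inner_input}"  # expanded when wrap_demo submits INNER
INNER
WRAP

The snippets that follow are the helper scripts the pipeline calls: launch_stats.sh, the mega-map script, the postprocessing wrappers, the split-relaunch helper, and statistics.pl.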
echo "check the variables:"
echo "${juiceDir}/scripts/launch_stats.sh"
echo "genomename: $groupname"
echo "juiceDir: $juiceDir"
echo "about: $about"
echo "site_file: $site_file"
echo "logdir: ${logdir}"
echo "ligation: $ligation"
echo "outputdir: $outputdir"
echo "splitdir: $splitdir"
echo "nofrag: $nofrag"
echo "final: $final"
echo "genomePath: $genomePath"
echo "queue $queue"
echo "resolutions: $resolutions"
echo "walltime: $walltime"
echo "long_walltime: $long_walltime"
echo "project: $project"
echo $load_java
jID_osplit=$( qstat | grep osplit${groupname} | cut -d '.' -f 1 )

if [ -z $final ]
then
    waitstring2="#PBS -W depend=afterok:${jID_osplit}"
fi

echo "jID_osplit: $jID_osplit"
echo "waitstring2 is"
echo $waitstring2
timestamp=$(date +"%s" | cut -c 4-10)
qsub <<DOSTAT
#PBS -S /bin/bash
#PBS -q $queue
#PBS -l select=1:ncpus=1:mem=1gb
#PBS -l $walltime
#PBS -o ${logdir}/${timestamp}_launch_${groupname}.log
#PBS -j oe
#PBS -N Lnch_${groupname}
#PBS -A $project
${waitstring2}

date +"%Y-%m-%d %H:%M:%S"
echo "Alignment and merge done, launching the stats job"
#jID_osplit=\$( qstat | grep osplit${groupname} | cut -d '.' -f 1 )
jID_rmsplit=\$( qstat | grep RmSplt${groupname} | cut -d '.' -f 1)

if [ -z $final ]
then
    waitstring22="#PBS -W depend=afterok:\${jID_rmsplit}"
fi
echo "waitstring22 is below"
echo ${waitstring22}
echo "jID_rmsplit value is: \${jID_rmsplit}"
echo "this is the value of waitstring22: \${waitstring22}"
timestamp=\$(date +"%s" | cut -c 4-10)
echo "start sbumitting stats job"
qsub <<STATS0
	#PBS -S /bin/bash
	#PBS -q $queue
	#PBS -l select=1:ncpus=1:mem=20gb
	#PBS -l $walltime
	\${waitstring22}
	#PBS -o ${logdir}/\${timestamp}_stats0_${groupname}.log
	#PBS -j oe
	#PBS -N stats0${groupname}
	#PBS -A $project

	date +"%Y-%m-%d %H:%M:%S"
	$load_java
	export _JAVA_OPTIONS=-Xmx16384m; 
	export LC_ALL=en_US.UTF-8
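	# -Xmx16384m caps the JVM heap at 16 GB, inside this job's 20gb request;
	# the awk printf below pads the copied header line to a fixed 1000-character width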

	tail -n1 $headfile | awk '{printf"%-1000s\n", \\\$0}' > $outputdir/inter.txt 
	${juiceDir}/scripts/statistics.pl -s $site_file -l $ligation -o $outputdir/stats_dups.txt $outputdir/dups.txt
	cat $splitdir/*.res.txt | awk -f ${juiceDir}/scripts/stats_sub.awk >> $outputdir/inter.txt
	java -cp ${juiceDir}/scripts/ LibraryComplexity $outputdir inter.txt >> $outputdir/inter.txt
	${juiceDir}/scripts/statistics.pl -s $site_file -l $ligation -o $outputdir/inter.txt -q 1 $outputdir/merged_nodups.txt

STATS0

qsub <<STATS30
	#PBS -S /bin/bash
	#PBS -q $queue
	#PBS -l select=1:ncpus=1:mem=20gb 
	#PBS -l $walltime
	#PBS -o ${logdir}/\${timestamp}_stats_${groupname}.log
	#PBS -j oe
	#PBS -N stats30${groupname}
	#PBS -A $project
	\${waitstring22}

	date +"%Y-%m-%d %H:%M:%S"
	$load_java
	export _JAVA_OPTIONS=-Xmx16384m; 
	export LC_ALL=en_US.UTF-8
	echo -e 'Experiment description: $about' > $outputdir/inter_30.txt; 
	tail -n1 $headfile | awk '{printf"%-1000s\n", \\\$0}' >> $outputdir/inter_30.txt
	cat $splitdir/*.res.txt | awk -f ${juiceDir}/scripts/stats_sub.awk >> $outputdir/inter_30.txt
	java -cp ${juiceDir}/scripts/ LibraryComplexity $outputdir inter_30.txt >> $outputdir/inter_30.txt
	${juiceDir}/scripts/statistics.pl -s $site_file -l $ligation -o $outputdir/inter_30.txt -q 30 $outputdir/merged_nodups.txt

STATS30


qsub <<- ABNORMAL
	#PBS -S /bin/bash
	#PBS -q $queue
	#PBS -l select=1:ncpus=1:mem=60gb 
	#PBS -l $long_walltime
	#PBS -o ${logdir}/\${timestamp}_abnormal_${groupname}.log
	#PBS -j oe
	#PBS -N abnorm_${groupname}
	#PBS -A $project
	\$waitstring22

	cat $splitdir/*_abnorm.sam > $outputdir/abnormal.sam
	cat $splitdir/*_unmapped.sam > $outputdir/unmapped.sam
	awk -f ${juiceDir}/scripts/collisions.awk $outputdir/abnormal.sam > $outputdir/collisions.txt
ABNORMAL

jID_stats0=\$( qstat | grep stats0${groupname} | cut -d '.' -f 1)

wait
echo "this is the value of jID_stats: \${jID_stats}"
timestamp=\$(date +"%s" | cut -c 4-10)
echo "start submitting hic job"
qsub <<- HICWORK
	#PBS -S /bin/bash
	#PBS -q $queue
	#PBS -l select=1:ncpus=2:mem=60gb
	#PBS -l $long_walltime
	#PBS -o ${logdir}/\${timestamp}_hic0_${groupname}.log
	#PBS -j oe
	#PBS -N hic0_${groupname}
	#PBS -W depend=afterok:\${jID_stats0}
	#PBS -A $project

	date +"%Y-%m-%d %H:%M:%S"
	echo "finished stats job,now launching the hic job."    
	${load_java}
	export _JAVA_OPTIONS=-Xmx16384m
	if [ \"$nofrag\" -eq 1 ]
	then
		echo "running juicer_tools 1st..." 
		${juiceDir}/scripts/juicer_tools pre -s $outputdir/inter.txt -g $outputdir/inter_hists.m -q 1 $outputdir/merged_nodups.txt $outputdir/inter.hic $genomePath
	else 
		echo "running juicer_tools 2nd..."
		${juiceDir}/scripts/juicer_tools pre -f $site_file -s $outputdir/inter.txt -g $outputdir/inter_hists.m -q 1 $outputdir/merged_nodups.txt $outputdir/inter.hic $genomePath
	fi
HICWORK
jID_stats30=\$( qstat | grep stats30${groupname} | cut -d '.' -f 1)
timestamp=\$(date +"%s" | cut -c 4-10)
echo "start submitting hic30 job"
qsub <<- HIC30WORK
	#PBS -S /bin/bash
	#PBS -q $queue
	#PBS -l select=1:ncpus=2:mem=60gb
	#PBS -l $long_walltime
	#PBS -o ${logdir}/\${timestamp}_hic30_${groupname}.log
	#PBS -j oe
	#PBS -N hic30_${groupname}
	#PBS -W depend=afterok:\${jID_stats30}
	#PBS -A $project

	date +"%Y-%m-%d %H:%M:%S"
	$load_java
	export _JAVA_OPTIONS=-Xmx16384m
	export LC_ALL=en_US.UTF-8
	echo "running hic30"
	echo $nofrag
	echo -e 'Experiment description: $about' > $outputdir/inter_30.txt
	cat $splitdir/*.res.txt | awk -f ${juiceDir}/scripts/stats_sub.awk >> $outputdir/inter_30.txt
	java -cp ${juiceDir}/scripts/ LibraryComplexity $outputdir inter_30.txt >> $outputdir/inter_30.txt
	${juiceDir}/scripts/statistics.pl -s $site_file -l $ligation -o $outputdir/inter_30.txt -q 30 $outputdir/merged_nodups.txt
	if [  \"$nofrag\" -eq 1 ]
	then
		echo "running hic30 1st..." 
	${juiceDir}/scripts/juicer_tools pre -s $outputdir/inter_30.txt -g $outputdir/inter_30_hists.m -q 30 $outputdir/merged_nodups.txt $outputdir/inter_30.hic $genomePath
	else 
		echo "running hic30 2nd..."
		${juiceDir}/scripts/juicer_tools pre -f $site_file -s $outputdir/inter_30.txt -g $outputdir/inter_30_hists.m -q 30 $outputdir/merged_nodups.txt $outputdir/inter_30.hic $genomePath
	fi
HIC30WORK

DOSTAT
juicer_version="1.5" 
## Set the following variables to work with your system

## use cluster load commands:
#usePath=""
load_java="module load java/jdk1.8.0_131"
load_cuda="module load cuda/7.5.18/gcc/4.4.7"
# Juicer directory, contains scripts/ and restriction_sites/
juiceDir="/lustre1/mzhibo/hic/apps/juicer"
# default queue and time
#queue="batch"
#walltime="walltime=12:00:00"

# unique name for jobs in this run
groupname="C$(date +"%s"|cut -c 6-11)"


## Default options, overridden by command line arguments

# top level directory, can also be set in options
topDir=$(pwd)
# restriction enzyme, can also be set in options
site="MboI"
# genome ID, default to human, can also be set in options
genomeID="hg19"

## Read arguments
usageHelp="Usage: ${0##*/} -g genomeID [-d topDir] [-s site] [-r resolutions] [-hx]"
genomeHelp="   genomeID must be defined in the script, e.g. \"hg19\" or \"mm10\" (default \"$genomeID\")"
dirHelp="   [topDir] is the top level directory (default \"$topDir\") and must contain links to all merged_nodups files underneath it"
siteHelp="   [site] must be defined in the script, e.g.  \"HindIII\" or \"MboI\" (default \"$site\"); alternatively, this can be the restriction site file"
resolutionsHelp="   [resolutions] is a comma-delimited list of resolutions, such as 10000,5000,1000,5f (default is 2.5M,1M,500K,250K,100K,50K,25K,10K,5K in base pair and 500f,250f,100f,50f,25f,10f,5f,2f,1f)"
excludeHelp="   -x: exclude fragment-delimited maps from Hi-C mega map (will run much faster)"
helpHelp="   -h: print this help and exit"

printHelpAndExit() {
    echo "$usageHelp"
    echo "$genomeHelp"
    echo "$dirHelp"
    echo "$siteHelp"
    echo "$resolutionsHelp"
    echo "$excludeHelp"
    echo "$helpHelp"
    exit "$1"
}

while getopts "d:g:r:h:x:s" opt; do
    case $opt in
	g) genomeID=$OPTARG ;;
	h) printHelpAndExit 0;;
	d) topDir=$OPTARG ;;
	s) site=$OPTARG ;;
	x) exclude=1 ;;
	r) resolutions=$OPTARG ;;
	[?]) printHelpAndExit 1;;
    esac
done
echo $site
## Set ligation junction based on restriction enzyme
case $site in
    HindIII) ligation="AAGCTAGCTT";;
    DpnII) ligation="GATCGATC";;
    MboI) ligation="GATCGATC";;
    none) ligation="XXXX";;
    *)  ligation="XXXX"
      site_file=$site
      echo "$site not listed as recognized enzyme, so trying it as site file."
      echo "Ligation junction is undefined";;
esac
echo $ligation
if [ -z "$site_file" ]
then
    site_file="${juiceDir}/restriction_sites/${genomeID}_${site}.txt"
fi
echo $site_file
## Check that site file exists, needed for fragment number for merged_nodups
if [ ! -e "$site_file" ] && [ "$site" != "none" ]
then
    echo "***! $site_file does not exist. It must be created before running this script."
    echo "The site file is used for statistics even if fragment delimited maps are excluded"
    exit 100
fi

## Directories to be created and regex strings for listing files
megadir=${topDir}"/mega"
outputdir=${megadir}"/aligned"
tmpdir=${megadir}"/HIC_tmp"
# set global tmpdir so no problems with /var/tmp
export TMPDIR=${tmpdir}
#output messages
logdir=${megadir}"/debug"

## Check for existing merge_nodups files:

merged_count=`find -L ${topDir} | grep merged_nodups.txt | wc -l`
if [ "$merged_count" -lt "1" ]
then
	echo "***! Failed to find at least one merged_nodups files under ${topDir}"
	exit 100
fi

merged_names1=$(find -L ${topDir} | grep merged_nodups.txt)
merged_names=$(echo $merged_names1 | tr '\n' ' ')
inter_names=$(find -L ${topDir} | grep inter.txt | tr '\n' ' ')

if [[ $merged_names == *".txt.gz"* ]]
then
    gzipped=1
    echo "***! Mega map of gzipped files not yet supported, please unzip before running."
    exit 100
    # we would need to unzip here (currently unreachable: the exit above fires first)
    for i in $merged_names1
    do
	if [[ $i != *".txt.gz"* ]]
	then
	    echo "***! Mixture of gzipped and unzipped merged_nodups files"
	    echo "Ensure that the merged_nodups are all either unzipped or gzipped then rerun"
	    echo "Files: $merged_names"
	    exit 100
	fi
    done
fi
## Create output directory, exit if already exists
if [[ -d "${outputdir}" ]] 
then
    echo "***! Move or remove directory \"${outputdir}\" before proceeding."
    exit 101			
else
    mkdir -p ${outputdir}
fi

## Create temporary directory
if [ ! -d "$tmpdir" ]; then
    mkdir $tmpdir
    chmod 777 $tmpdir
fi

## Create log directory
if [ ! -d "$logdir" ]; then
    mkdir $logdir
    chmod 777 $logdir
fi

if [ -n "$resolutions" ]; then
    resolutions="-r $resolutions"
fi

## Arguments have been checked and directories created. Now begins
## the real work of the pipeline
qsub -o ${logdir}/header.log -j oe -q batch -N ${groupname}_cmd <<-EOF
  date
  echo "Juicer version:$juicer_version"
  echo "$0 $@"
EOF

jid1=$(qsub -o ${logdir}/topstats.log -j oe -N ${groupname}_Tstats -l mem=20gb -l walltime=24:00:00 -l nodes=1:ppn=1:AMD -q batch <<-TOPSTATS
export LC_ALL=C
if ! awk -f ${juiceDir}/scripts/makemega_addstats.awk ${inter_names} > ${outputdir}/inter.txt
then  
    echo "***! Some problems occurred somewhere in creating top stats files."
    exit 100
else
    cp ${outputdir}/inter.txt ${outputdir}/inter_30.txt
fi
TOPSTATS
)
jobIDstr=${jid1}
# Merge all merged_nodups.txt files found under current dir
jid2=$(qsub -o ${logdir}/merge.log -j oe -q batch -N ${groupname}_merge -l mem=20gb -l walltime=24:00:00 -l nodes=1:ppn=1:AMD <<- MRGSRT
if ! sort -T ${tmpdir} -m -k2,2d -k6,6d ${merged_names} > ${outputdir}/merged_nodups.txt
then 
echo "***! Some problems occurred somewhere in merging sorted merged_nodups files."
    exit 100
else
echo "Finished sorting all merged_nodups files into a single merge."
  rm -r ${tmpdir}
fi
MRGSRT
)
jobIDstr="${jobIDstr}:${jid2}"

# Create statistics files for MQ > 0
jid3=$(qsub -o ${logdir}/inter0.log -j oe -q batch -N ${groupname}_inter0 -l mem=20gb -l walltime=24:00:00 -l nodes=1:ppn=1:AMD -W depend=afterok:${jid1}:${jid2} <<- INTER0
${juiceDir}/scripts/statistics.pl -q 1 -o${outputdir}/inter.txt -s $site_file -l $ligation ${outputdir}/merged_nodups.txt
INTER0
)
jobIDstr="${jobIDstr}:${jid3}"

# Create statistics files for MQ > 30
jid4=$(qsub -o ${logdir}/inter30.log -j oe -q batch -N ${groupname}_inter30 -l mem=20gb -l walltime=24:00:00 -l nodes=1:ppn=1:AMD -W depend=afterok:${jid1}:${jid2}  <<- INTER30
${juiceDir}/scripts/statistics.pl -q 30 -o${outputdir}/inter_30.txt -s $site_file -l $ligation ${outputdir}/merged_nodups.txt 
INTER30
)
jobIDstr="${jobIDstr}:${jid4}"

# Create HIC maps file for MQ > 0
jid5=$(qsub -o ${logdir}/hic0_${groupname}.log -j oe -q batch -M ${EMAIL} -m ae -N ${groupname}_hic0 -l mem=40gb -l walltime=168:00:00 -l nodes=1:ppn=1:AMD -W depend=afterok:${jid4} <<- HIC0
$load_java
if [ -z "$exclude" ]
then
  echo "Launching ${juiceDir}/scripts/juicer_tools pre ${resolutions} -f ${site_file} -s ${outputdir}/inter.txt -g ${outputdir}/inter_hists.m -q 1 ${outputdir}/merged_nodups.txt ${outputdir}/inter.hic ${genomeID}"
  ${juiceDir}/scripts/juicer_tools pre ${resolutions} -f ${site_file} -s ${outputdir}/inter.txt -g ${outputdir}/inter_hists.m -q 1 ${outputdir}/merged_nodups.txt ${outputdir}/inter.hic ${genomeID}
else
  echo "Launching ${juiceDir}/scripts/juicer_tools pre ${resolutions} -s ${outputdir}/inter.txt -g ${outputdir}/inter_hists.m -q 1 ${outputdir}/merged_nodups.txt ${outputdir}/inter.hic ${genomeID}"
  ${juiceDir}/scripts/juicer_tools pre ${resolutions} -s ${outputdir}/inter.txt -g ${outputdir}/inter_hists.m -q 1 ${outputdir}/merged_nodups.txt ${outputdir}/inter.hic ${genomeID}
fi
HIC0
)
jobIDstr="${jobIDstr}:${jid5}"
# Create HIC maps file for MQ > 30
jid6=$(qsub -o ${logdir}/hic30_${groupname}.log -j oe -q batch -M ${EMAIL} -m ae -N ${groupname}_hic30 -l mem=60gb -l walltime=168:00:00 -l nodes=1:ppn=1:AMD -W depend=afterok:${jid4} <<- HIC30
$load_java
if [ -z "${exclude}" ]
then
    echo "Launching ${juiceDir}/scripts/juicer_tools pre ${resolutions} -f ${site_file} -s ${outputdir}/inter_30.txt -g ${outputdir}/inter_30_hists.m -q 30 ${outputdir}/merged_nodups.txt ${outputdir}/inter_30.hic ${genomeID}"
    ${juiceDir}/scripts/juicer_tools pre ${resolutions} -f ${site_file} -s ${outputdir}/inter_30.txt -g ${outputdir}/inter_30_hists.m -q 30 ${outputdir}/merged_nodups.txt ${outputdir}/inter_30.hic ${genomeID}
else
   echo "Launching ${juiceDir}/scripts/juicer_tools pre ${resolutions} -s ${outputdir}/inter_30.txt -g ${outputdir}/inter_30_hists.m -q 30 ${outputdir}/merged_nodups.txt ${outputdir}/inter_30.hic ${genomeID}"
   ${juiceDir}/scripts/juicer_tools pre ${resolutions} -s ${outputdir}/inter_30.txt -g ${outputdir}/inter_30_hists.m -q 30 ${outputdir}/merged_nodups.txt ${outputdir}/inter_30.hic ${genomeID}
fi
HIC30
)
jobIDstr="${jobIDstr}:${jid6}"

# Create loop and domain lists file for MQ > 30
jid7=$(qsub  -o ${logdir}/hiccups.log -j oe -q batch -N ${groupname}hiccups -l mem=60gb -l walltime=100:00:00 -l nodes=1:ppn=1:gpus=1:GPU -W depend=afterok:${jid6} <<- HICCUPS
$load_java
$load_cuda
echo $PBS_GPUFILE
export _JAVA_OPTIONS=-Xmx16384m;
export LC_ALL=C
${juiceDir}/scripts/juicer_hiccups.sh -j ${juiceDir}/scripts/juicer_tools -i ${outputdir}/inter_30.hic -m ${juiceDir}/references/motif -g ${genomeID}
HICCUPS

)

jobIDstr="${jobIDstr}:${jid7}"

jid8=$(qsub -o ${logdir}/arrowhead.log -j oe -q batch -N ${groupname}_ArwHead -l mem=60gb -l walltime=100:00:00 -l nodes=1:ppn=1:gpus=1:GPU -W depend=afterok:${jid6} <<- ARROWHEAD
$load_java
$load_cuda
echo $PBS_GPUFILE
export _JAVA_OPTIONS=-Xmx16384m;
export LC_ALL=C
${juiceDir}/scripts/juicer_arrowhead.sh -j ${juiceDir}/scripts/juicer_tools -i ${outputdir}/inter_30.hic
ARROWHEAD
)

qsub -o ${logdir}/done.log -j oe -q batch -N ${groupname}_done -W depend=afterok:${jobIDstr} <<- FINAL
echo "All jobs finished processing!"
FINAL
qsub -o ${logdir}/done.out -j oe -q batch -N ${groupname}_fail -W depend=afternotok:${jobIDstr} <<- FINAL
echo "Error occored in placing the jobs. Please check err file of each step to find out"
FINAL
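Unlike the wrapper jobs earlier, this mega script chains its steps by capturing each qsub's stdout as the job ID and feeding it to -W depend. A minimal sketch of that pattern, with hypothetical job names and bodies:

jidA=$(qsub -N stepA <<-STEPA
	echo "step A"
STEPA
)
# afterok releases step B only if step A exits 0
jidB=$(qsub -N stepB -W depend=afterok:${jidA} <<-STEPB
	echo "step B, released by job ${jidA}"
STEPB
)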
jID_launch=$( qstat | grep Lnch_${groupname} | cut -d '.' -f 1 )
if [ -z $postproc ]
then
    waitstring3="#PBS -W depend=afterok:${jID_launch}"
fi
echo "waitstring3 is: ${waitstring3}"
echo $project
timestamp=$(date +"%s" | cut -c 4-10)
echo "running postprocwrap"
qsub <<- POSTPROCWRAP
#PBS -S /bin/bash
#PBS -q $queue
#PBS -A $project
#PBS -l $walltime
#PBS -l select=1:ncpus=1:mem=4gb
#PBS -o ${logdir}/${timestamp}_postproc_wrap_${groupname}.log
#PBS -j oe
#PBS -N PPrWrp${groupname}
${waitstring3}

date +"%Y-%m-%d %H:%M:%S"

jID_hic30=\$(qstat | grep hic30_${groupname} |cut -d '.' -f 1)
echo \$jID_hic30
echo "try this..."
echo \$(qstat | grep hic30_${groupname} )
echo "--------"

if [ -z $postproc ]
then
    waitstring4="#PBS -W depend=afterok:\${jID_hic30}" 
fi
echo "waitstring4 is : \${waitstring4}"

timestamp=\$(date +"%s" | cut -c 4-10)
echo "running postprocess"
qsub <<POSTPROCESS
    #PBS -S /bin/bash
    #PBS -q $queue
    #PBS -l $long_walltime
    #PBS -l select=1:ncpus=1:mem=60gb
    #PBS -o ${logdir}/\${timestamp}_postproc_${groupname}.log
    #PBS -j oe
    #PBS -N PProc_${groupname}
    #PBS -A $project
    \$waitstring4
	#This formerly had an ngpus=1
    date +"%Y-%m-%d %H:%M:%S"
    $load_java
    $load_cuda
    module list 
    export _JAVA_OPTIONS=-Xmx16384m
    export LC_ALL=en_US.UTF-8

    ${juiceDir}/scripts/juicer_postprocessing.sh -j ${juiceDir}/scripts/juicer_tools -i ${outputdir}/inter_30.hic -m ${juiceDir}/references/motif -g $genomeID
echo "done postprocess"
POSTPROCESS
echo "done postprocwrap"
POSTPROCWRAP

jID_postprocwrap=$( qstat |grep PPrWrp${groupname} | cut -d '.' -f 1 )
echo $jID_postprocwrap
wait
timestamp=$(date +"%s" | cut -c 4-10)
echo "running finck"
qsub <<- FINCK
#PBS -S /bin/bash
#PBS -q $queue  
#PBS -l $walltime
#PBS -o ${logdir}/${timestamp}_prep_done_${groupname}.log
#PBS -j oe
#PBS -N Pdone_${groupname}
#PBS -W depend=afterok:${jID_postprocwrap}
#PBS -A $project
#PBS -l select=1:ncpus=1:mem=50gb

date +"%Y-%m-%d %H:%M:%S"    
jID_hic30=\$(qstat | grep hic30_${groupname} |cut -d '.' -f 1)
jID_stats0=\$(qstat | grep stats0${groupname} |cut -d '.' -f 1)
jID_stats30=\$(qstat | grep stats30${groupname} |cut -d '.' -f 1)
jID_hic=\$(qstat | grep hic0_${groupname} |cut -d '.' -f 1)
jID_postproc=\$(qstat | grep PProc_${groupname} |cut -d '.' -f 1)

waitstring5="#PBS -W depend=afterok:\${jID_postproc}"
if [ -z $postproc ]
then
    waitstring5="#PBS -W depend=afterok:\${jID_hic30}:\${jID_stats0}:\${jID_stats30}:\${jID_hic}:\${jID_postproc}"
fi
timestamp=\$(date +"%s" | cut -c 4-10)
echo "running done"
qsub <<DONE
    #PBS -S /bin/bash
    #PBS -q $queue
    #PBS -l $walltime
    #PBS -l select=1:ncpus=1:mem=4gb
    #PBS -o ${logdir}/\${timestamp}_done_${groupname}.log
    #PBS -j oe
    #PBS -N done_${groupname}
    #PBS -A $project
    \${waitstring5}

    date +"%Y-%m-%d %H:%M:%S"    
    export splitdir=${splitdir}
    export outputdir=${outputdir}
    ${juiceDir}/scripts/check.sh
echo "done done"
DONE
echo "done finck"
FINCK
ls -l splits > ls_splits
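# the first awk below pairs each R1/R2 fastq under a common key; the second
# emits mv/rm commands into mv_me.sh for any pair whose *_norm.txt.res.txt is
# missing, zero, or internally inconsistent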

awk '($9 ~/_R1/ && $9 ~/fastq$/) || ($9 ~/_R1/ && $9 ~/gz$/) {split($9, a, "_R1");  print a[1]a[2], $9}($9 ~/_R2/ && $9 ~/fastq$/) || ($9 ~/_R2/ && $9 ~/gz$/){split($9,a,"_R2");print a[1]a[2], $9}' ls_splits > fastq.txt
awk '{name="splits/"$1"_norm.txt.res.txt"; if ((getline line < name) > 0){ split(line, a); if (length(a)<6 || a[2] != a[3]+a[4]+a[5]+a[6] || a[2] == 0){print "mv splits/"$2, "not_done; rm -f splits/"$1"*; rm -f splits/"$2"*;"} close(name);}else {print "mv splits/"$2, "not_done; rm -f splits/"$1"*; rm -f splits/"$2"*;"}}' fastq.txt > mv_me.sh
mkdir not_done
chmod 755 mv_me.sh
./mv_me.sh
if [ "$(ls -A not_done)" ]
    then
        if [ -d done_splits ]
        then
            mv splits/* done_splits/.
            rmdir splits
        else
            mv splits done_splits
        fi
        mv not_done splits
else
    rmdir not_done
    echo "Fastqs are aligned, run juicer.sh with -S merge flag";
fi
rm fastq.txt ls_splits mv_me.sh
use File::Basename;
use POSIX;
use List::Util qw[min max];
use Getopt::Std;
use vars qw/ $opt_s $opt_l $opt_d $opt_o $opt_q $opt_h /;

# Check arguments
getopts('s:l:o:q:h');

my $site_file = "/opt/juicer/restriction_sites/hg19_DpnII2.txt";
my $ligation_junction = "GATCGATC";
my $stats_file = "stats.txt";
my $mapq_threshold = 1;

if ($opt_h) {
    print "Usage: statistics.pl -s[site file] -l[ligation] -o[stats file] -q[mapq threshold] <infile>\n";
    print " <infile>: file in intermediate format to calculate statistics on, can be stream\n";
    print " [site file]: list of HindIII restriction sites, one line per chromosome (default $site_file)\n";
    print " [ligation]: ligation junction (default $ligation_junction)\n";
    print " [stats file]: output file containing total reads, for library complexity (default $stats_file)\n";
    print " [mapq threshold]: mapping quality threshold, do not consider reads < threshold (default $mapq_threshold)\n";
    exit;
}
if ($opt_s) {
  $site_file = $opt_s;
}
if ($opt_l) {
  $ligation_junction = $opt_l;
}
if ($opt_o) {
  $stats_file = $opt_o;
}
if ($opt_q) {
  $mapq_threshold = $opt_q;
}
if (scalar(@ARGV)==0) {
  print STDOUT "No input file specified, reading from input stream\n";
}

my $dangling_junction = substr $ligation_junction, length($ligation_junction)/2;
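# the dangling-end motif is the second half of the ligation junction
# (e.g. GATCGATC -> GATC)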

# Global variables for calculating statistics
my %chromosomes;
my %hindIII;
my %mapQ;
my %mapQ_inter;
my %mapQ_intra;
my %innerM;
my %outerM;
my %rightM;
my %leftM;
my $three_prime_end=0;
my $five_prime_end=0;
my $total = 0;
my $dangling = 0;
my $ligation = 0;
my $inner = 0;
my $outer = 0;
my $left = 0;
my $right = 0;
my $inter = 0;
my $intra = 0;
my $small = 0;
my $large = 0;
my $very_small = 0;
my $very_small_dangling = 0;
my $small_dangling = 0;
my $large_dangling = 0;
my $inter_dangling = 0;
my $true_dangling_intra_small = 0;
my $true_dangling_intra_large = 0;
my $true_dangling_inter = 0;
my $total_current = 0;
my $under_mapq = 0;
my $intra_fragment = 0;
my $unique = 0;
# logspace bins
my @bins = (10,12,15,19,23,28,35,43,53,66,81,100,123,152,187,231,285,351,433,534,658,811,1000,1233,1520,1874,2310,2848,3511,4329,5337,6579,8111,10000,12328,15199,18738,23101,28480,35112,43288,53367,65793,81113,100000,123285,151991,187382,231013,284804,351119,432876,533670,657933,811131,1000000,1232847,1519911,1873817,2310130,2848036,3511192,4328761,5336699,6579332,8111308,10000000,12328467,15199111,18738174,23101297,28480359,35111917,43287613,53366992,65793322,81113083,100000000,123284674,151991108,187381742,231012970,284803587,351119173,432876128,533669923,657933225,811130831,1000000000,1232846739,1519911083,1873817423,2310129700,2848035868,3511191734,4328761281,5336699231,6579332247,8111308308,10000000000);

if (index($site_file, "none") != -1) {
   #no restriction enzyme, no need for RE distance
 }
else {
  # read in restriction site file and store as multidimensional array
  open FILE, $site_file or die $!;
  while (<FILE>) {
    my @locs = split;
    my $key = shift(@locs);
    my $ref = \@locs;
    $chromosomes{$key} = $ref;
  }
  close(FILE);
}
# read in infile and calculate statistics
#open FILE, $infile or die $!;
while (<>) {
  $unique++;
	my @record = split;
	my $num_records = scalar(@record);
  # don't count as Hi-C contact if fails mapq or intra fragment test
  my $countme = 1;

  if (($record[1] eq $record[5]) && $record[3] == $record[7]) {
    $intra_fragment++;
    $countme = 0;
  }
	elsif ($num_records > 8) {
		my $mapq_val = min($record[8],$record[11]);
    if ($mapq_val < $mapq_threshold) {
      $under_mapq++;
      $countme = 0;
    }
  }


  if ($countme) {
    $total_current++;
    # position distance
    my $pos_dist = abs($record[2] - $record[6]);

    my $hist_dist = &bsearch($pos_dist,\@bins);	 

    my $is_dangling = 0;
    # one part of read pair has unligated end
    if ($num_records > 8 && ($record[10] =~ m/^$dangling_junction/ || $record[13] =~ m/^$dangling_junction/)) {
      $dangling++;
      $is_dangling=1;
    }
    # look at chromosomes
    if ($record[1] eq $record[5]) {
      $intra++;
      # determine right/left/inner/outer ordering of chromosomes/strands
      if ($record[0] == $record[4]) {
        if ($record[0] == 0) {
          if ($pos_dist >= 20000) {
            $right++;
          }
          $rightM{$hist_dist}++;
        }
        else {
          if ($pos_dist >= 20000) {
            $left++;
          }
          $leftM{$hist_dist}++;
        }
      }
      else {
        if ($record[0] == 0) {
          if ($record[2] < $record[6]) {
            if ($pos_dist >= 20000) {
              $inner++;
            }
            $innerM{$hist_dist}++;
          }
          else {
            if ($pos_dist >= 20000) {
              $outer++;
            }
            $outerM{$hist_dist}++;
          }
        }
        else {
          if ($record[2] < $record[6]) {
            if ($pos_dist >= 20000) {
              $outer++;
            }
            $outerM{$hist_dist}++;
          }
          else {
            if ($pos_dist >= 20000) {
              $inner++;
            }
            $innerM{$hist_dist}++;
          }
        }
      }
      # intra reads less than 20KB apart
      if ($pos_dist < 10) {
        $very_small++;
        if ($is_dangling) {
          $very_small_dangling++;
        }
      }
      elsif ($pos_dist < 20000) {
        $small++;
        if ($is_dangling) {
          $small_dangling++;
        }
      }
      else {
        $large++;
        if ($is_dangling) {
          $large_dangling++;
        }
      }
    }
    else {
      $inter++;
      if ($is_dangling) {
        $inter_dangling++;
      }
    }
    if ($num_records > 8) {
      my $mapq_val = min($record[8],$record[11]);
      if ($mapq_val <= 200) {
        $mapQ{$mapq_val}++;
        if ($record[1] eq $record[5]) {
          $mapQ_intra{$mapq_val}++;
        }
        else {
          $mapQ_inter{$mapq_val}++;
        }
      }
      # read pair contains ligation junction
      if ($record[10] =~ m/$ligation_junction/ || $record[13] =~ m/$ligation_junction/) {
        $ligation++;
      }
    }
    # determine distance from nearest HindIII site, add to histogram
    if (index($site_file, "none") == -1) {
    my $report = (($record[1] != $record[5]) || ($pos_dist >= 20000));
    my $dist = &distHindIII($record[0], $record[1], $record[2], $record[3], $report);
    if ($dist <= 2000) {
      $hindIII{$dist}++;
    }

    $dist = &distHindIII($record[4], $record[5], $record[6], $record[7], $report);
    if ($dist <= 2000) {
      $hindIII{$dist}++;
    }
    }
    if ($is_dangling) {
      if ($record[10] =~ m/^$dangling_junction/) {
        $dist = &distHindIII($record[0], $record[1], $record[2], $record[3], 1);
      }
      else { #	$record[13] =~ m/^$dangling_junction/) 
        $dist = &distHindIII($record[4], $record[5], $record[6], $record[7], 1);
      }
      if ($dist == 1) {
        if ($record[1] eq $record[5]) {
          if ($pos_dist < 20000) {
            $true_dangling_intra_small++;
          }
          else {
            $true_dangling_intra_large++;
          }
        }
        else {
          $true_dangling_inter++;
        }
      }
    }
  }
 }

my($statsfilename, $directories, $suffix)= fileparse($stats_file, qr/\.[^.]*/);
my $histsfile = $directories . $statsfilename . "_hists.m";

my $seq=0;
if (-e $stats_file) {
  open FILE, $stats_file or die $!;
  while (<FILE>) {
    if (/Sequenced/g) {
      ($label, $reads) = split(':', $_);
      $seq=1;
      $reads =~ s/,//g;
      $reads =~ s/ //g;
    }
  } 
  close FILE;
}
open FILE, " >> $stats_file", or die $!;
if ($unique==0) {
	$unique=1;
}

print FILE "Intra-fragment Reads: " . commify($intra_fragment);
if ($seq == 1) {
  printf FILE " (%0.2f\% / ", $intra_fragment*100/$reads; 
}
else {
  print FILE "(";
}
printf FILE "%0.2f\%)\n", $intra_fragment*100/$unique; 

print FILE "Below MAPQ Threshold: " . commify($under_mapq);
if ($seq == 1) {
  printf FILE " (%0.2f\% / ", $under_mapq*100/$reads; 
}
else {
  print FILE "(";
}
printf FILE "%0.2f\%)\n", $under_mapq*100/$unique; 

print FILE "Hi-C Contacts: " . commify($total_current);
if ($seq == 1) {
  printf FILE " (%0.2f\% / ", $total_current*100/$reads; 
}
else {
  print FILE "(";
}
printf FILE "%0.2f\%)\n", $total_current*100/$unique; 

printf FILE " Ligation Motif Present: %s ", commify($ligation);
if ($seq == 1) {
  printf FILE " (%0.2f\% / ", $ligation*100/$reads; 
}
else {
  print FILE "(";
}
printf FILE "%0.2f\%)\n", $ligation*100/$unique; 

if ($five_prime_end + $three_prime_end > 0) {
  my $f1 = $three_prime_end*100/($five_prime_end + $three_prime_end);
  my $f2 = $five_prime_end*100/($five_prime_end + $three_prime_end);
  printf FILE " 3' Bias (Long Range): %0.0f\%", $f1;
  printf FILE " - %0.0f\%\n", $f2;
}
else {
  print FILE " 3' Bias (Long Range): 0\% \- 0\%\n";
}
if ($large > 0) {
	printf FILE " Pair Type %(L-I-O-R): %0.0f\%", $left*100/$large;
  printf FILE " - %0.0f\%", $inner*100/$large;
  printf FILE " - %0.0f\%", $outer*100/$large;
  printf FILE " - %0.0f\%\n", $right*100/$large;
}
else {
	print FILE " Pair Type %(L-I-O-R): 0\% - 0\% - 0\% - 0\%\n";
}

printf FILE "Inter-chromosomal: %s ", commify($inter);
if ($seq == 1) {
  printf FILE " (%0.2f\% / ", $inter*100/$reads; 
}
else {
  print FILE "(";
}
printf FILE "%0.2f\%)\n", $inter*100/$unique; 

printf FILE "Intra-chromosomal: %s ", commify($intra);
if ($seq == 1) {
  printf FILE " (%0.2f\% / ", $intra*100/$reads; 
}
else {
  print FILE "(";
}
printf FILE "%0.2f\%)\n", $intra*100/$unique; 

printf FILE "Short Range (<20Kb): %s ", commify($small);
if ($seq == 1) {
  printf FILE " (%0.2f\% / ", $small*100/$reads; 
}
else {
  print FILE "(";
}
printf FILE "%0.2f\%)\n", $small*100/$unique; 

printf FILE "Long Range (>20Kb): %s ", commify($large);
if ($seq == 1) {
  printf FILE " (%0.2f\% / ", $large*100/$reads; 
}
else {
  print FILE "(";
}
printf FILE "%0.2f\%)\n", $large*100/$unique; 

close FILE;

open FILE, "> $histsfile", or die $!; 
print FILE "A = [\n";
for (my $i=1; $i <= 2000; $i++) {
	my $tmp =	 $hindIII{$i} || 0;
	print FILE "$tmp ";
}
print FILE "\n];\n";
print FILE "B = [\n";
for (my $i=0; $i <= 200; $i++) {
	my $tmp = $mapQ{$i} || 0;
	my $tmp2 = $mapQ_intra{$i} || 0;
	my $tmp3 = $mapQ_inter{$i} || 0;
	print FILE "$tmp $tmp2 $tmp3\n ";
}
print FILE "\n];\n";
print FILE "D = [\n";
for (my $i=0; $i < scalar(@bins); $i++) {
	my $tmp = $innerM{$i} || 0;
	print FILE "$tmp ";
	$tmp = $outerM{$i} || 0;
	print FILE "$tmp ";
	$tmp = $rightM{$i} || 0;
	print FILE "$tmp ";
	$tmp = $leftM{$i} || 0;
	print FILE "$tmp\n";
}

print FILE "\n];";
print FILE "x = [\n";
for (my $i=0; $i < scalar(@bins); $i++) {
	print FILE "$bins[$i] ";
}
print FILE "\n];\n";
close FILE;

# Find distance to nearest HindIII restriction site
sub distHindIII {
	# $_[3] already holds the fragment index (the upper index of the position in the sites array)
	my $index = $_[3];
	# get distance to each end of HindIII fragment
	my $dist1;
	if ($index == 0) {
		# first fragment, distance is position
		$dist1 =	$_[2];	
	}
	else {
		$dist1 = abs($_[2] - $chromosomes{$_[1]}[$index-1]);
	}
	my $dist2 = abs($_[2] - $chromosomes{$_[1]}[$index]);

	# get minimum value -- if (dist1 <= dist2), it's dist1, else dist2
	my $retval = $dist1 <= $dist2 ? $dist1 : $dist2; 
	# get which end of the fragment this is, 3' or 5' (depends on strand)
	if ($retval == $dist1 && $_[4]) {
		$_[0] == 0 ? $five_prime_end++ : $three_prime_end++;
	}
	elsif ($retval == $dist2 && $_[4]) {
		$_[0] == 16 ? $five_prime_end++ : $three_prime_end++;
	}
	return $retval;
}

# Binary search, array passed by reference
# search array of integers a for given integer x
# return index where found or upper index if not found
sub bsearch {
	my ($x, $a) = @_;		 # search for x in array a
	my ($l, $u) = (0, @$a - 1);	 # lower, upper end of search interval
	my $i;	      		 # index of probe
	while ($l <= $u) {
		$i = int(($l + $u)/2);
		if ($a->[$i] < $x) {
	    $l = $i+1;
		}
		elsif ($a->[$i] > $x) {
	    $u = $i-1;
		}
		else {
	    return $i; # found
		}
	}
	return $l;				 # not found, return upper
}
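# insert commas as thousands separators, e.g. 1234567 -> 1,234,567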
sub commify {
    my $text = reverse $_[0];
    $text =~ s/(\d\d\d)(?=\d)(?!\d*\.)/$1,/g;
    return scalar reverse $text;
}
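For reference, the stats jobs above invoke this script with the pipeline's own variables; spelled out with illustrative paths, a typical call looks like:

scripts/statistics.pl -s restriction_sites/hg19_MboI.txt -l GATCGATC \
    -o aligned/inter.txt -q 1 aligned/merged_nodups.txt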