Projet fait par Pierre BERGERET, Camille RABIER, Lydie TRAN et Dalyan VENTURA

public 1yr ago 0 bookmarks

View Workflow

Help improve this workflow!

This workflow has been published but could be further improved with some additional meta data:

Keyword(s) in categories input, output, operation, topic

You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .

Installations pré-requises et commande pour lancer le script

La commande à lancer est :

snakemake --use-singularity --resources load=100 --cores all

Afin de lancer le workflow, Snakemake et singularity doivent être installés.
conda install snakemake
conda install singularity

Résumé du workflow

Le script télécharge l'ensemble des FASTQ issus des expériences de RNAseq présent dans la liste "samples", puis télécharge les chromosomes humains et les regroupe en un fichier du génome.Il télécharge aussi le fichier d'annotation du génome humain.

Le génome humain est ensuite indexé à l'aide de STAR, puis les reads issus des FASTQ sont alignés sur le génome à l'aide de STAR.

L'alignement par STAR donne des fichiers bam qui sont ensuite indexés par samtools.

L'ensemble des FASTQ et bam ont été analysés par FASTQC. Les pages résumants ces analyses sont dans le dossier FASTQC

Le workflow utilise ensuite featureCounts pour compter le nombre de reads par gènes et par exons et pour toutes les possbilités de brins (0: unstranded, 1: stranded, 2: reversely stranded).

Finalement, nous avons analysés à l'aide de DESEQ2 et de PCA, les tables de comptages pour les gènes unstranded. Avec nos samples, les stranded et reversely stranded donnant des tables de comptages avec uniquement des 0.

Ressources demandées par le script

La machine virtuelle utilisée pour les tests correspond à celle de ifb-core avec les caractéristiques suivantes:

16 cores
64GB de RAM
920GB de stockage

Il est cependant normalement possible de se limiter à une machine avec 32GB de RAM au lieu de 64GB.

Code Snippets

library(DESeq2)
library(factoextra)
library(FactoMineR)

# Chargement des donnees
files <- (Sys.glob("gene_strand_0/*.counts"))
list_files <- lapply(files, function(x) read.table(x, header = T)) 

######  CREATION DATAFRAME
nb_ech = length(list_files) 
nb_gene = nrow(list_files[[1]])

id_gene <- list_files[[1]]["Geneid"]

df <- rep(0, nb_ech)

for (i in 1:nb_ech) {
   df[i] <- data.frame(list_files[[i]][,7])
}


df <- cbind(id_gene, df)
colnames(df) <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
df_t <- t(as.matrix(df[,-1]))


# Specifier les types 
Type = data.frame("Mutant_R625C","Mutant_R625C","Mutant_R625H","WT","WT","WT","WT","Mutant_R625C")
Type = t(Type)
row.names(Type) <- c(2, 3, 4, 5, 6, 7, 8, 9)
colnames(Type) <- "Type"

#######  PCA DONNEES BRUTES #########
pdf("PCA_not_normalized.pdf")
fviz_pca_ind(PCA(df_t,graph=F),col.ind = Type)
dev.off()

#### PCA AVEC DATA FRAME AVEC UNIQUEMENT LES GENES EXPRIMES
ind_exp <- rowSums(df[,-1])
ind_exp <- which(ind_exp > 0)
df_exp <- df[ind_exp,]
id_gene_exp <- id_gene[ind_exp,]

#Analyse différentielle
Des <- DESeqDataSetFromMatrix(countData = df_exp[-1], colData = Type, design = ~Type)
analysis = DESeq(Des)
res = results(analysis)
write.table(res, "Deseq2_results_4mutants_table.txt", sep = "\t", row.names = TRUE, col.names = TRUE)

#Normalisation
vst = getVarianceStabilizedData(analysis)
vst_t <- t(vst)

#PCA SUR DONNÉES NORMALISÉES
pdf("PCA_expressed_normalized.pdf")
fviz_pca_ind(PCA(vst_t,graph=F),col.ind = Type)
dev.off()

#On voit que le mutant R625H se comporte comme un wild type, nous avons donc changé son type pour wild type
Type2 = t(data.frame("Mutant","Mutant","WT","WT","WT","WT","WT","Mutant"))
row.names(Type2) <- c(2, 3, 4, 5, 6, 7, 8, 9)
colnames(Type2) <- "Type"
Des2 <- DESeqDataSetFromMatrix(countData = df_exp[-1], colData = Type2, design = ~Type)
analysis2 = DESeq(Des2)
res2 = results(analysis2)
write.table(res2, "Deseq2_results_3mutants_table.txt", sep = "\t", row.names = TRUE, col.names = TRUE)


res_mat <- as.matrix(res2)
res_mat <- data.frame(id_gene_exp, res_mat)

### EXTRACTION DES GENES

# Lignes dont la pvalue ajustee est < 0.05
ind_padj <- res_mat[,"padj"]
ind_padj <- which(ind_padj<0.05)

exprime_diff = length(which(res_mat[,"padj"]<0.05 & (res_mat[,"log2FoldChange"]>1 |res_mat[,"log2FoldChange"]< -1)))
sur_exprime =length(which(res_mat[,"padj"]<0.05 & res_mat[,"log2FoldChange"]>1))
sous_exprime=length(which(res_mat[,"padj"]<0.05 & res_mat[,"log2FoldChange"]< -1))
print(paste("Genes sous et sur exprimés:",exprime_diff))
print(paste("Genes sous exprimés:",sous_exprime))
print(paste("Genes sur exprimés:",sur_exprime))




# Volcano plot
logp <- -log10(res_mat[,"padj"])
pdf("VolcanoPlot.pdf")
plot(logp~res_mat[,"log2FoldChange"], main = "Expression différentielle des gènes entre WT et mutés", xlab = "Log2FoldChange", ylab = "- Log10(padj)")
dev.off()

R factoextra FactoMineR From line 1 of master/script_analyse_deseq.R

shell:
 "fasterq-dump {wildcards.sample}"

SnakeMake From line 23 of master/Snakefile

shell:
 """
  for chr in "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "20" "21" "22" "MT" "X" "Y";
  do
  	wget -O "$chr".fa.gz ftp://ftp.ensembl.org/pub/release-101/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome."$chr".fa.gz
  done
  gunzip -c *.fa.gz > {output}
 """

SnakeMake From line 32 of master/Snakefile

shell: "STAR --runThreadN {threads} --runMode genomeGenerate --genomeDir ref/ --genomeFastaFiles {input}"

SnakeMake STAR From line 52 of master/Snakefile

shell:
 """
  STAR --outSAMstrandField intronMotif \
  --outFilterMismatchNmax 4 \
  --outFilterMultimapNmax 10 \
  --genomeDir ref \
  --readFilesIn {input.sample1} {input.sample2} \
  --runThreadN {threads} \
  --outSAMunmapped None \
  --outSAMtype BAM SortedByCoordinate \
  --outStd BAM_SortedByCoordinate \
  --genomeLoad NoSharedMemory \
  --limitBAMsortRAM 31000000000 \
  --outFileNamePrefix {wildcards.SAMPLE} > {output}
 """

SnakeMake STAR From line 67 of master/Snakefile

shell:
 """
  samtools index {input}
 """

SnakeMake SAMtools From line 93 of master/Snakefile

shell:
 """
  wget -O {output}.gz ftp://ftp.ensembl.org/pub/release-101/gtf/homo_sapiens/Homo_sapiens.GRCh38.101.chr.gtf.gz
  gunzip {output}
 """

SnakeMake From line 103 of master/Snakefile

shell:
 """
  featureCounts -T {threads} -t {wildcards.TYPE} -g gene_id -s {wildcards.STRAND} -a {input.genome} -o {output} {input.bam}
 """

SnakeMake FeatureCounts From line 124 of master/Snakefile

shell:
 """
  fastqc {input.bam} -o FASTQC
  fastqc {input.sample1} -o FASTQC
  fastqc {input.sample2} -o FASTQC
 """

SnakeMake FastQC From line 143 of master/Snakefile

script:
 "script_analyse_deseq.R"

SnakeMake From line 157 of master/Snakefile

ShowHide 5 more snippets with no or duplicated tags.

Comments

Support

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Created: 1yr ago

Updated: 1yr ago

Maitainers: public

URL: https://github.com/Annaklumos/Hackathon

Name: hackathon

Version: 1

Badge:

Insert copied code into your website to add a link to this workflow.

License: None

Keywords:

DESeq2 FastQC FeatureCounts SAMtools Snakemake STAR factoextra FactoMineR

Future updates

Related Workflows

psychip_snakemake — Show Details View Workflow

ENCODE pipeline for histone marks developed for the psychENCODE project

public

psychip pipeline is an improved version of the ENCODE pipeline for histone marks developed for the psychENCODE project. The o...

raw sequence reads Alignment Sequence alignment report macs2 ucsc-bedclip bedGraphToBigWig BEDTools BWA Picard SAMtools Snakemake

Free

Near-real time tracking of SARS-CoV-2 in Connecticut

public

Repository containing scripts to perform near-real time tracking of SARS-CoV-2 in Connecticut using genomic data. This pipeli...

JSON nextclade Augur Biopython FOCUS Pandas Snakemake bs4 epiweeks geopy matplotlib numpy pycountry pycountry-convert uszipcode

Free

cellranger-snakemake-gke — Show Details View Workflow

snakemake workflow to run cellranger on a given bucket using gke.

public

A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...

macs2 ucsc-bedclip bedGraphToBigWig BEDTools BWA Picard SAMtools Snakemake

Free

ATLAS - Three commands to start analyzing your metagenome data

public

Metagenome-atlas is a easy-to-use metagenomic pipeline based on snakemake. It handles all steps from QC, Assembly, Binning, t...

raw sequence reads Genome assembly Annotation track checkm2 gunc prodigal snakemake-wrapper-utils MEGAHIT Atlas BBMap Biopython BioRuby Bwa-mem2 cd-hit CheckM DAS Diamond eggNOG-mapper v2 MetaBAT 2 Minimap2 MMseqs MultiQC Pandas Picard pyfastx SAMtools SemiBin Snakemake SPAdes SqueezeMeta TADpole VAMB CONCOCT ete3 gtdbtk h5py networkx numpy plotly psutil utils metagenomics

Free

175

rna-seq-star-deseq2 — Show Details View Workflow

RNA-seq workflow using STAR and DESeq2

public

This workflow performs a differential gene expression analysis with STAR and Deseq2. The usage of this workflow is described ...

Free

dna-seq-gatk-variant-calling — Show Details View Workflow

This Snakemake pipeline implements the GATK best-practices workflow

public

This Snakemake pipeline implements the GATK best-practices workflow for calling small germline variants. The usage of thi...

VCF raw sequence reads Variant calling genetic variants gatk rust-bio-tools snakemake-wrapper-utils tabix BCFtools BWA FastQC MultiQC Pandas Picard SAMtools Snakemake Trimmomatic Variant Effect Predictor (VEP) common matplotlib numpy seaborn DNA

Free