Single-Cell Data Test Dataset Generation Workflow


This workflow creates small test datasets for single-cell data analyses. The generated data are stored in the folders ref and reads, so the repository can be used directly as a git submodule for continuous integration tests.

Based on https://github.com/snakemake-workflows/ngs-test-data

Code Snippets

import gzip

import dnaio
import pysam

# The `snakemake` object is injected by Snakemake's `script:` directive.
bam_fname = snakemake.input.bam
fastq_fnames = snakemake.input.fastqs

# Collect the reads aligned to the requested chromosome, keyed by read name.
# fetch() requires an indexed BAM file.
with pysam.AlignmentFile(bam_fname, "rb") as bamfile:
    bam_reads = {
        read.query_name: read
        for read in bamfile.fetch("chr" + snakemake.wildcards.chrom)
    }

# Keep the FASTQ records whose names also occur in the BAM subset.
fastq_reads = {}
for fname in fastq_fnames:
    with dnaio.open(fname) as fh:
        for record in fh:
            name = record.name.split()[0]
            if name in bam_reads:
                fastq_reads[name] = record

# Write paired output: R1 from the original FASTQ, R2 from the BAM record.
with gzip.open(snakemake.output.r1, "wt") as ofh1, gzip.open(snakemake.output.r2, "wt") as ofh2:
    for name, bam_read in bam_reads.items():
        fastq_read = fastq_reads.get(name)
        # Skip reads without a FASTQ mate or without stored base qualities.
        if fastq_read is None or bam_read.query_qualities is None:
            continue
        bam_quals = "".join(chr(q + 33) for q in bam_read.query_qualities)
        ofh1.write("@{}\n{}\n+\n{}\n".format(name, fastq_read.sequence, fastq_read.qualities))
        ofh2.write("@{}\n{}\n+\n{}\n".format(name, bam_read.query_sequence, bam_quals))
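
The script above boils down to two steps: intersect read names from two sources, then serialize each pair as four-line FASTQ records with Phred+33 quality strings. The following is a minimal, dependency-free sketch of that idea on toy data, not the workflow's own code. The helper names (phred_to_ascii, fastq_record) and the sample reads are hypothetical and stand in for the pysam and dnaio records.

```python
import gzip
import io


def phred_to_ascii(qualities):
    """Encode integer Phred scores as a Phred+33 ASCII string."""
    return "".join(chr(q + 33) for q in qualities)


def fastq_record(name, sequence, quality_string):
    """Serialize one read as a four-line FASTQ record."""
    return "@{}\n{}\n+\n{}\n".format(name, sequence, quality_string)


# Hypothetical toy data: FASTQ-side reads as (sequence, quality string) and
# BAM-side reads as (sequence, integer qualities), both keyed by read name.
barcode_reads = {"read1": ("ACGT", "IIII"), "read3": ("TTTT", "####")}
aligned_reads = {"read1": ("GATTACA", [40, 40, 30, 30, 20, 20, 2])}

# Only names present in both sources can form a complete pair.
shared = sorted(set(barcode_reads) & set(aligned_reads))

# In-memory buffers replace the output files for this illustration.
buf1, buf2 = io.BytesIO(), io.BytesIO()
with gzip.open(buf1, "wt") as r1, gzip.open(buf2, "wt") as r2:
    for name in shared:
        seq1, qual1 = barcode_reads[name]
        seq2, qual2 = aligned_reads[name]
        r1.write(fastq_record(name, seq1, qual1))
        r2.write(fastq_record(name, seq2, phred_to_ascii(qual2)))

# Decompress R2 again to inspect what was written.
r2_text = gzip.decompress(buf2.getvalue()).decode()
print(r2_text)
```

Taking the intersection of names before writing keeps both output files in the same order and with the same number of records, which paired-end tools generally expect.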
shell:
    "zgrep -e ^{wildcards.chrom} {input} > {output}"
shell:
    "gzip -d -c {input} > {output}"
shell:
    "samtools view -b -s{params.seed}{config[sampling_factor]} {params.url} chr{wildcards.chrom} > {output}"
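
The samtools view rule above passes its -s option as a single number: samtools reads the integer part as the random seed and the decimal part as the fraction of reads to keep. How the workflow's config stores sampling_factor is not shown here, so the sketch below is an illustration only; subsample_arg is a hypothetical helper, and it assumes the seed and fraction are available as separate numeric values.

```python
def subsample_arg(seed, fraction):
    """Build the SEED.FRACTION value expected by `samtools view -s`.

    Hypothetical helper: assumes `seed` is a non-negative integer and
    `fraction` is the share of reads to keep, strictly between 0 and 1.
    """
    if seed < 0:
        raise ValueError("seed must be non-negative")
    if not 1e-6 <= fraction <= 0.999999:
        raise ValueError("fraction must lie in [0.000001, 0.999999]")
    # Format with fixed precision, then drop trailing zeros: 0.1 -> "0.1".
    frac_digits = "{:.6f}".format(fraction).rstrip("0")
    # Drop the leading "0" so that only ".1" is appended to the seed.
    return "{}{}".format(int(seed), frac_digits[1:])


# Example command line; the BAM path and region are placeholders.
command = ["samtools", "view", "-b", "-s", subsample_arg(42, 0.1), "input.bam", "chr1"]
print(" ".join(command))
```

Keeping the seed fixed makes the subsample reproducible, so repeated runs of the workflow select the same reads.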
shell:
    "samtools index {input}"
script:
    "scripts/gen_fastqs.py"

Created: 1yr ago
Updated: 1yr ago
Maintainers: public
URL: https://github.com/cnio-bu/sc-test-data
Name: sc-test-data
Version: 1
Copyright: Public Domain
License: None
