In a nutshell:
Available presets
Presets of snippets and pipelines are available. You can list the available repositories and manage the installed modules with the repos subcommand:
$ pype repos
usage: pype repos [-r REPO_LIST] {list,install,clean,info} ...
positional arguments:
{list,install,clean,info}
list List the available repositories
install Install modules from selected repository
clean Cleanup all module folders
info Print location of the modules currently in use
optional arguments:
-r REPO_LIST, --repo REPO_LIST
Repository list. Default:
/usr/lib/python2.7/site-packages/pype/repos.yaml
Overview of the framework
The framework is heavily based on the Python argparse module, aiming for a self-explanatory use of the main command line interface: pype.
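To illustrate the argparse-driven layout, here is a minimal sketch of a nested subcommand parser that mimics `pype profiles {info,check}`. It is not the framework's actual code: the function name `build_parser` and the way `--all` is wired are assumptions made for this example, and it is written for Python 3.

```python
import argparse


def build_parser():
    """Build a toy parser that mimics `pype profiles {info,check}`."""
    parser = argparse.ArgumentParser(prog="pype")
    subcommands = parser.add_subparsers(dest="command")

    # First level: `pype profiles`
    profiles = subcommands.add_parser("profiles")
    actions = profiles.add_subparsers(dest="action")

    # Second level: `pype profiles info` and `pype profiles check`
    info = actions.add_parser("info", help="Retrieve details from available profiles")
    info.add_argument("--all", action="store_true", help="List every profile")
    actions.add_parser("check", help="Check if a profile is valid")
    return parser


args = build_parser().parse_args(["profiles", "info", "--all"])
print(args.command, args.action, args.all)
```

Because each level is an ordinary argparse parser, calling a subcommand with `-h` yields usage text of the kind shown throughout this page.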
Tool versions and data sources (assembly/organism) are organized in profile files.
$ pype profiles
usage: pype profiles {info,check}
positional arguments:
{info,check}
info Retrieve details from available profiles
check Check if a profile is valid
$ pype profiles info --all
default: Default profile b37 from 1k genome project
hg19: Profile for the UCSC hg19 assembly
hg38: Profile for the hg38 assembly human genome
Each profile is defined by a YAML file, for example:
info:
  description: HG19 default profile
  date: 10/03/2016
genome_build: 'hg19'
files:
  genome_fa: /path/to/b37/genome.fa
  genome_fa_gz: /path/to/b37//genome.fa.gz
  exons_list: /path/to/exome_hg37_interval_list.txt
  db_snp: /path/to/dbSNP/dbsnp_138.b37.vcf.gz
  cosmic: /path/to/cosmic/CosmicCodingMuts.vcf.gz
programs:
  samtools_0:
    path: samtools
    version: 0.1.18
  samtools_1:
    path: samtools
    version: 1.2
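The structure of such a profile can be sanity-checked with a few lines of Python. The sketch below is only an illustration of the idea behind `pype profiles check`, not its implementation: the helper `check_profile`, the list of expected sections, and the placement of `genome_build` at the top level are all assumptions, and the profile is written as a plain dictionary so the example needs no YAML library.

```python
# Hypothetical in-memory copy of the profile shown above (file list shortened).
profile = {
    "info": {"description": "HG19 default profile", "date": "10/03/2016"},
    "genome_build": "hg19",
    "files": {"genome_fa": "/path/to/b37/genome.fa"},
    "programs": {
        "samtools_0": {"path": "samtools", "version": "0.1.18"},
        "samtools_1": {"path": "samtools", "version": "1.2"},
    },
}


def check_profile(profile):
    """Return a list of problems found in a profile dictionary."""
    problems = []
    # Assumed set of top-level sections, taken from the example profile.
    for section in ("info", "genome_build", "files", "programs"):
        if section not in profile:
            problems.append("missing section: %s" % section)
    # Every program entry is expected to carry a path and a version.
    for name, program in profile.get("programs", {}).items():
        for key in ("path", "version"):
            if key not in program:
                problems.append("program %s lacks %r" % (name, key))
    return problems


print(check_profile(profile))  # an empty list means no problems were found
print(check_profile({"programs": {"bwa": {}}}))
```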
Each tool is wrapped in a Python function (a snippet), which is designed to load the program specified in the selected profile and usually executes the tool using the subprocess module.
$ pype snippets
error: too few arguments
usage: pype snippets [--log LOG]
{biobambam_merge,mutect,bwa_mem,freebayes}
...
positional arguments:
{biobambam_merge,mutect,bwa_mem,freebayes}
biobambam_merge Mark duplicates and merge BAMs with biobambam
mutect Call germline and somatic variants with mutect
oncotator Annotate various file format with Oncotator
bwa_mem Align fastQ file with the BWA mem algorithm
freebayes A haplotype-based variant detector
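The snippet idea (look up a program in the profile, then run it through subprocess) can be sketched as follows. This is a simplified illustration written for Python 3, not the framework's snippet API: `run_tool` and `demo_profile` are hypothetical names, and the Python interpreter stands in for a real tool so the example runs anywhere.

```python
import subprocess
import sys


def run_tool(profile, program_key, arguments):
    """Look up a program in the profile and run it with subprocess."""
    program = profile["programs"][program_key]
    command = [program["path"]] + list(arguments)
    # check=True raises CalledProcessError on a non-zero exit status
    result = subprocess.run(command, check=True, capture_output=True, text=True)
    return result.stdout


# Stand-in profile: the Python interpreter plays the role of the wrapped tool,
# so the example runs without any bioinformatics software installed.
demo_profile = {"programs": {"tool_1": {"path": sys.executable, "version": "n/a"}}}
output = run_tool(demo_profile, "tool_1", ["-c", "print('aligned')"])
print(output.strip())
```

Keeping the executable path in the profile rather than in the function is what lets the same wrapper run against a different tool version by switching profiles.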
The tools can be chained and combined with one another within YAML files (pipelines), which specify dependencies and input files. Each pipeline file is parsed to provide command line options through the argparse interface.
$ pype pipelines
error: too few arguments
usage: pype pipelines [--queue {msub,echo,none}] [--log LOG]
{bwa,freebayes,mutect,bwa_mem}
...
positional arguments:
{bwa,delly,freebayes,mutect,bwa_mem}
bwa Align with bwa mem, merge different lanes and
performs various QC stats
freebayes Freebayes pipeline from Fastq to VCF
mutect Mutect from fastq files
bwa_mem Align with bwa mem and performs QC stats
optional arguments:
--queue {msub,echo,none}
Select the queuing system to run the pipeline
--log LOG Path used to write the pipeline logs. Default
working directory.
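Resolving the dependencies declared in a pipeline amounts to ordering the steps so that each one runs after the steps it needs. The sketch below shows one way to do this with a depth-first traversal. The dictionary layout and the function `execution_order` are hypothetical and do not reflect the actual schema of the pipeline YAML files.

```python
def execution_order(steps):
    """Return step names ordered so that dependencies come first."""
    ordered = []
    visiting = set()

    def visit(name):
        if name in ordered:
            return
        if name in visiting:
            raise ValueError("circular dependency at %s" % name)
        visiting.add(name)
        for dependency in steps[name]:
            visit(dependency)
        visiting.discard(name)
        ordered.append(name)

    # Sorting the names makes the resulting order deterministic.
    for name in sorted(steps):
        visit(name)
    return ordered


# Hypothetical pipeline: each key is a step, each value lists the steps it needs.
pipeline = {
    "bwa_mem_tumor": [],
    "bwa_mem_normal": [],
    "mutect": ["bwa_mem_tumor", "bwa_mem_normal"],
}
print(execution_order(pipeline))
```

In this toy pipeline the two alignment steps have no dependencies, so they come first, and the mutation caller is scheduled last.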
A typical scenario that demonstrates the advantages of this package is the alignment of two DNA sequencing (NGS) samples, a normal and a tumor sample, followed by a comparison of the two samples to find somatic mutations. On the command line this task requires, for each sample, a command for the alignment, a command for sorting and a command for indexing. After the two samples are processed, the results need to be compared by a mutation caller, usually specifying some database such as COSMIC or dbSNP.
The easiest and fastest way to implement these tasks is to wrap the command lines into bash scripts and submit them to a queuing system, or run them directly in a shell. This requires organizing logs and scripts by hand, and may still result in a lot of maintenance to keep the scripts up to date.
Other pipeline systems can solve the problem, but they may not be executable on the fly from a command line interface (e.g. they need a configuration file before execution) or they may not allow the genome build or database version to be changed easily.
With the bio_pype framework each step can be launched separately. An example of usage of the alignment snippet:
$ pype snippets bwa_mem
usage:
pype snippets bwa_mem -h HEADER -1 F1 -2 F2 [-t TMP] [-o OUT]
optional arguments:
-h HEADER Header file containing the @RG groups,
or the comma separated header line
-1 F1 First mate fastQ file
-2 F2 Second mate fastQ file
-t TMP, --tmp TMP Temporary folder
-o OUT, --out OUT Output name for the bam file
Or use the pipeline:
$ pype pipelines mutect
usage: pype pipelines mutect --tumor_bam TUMOR_BAM --out_dir OUT_DIR
--sample_name SAMPLE_NAME --normal_bam NORMAL_BAM
[--tmp_dir TMP_DIR] --tumor_bwa_list
TUMOR_BWA_LIST --qc_bam_tumor QC_BAM_TUMOR
--normal_bwa_list NORMAL_BWA_LIST --qc_bam_normal
QC_BAM_NORMAL
Required:
Required pipeline arguments
--tumor_bam TUMOR_BAM
Input Bam file of the tumor sample, type: str
--out_dir OUT_DIR Output file name prefix for the analysis, type: str
--sample_name SAMPLE_NAME
Sample name or identifier of the run, type: str
--normal_bam NORMAL_BAM
Input Bam file of the normal/control sample, type: str
--tumor_bwa_list TUMOR_BWA_LIST
Batch file to run the bwa_mem pipeline on different
lanes of the tumor, type: str
--qc_bam_tumor QC_BAM_TUMOR
Tumor BAM files QC directory output path, type: str
--normal_bwa_list NORMAL_BWA_LIST
Batch file to run the bwa_mem pipeline on different
lanes of the normal/control, type: str
--qc_bam_normal QC_BAM_NORMAL
Normal/Control BAM files QC directory output path,
type: str
Optional:
Optional pipeline arguments
--tmp_dir TMP_DIR temporary folder, type: str. Default: /scatch
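The "Required" and "Optional" sections in the usage above are the kind of output produced by argparse argument groups. As a rough sketch of how a declarative description could be turned into such a parser, the example below builds one from a small dictionary. The `SPEC` layout and `parser_from_spec` are assumptions made for illustration; only the argument names and help texts are taken from the usage text.

```python
import argparse

# Hypothetical specification; names and help texts are copied from the usage above.
SPEC = {
    "required": {
        "tumor_bam": "Input Bam file of the tumor sample",
        "normal_bam": "Input Bam file of the normal/control sample",
        "out_dir": "Output file name prefix for the analysis",
    },
    "optional": {"tmp_dir": "temporary folder"},
}


def parser_from_spec(name, spec):
    """Create an argparse parser with Required and Optional groups."""
    parser = argparse.ArgumentParser(prog="pype pipelines %s" % name)
    required = parser.add_argument_group("Required", "Required pipeline arguments")
    optional = parser.add_argument_group("Optional", "Optional pipeline arguments")
    for arg, text in spec["required"].items():
        required.add_argument("--" + arg, required=True, type=str, help=text)
    for arg, text in spec["optional"].items():
        optional.add_argument("--" + arg, type=str, help=text)
    return parser


parser = parser_from_spec("mutect", SPEC)
args = parser.parse_args(
    ["--tumor_bam", "tumor.bam", "--normal_bam", "normal.bam", "--out_dir", "results"]
)
print(args.tumor_bam, args.normal_bam, args.out_dir, args.tmp_dir)
```

Omitting one of the required options makes argparse print the usage message and exit, which matches the behaviour of a command line tool invoked with too few arguments.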
Python and Environment Modules:
Example of loading the hg37 profile in the Python prompt:
from pype.modules.profiles import get_profiles
profiles = get_profiles({})
profile = profiles['hg37']
genome = profile.files['genome_fa_gz']
The framework also supports Environment Modules:
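from pype.modules.profiles import get_profiles
from pype.env_modules import get_module_cmd, program_string
profiles = get_profiles({})
profile = profiles['hg37']
# Assumption: get_module_cmd() takes no arguments and returns the `module` callable
module = get_module_cmd()
module('add', program_string(profile.programs['samtools_1']))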