In a nutshell:

Available presets

Presets of snippets and pipelines are available. You can list the repositories and manage the installed modules with the repos subcommand:

$ pype repos
usage: pype repos [-r REPO_LIST] {list,install,clean,info} ...

positional arguments:
  {list,install,clean,info}
  list                List the available repositories
  install             Install modules from selected repository
  clean               Cleanup all module folders
  info                Print location of the modules currently in use

optional arguments:
  -r REPO_LIST, --repo REPO_LIST
                    Repository list. Default:
                    /usr/lib/python2.7/site-packages/pype/repos.yaml

Overview of the framework

The framework is heavily based on the Python argparse module and aims for a self-explanatory main command line interface: pype.
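
The subcommand layout can be illustrated with a minimal argparse sketch. This is not the bio_pype source: the build_parser helper and the help strings are assumptions made for illustration, and only the subcommand names (repos, profiles, snippets, pipelines) come from this page.

```python
import argparse


def build_parser():
    """Build a toy top-level parser with one subparser per subcommand."""
    parser = argparse.ArgumentParser(prog="pype")
    subparsers = parser.add_subparsers(dest="command")
    # Subcommand names are taken from this page; help strings are illustrative.
    for name in ("repos", "profiles", "snippets", "pipelines"):
        subparsers.add_parser(name, help="%s subcommand" % name)
    return parser


args = build_parser().parse_args(["profiles"])
print(args.command)
```

Each subparser can then register its own arguments, which is what lets every subcommand print its own usage message.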

Tool versions and data sources (assemblies/organisms) are managed in profile files.

 $ pype profiles
 usage: pype profiles {info,check}

 positional arguments:
 {info,check}
 info        Retrieve details from available profiles
 check       Check if a profile is valid

$ pype profiles info --all
     default:          Default profile b37 from 1k genome project
     hg19:             Profile for the UCSC hg19 assembly
     hg38:             Profile for the hg38 assembly human genome

Each profile is defined by a YAML file, for example:

info:
    description: HG19 default profile
    date:        10/03/2016
genome_build: 'hg19'
files:
    genome_fa:    /path/to/b37/genome.fa
    genome_fa_gz: /path/to/b37/genome.fa.gz
    exons_list:   /path/to/exome_hg37_interval_list.txt
    db_snp:       /path/to/dbSNP/dbsnp_138.b37.vcf.gz
    cosmic:       /path/to/cosmic/CosmicCodingMuts.vcf.gz

programs:
    samtools_0:
        path: samtools
        version: 0.1.18
    samtools_1:
        path: samtools
        version: 1.2
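
Once parsed (for example with a YAML loader), a profile like this is plain nested data. The sketch below uses a hand-written dictionary mirroring part of the example and a hypothetical program_label helper that joins path and version; this is an assumption about how such an entry could be consumed, not bio_pype's own code. Versions are written as strings here, because a YAML loader would read an unquoted 1.2 as a number.

```python
# Hand-written stand-in for the parsed YAML profile shown above.
profile = {
    "info": {"description": "HG19 default profile", "date": "10/03/2016"},
    "genome_build": "hg19",
    "files": {"genome_fa": "/path/to/b37/genome.fa"},
    "programs": {
        "samtools_0": {"path": "samtools", "version": "0.1.18"},
        "samtools_1": {"path": "samtools", "version": "1.2"},
    },
}


def program_label(program):
    """Hypothetical helper: join path and version as 'path/version'."""
    return "%s/%s" % (program["path"], program["version"])


print(program_label(profile["programs"]["samtools_1"]))  # samtools/1.2
```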

Each tool is wrapped in a Python function (a snippet), which loads the program specified in the selected profile and usually executes the tool using the subprocess module.

$ pype snippets
error: too few arguments
usage: pype snippets [--log LOG]
                {biobambam_merge,mutect,bwa_mem,freebayes}
                ...

positional arguments:
{biobambam_merge,mutect,bwa_mem,freebayes}
biobambam_merge     Mark duplicates and merge BAMs with biobambam
mutect              Call germline and somatic variants with mutect
oncotator           Annotate various file format with Oncotator
bwa_mem             Align fastQ file with the BWA mem algorithm
freebayes           A haplotype-based variant detector
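
The idea behind a snippet can be sketched as a function that builds a command from a profile entry and runs it with subprocess. The run_program name and the dictionary layout are assumptions, not the bio_pype API; the Python interpreter stands in for a real tool so that the example runs anywhere.

```python
import subprocess
import sys


def run_program(program, args):
    """Run the executable named in a profile entry and return its stdout."""
    command = [program["path"]] + list(args)
    result = subprocess.run(
        command, stdout=subprocess.PIPE, check=True, universal_newlines=True
    )
    return result.stdout


# The current Python interpreter stands in for a tool such as samtools.
fake_tool = {"path": sys.executable, "version": "n/a"}
print(run_program(fake_tool, ["-c", "print('aligned')"]).strip())
```

Passing the command as a list avoids shell quoting issues, and check=True raises an exception if the tool exits with a non-zero status.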

The tools can be chained and combined within YAML files (pipelines), which specify dependencies and input files. The profile is parsed to provide command line options through the argparse interface:

$ pype pipelines
error: too few arguments
usage: pype pipelines [--queue {msub,echo,none}] [--log LOG]
                 {bwa,freebayes,mutect,bwa_mem}
                 ...

positional arguments:
{bwa,delly,freebayes,mutect,bwa_mem}
bwa                 Align with bwa mem, merge different lanes and
                    performs various QC stats
freebayes           Freebayes pipeline from Fastq to VCF
mutect              Mutect from fastq files
bwa_mem             Align with bwa mem and performs QC stats

optional arguments:
--queue {msub,echo,none}
                   Select the queuing system to run the pipeline
--log LOG          Path used to write the pipeline logs. Default
                   working directory.
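
Dependency handling between pipeline steps can be illustrated with a topological sort. The step names and the dictionary layout below are hypothetical and do not reflect the bio_pype pipeline YAML schema; graphlib requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Hypothetical steps, each mapped to the set of steps it depends on.
dependencies = {
    "bwa_mem_tumor": set(),
    "bwa_mem_normal": set(),
    "mutect": {"bwa_mem_tumor", "bwa_mem_normal"},
    "oncotator": {"mutect"},
}

# static_order() yields every step after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

The two alignment steps have no mutual dependency, so their relative order is not fixed and they could be submitted to a queue in parallel.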

A typical scenario that demonstrates the advantages of this package is the alignment of two DNA sequencing (NGS) samples, a normal and a tumor sample, followed by a comparison between the two to find somatic mutations. On the command line, this task requires, for each sample, a command for the alignment, a command for sorting and a command for indexing. After the two samples are processed, the results need to be compared by a mutation caller, usually specifying a database such as COSMIC or dbSNP.

The easiest and fastest way to implement these tasks is to wrap the command lines in bash scripts and submit them to a queuing system, or run them directly in a shell. This requires organizing logs and scripts by hand, and may still result in a lot of maintenance to keep the scripts up to date.

Other pipeline systems can solve the problem, but they may not be executable on the fly from a command line interface (e.g., they need a configuration file before execution), or they may not allow the genome build or database version to be changed easily.

With the bio_pype framework each step can be launched separately. An example of usage of the alignment snippet:

pype snippets bwa_mem
usage:
pype snippets bwa_mem -h HEADER -1 F1 -2 F2 [-t TMP] [-o OUT]

optional arguments:
  -h HEADER          Header file containing the @RG groups,
                     or the comma separated header line
  -1 F1              First mate fastQ file
  -2 F2              Second mate fastQ file
  -t TMP, --tmp TMP  Temporary folder
  -o OUT, --out OUT  Output name for the bam file
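
Because -h is used here for the header rather than for help, an argparse parser reproducing this interface has to disable the automatic help option. A minimal sketch, in which the parser construction is an assumption and only the option names come from the usage text above:

```python
import argparse

# add_help=False frees "-h", which argparse otherwise reserves for help.
parser = argparse.ArgumentParser(prog="pype snippets bwa_mem", add_help=False)
parser.add_argument("-h", dest="header", required=True,
                    help="Header file with the @RG groups, or a header line")
parser.add_argument("-1", dest="f1", required=True, help="First mate fastQ file")
parser.add_argument("-2", dest="f2", required=True, help="Second mate fastQ file")
parser.add_argument("-t", "--tmp", help="Temporary folder")
parser.add_argument("-o", "--out", help="Output name for the bam file")

# File names below are placeholders.
args = parser.parse_args(["-h", "rg.txt", "-1", "r1.fq", "-2", "r2.fq"])
print(args.header, args.f1, args.f2)
```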

Or use the pipeline:

pype pipelines mutect
usage: pype pipelines mutect --tumor_bam TUMOR_BAM --out_dir OUT_DIR
                             --sample_name SAMPLE_NAME --normal_bam NORMAL_BAM
                             [--tmp_dir TMP_DIR] --tumor_bwa_list
                             TUMOR_BWA_LIST --qc_bam_tumor QC_BAM_TUMOR
                             --normal_bwa_list NORMAL_BWA_LIST --qc_bam_normal
                             QC_BAM_NORMAL

Required:
  Required pipeline arguments

  --tumor_bam TUMOR_BAM
                        Input Bam file of the tumor sample, type: str
  --out_dir OUT_DIR     Output file name prefix for the analysis, type: str
  --sample_name SAMPLE_NAME
                        Sample name or identifier of the run, type: str
  --normal_bam NORMAL_BAM
                        Input Bam file of the normal/control sample, type: str
  --tumor_bwa_list TUMOR_BWA_LIST
                        Batch file to run the bwa_mem pipeline on different
                        lanes of the tumor, type: str
  --qc_bam_tumor QC_BAM_TUMOR
                        Tumor BAM files QC directory output path, type: str
  --normal_bwa_list NORMAL_BWA_LIST
                        Batch file to run the bwa_mem pipeline on different
                        lanes of the normal/control, type: str
  --qc_bam_normal QC_BAM_NORMAL
                        Normal/Control BAM files QC directory output path,
                        type: str

Optional:
  Optional pipeline arguments

  --tmp_dir TMP_DIR     temporary folder, type: str. Default: /scatch

Python and Environment Modules:

Example of loading the hg37 profile in the Python prompt:

from pype.modules.profiles import get_profiles

profiles = get_profiles({})
profile = profiles['hg37']
genome = profile.files['genome_fa_gz']

The framework also supports Environment Modules:

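from pype.modules.profiles import get_profiles
from pype.env_modules import get_module_cmd, program_string

profiles = get_profiles({})
profile = profiles['hg37']
# Assumption: get_module_cmd() returns the callable used to run module commands
module = get_module_cmd()
module('add', program_string(profile.programs['samtools_1']))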