Expertise in data analysis at WZW

Storing and analysing the large datasets generated by Next Generation Sequencing is challenging. The following overview lists the software platforms currently in use at the WZW and is meant to help identify colleagues who can give advice.

De novo Assembly of microbial genomes

(Chair of Microbial Ecology, Prof. Siegfried Scherer)

Data processing and analysis are generally performed with published open-source bioinformatics programs. First, quality control ensures that only raw sequence data with high base-call accuracy are used for downstream processing and analysis; at this stage reads are quality-filtered and also trimmed. Subsequently, the high-quality reads are assembled using up to five widely used assembly programs. Finally, metrics describing assembly contiguity and accuracy are calculated from the contig set produced by the assembly.
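The text does not name the individual assembly metrics. As an illustration only, the following minimal Python sketch computes N50, a commonly reported contiguity metric, together with a few basic summary values. The function names and the choice of metrics are assumptions made for this example and are not taken from the pipeline described above.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list).

    N50 is the length of the contig at which the cumulative length of
    contigs, sorted from longest to shortest, first reaches at least
    half of the total assembly length.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0


def assembly_summary(contig_lengths):
    """Collect basic contiguity metrics (hypothetical selection) in a dict."""
    return {
        "num_contigs": len(contig_lengths),
        "total_length": sum(contig_lengths),
        "largest_contig": max(contig_lengths, default=0),
        "n50": n50(contig_lengths),
    }
```

Such values are typically compared across the assemblies produced by the different programs; accuracy metrics usually need additional information, for example a reference genome or read mappings, and are not covered by this sketch.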

Data processing and analysis require high-performance computing resources running Linux as well as experience with command-line tools.

Tools and software packages used: SolexaQA, FastQC, NGS QC Toolkit, Picard, SAMtools, BamTools, ABySS, SPAdes, Edena, MaSuRCA, Velvet.

Genomatix Mining Station and Genome Analyzer

(Chair of Molecular Nutritional Medicine, Prof. Martin Klingenspor)

This comprehensive software suite by the company Genomatix provides a user-friendly point-and-click interface for the complete data processing pipeline, from raw data to advanced analyses. Pre-designed sets of routines are available for different types of sample material (RNA-Seq, DNA-Seq, ChIP-Seq, small RNAs, methylation). Beyond mapping, it is possible to perform pairwise comparisons, generate candidate lists, perform pathway analyses, export data into other formats (including MS Excel) and much more.

The main advantages of the Genomatix software suite are its ease of use for natural scientists not specifically trained in bioinformatics and the variety of integrated analysis options.

Genomatix Mining Station and Genome Analyzer are run by the Chair of Molecular Nutritional Medicine on a dedicated server on campus. Contact persons for access and advice are Caroline Kless (71-2365) and Yongguo Li (71-2368).

RNA and Small-RNA Seq

(Chair of Physiology, acting head Prof. Michael W. Pfaffl)

Data analysis is based on a range of freely available software tools, with a focus on small RNA sequencing data. The tools cover every step from handling of raw data to the final differential comparisons. Sequencing data are trimmed where necessary and subjected to a number of quality control tests (Phred score, insert length distribution, tests for sequence length bias and GC content bias). Besides mapping to available reference genomes, data can also be aligned to specific databases for the targeted evaluation of RNA fractions (Rfam, miRBase, piRNA...). Mapped reads are normalized by their distribution and analyzed for differential regulation using several algorithms. In addition, the results of the differential analysis can be visualized as PCA plots and heat maps.
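To make two of the quality control measures mentioned above more concrete, the following minimal Python sketch decodes Phred scores from a FASTQ quality string and computes the GC content of a read. It assumes the Phred+33 encoding, which the text does not specify, and all function names are hypothetical. In practice tools such as FastQC report these values.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores.

    Assumes Phred+33 encoding by default (an assumption of this sketch).
    """
    return [ord(ch) - offset for ch in quality_string]


def mean_error_probability(quality_string, offset=33):
    """Average per-base error probability, using P = 10 ** (-Q / 10)."""
    scores = phred_scores(quality_string, offset)
    if not scores:
        return 0.0
    return sum(10 ** (-q / 10) for q in scores) / len(scores)


def gc_content(sequence):
    """Fraction of G and C bases in a read (0.0 for an empty read)."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)
```

A Phred score of 20 corresponds to an error probability of 1 in 100 and a score of 40 to 1 in 10,000. Averaging the error probabilities rather than the scores themselves is a design choice of this sketch that weights low-quality bases more strongly.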

A powerful computer with a Linux partition and expertise in the use of command-line tools and R are required.

Software tools and R packages include: Btrim, FastQC, Bowtie, HTSeq, SAMtools, NOISeq, DESeq, gplots, pcaMethods.

Illumina-based analysis of bacterial communities by 16S ribosomal RNA sequencing

(Junior Research Group Intestinal Microbiome, Dr. Thomas Clavel)

The protocol currently allows up to 300 samples from various environments to be sequenced in parallel at a depth of >15,000 sequences per sample. Libraries are prepared from DNA extracted after mechanical lysis of microbial cells, using primers spanning the V3/V4 region of 16S rRNA genes. Data analysis follows high-quality standards and is based on pipelines developed in-house and on open-source software and applications, including UPARSE, QIIME and the RDP. After sequencing in paired-end mode, reads are demultiplexed and checked for quality and the presence of chimeras. After binning of sequences at the desired similarity threshold, datasets are filtered based on abundance and prevalence cutoffs to avoid analysis of spurious operational taxonomic units. Downstream analyses include calculation of alpha-diversity indices (phylotype richness) and phylogenetic distances followed by multidimensional or dendrogram analysis (beta-diversity), statistical assessment of changes in composition after taxonomic classification, and per-case analysis of associations between bacterial readouts and environmental, host phenotypic or genotypic parameters.
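The filtering and richness steps described above can be sketched as follows. This is a minimal illustration in Python, not the in-house pipeline: the table layout (a dict mapping each OTU to its read counts per sample), the function names and the default cutoff values are all assumptions made for this example.

```python
def filter_otus(otu_table, min_rel_abundance=0.0025, min_prevalence=1):
    """Keep OTUs that reach a relative abundance cutoff in enough samples.

    otu_table maps an OTU identifier to a list of read counts, one per
    sample, with all lists in the same sample order. Both cutoffs are
    hypothetical defaults and not values taken from the protocol.
    """
    if not otu_table:
        return {}
    n_samples = len(next(iter(otu_table.values())))
    # Total read count per sample, used to compute relative abundances.
    sample_totals = [
        sum(counts[i] for counts in otu_table.values())
        for i in range(n_samples)
    ]
    kept = {}
    for otu, counts in otu_table.items():
        # Number of samples in which this OTU reaches the abundance cutoff.
        n_passing = sum(
            1
            for count, total in zip(counts, sample_totals)
            if total > 0 and count / total >= min_rel_abundance
        )
        if n_passing >= min_prevalence:
            kept[otu] = counts
    return kept


def richness(otu_table, sample_index):
    """Phylotype richness: number of OTUs with at least one read in a sample."""
    return sum(1 for counts in otu_table.values() if counts[sample_index] > 0)
```

For example, an OTU with a single read in a sample of 1,000 reads has a relative abundance of 0.1 % and would be dropped by a 1 % cutoff, which lowers the richness of that sample by one.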