Next: HSPVdb – the Human Up: Contributed talks. Tuesday December Previous: Discussion about ROC curves   Contents   Index

Handling next generation sequence data: a pilot to run data analysis on the Dutch Life Science Grid

Barbera DC van Schaik, Angela CM Luyf, Silvia D Olabarriaga, Tristan Glatard, Frank Baas, Antoine HC van Kampen

Bioinformatics Laboratory and Neurogenetics Laboratory, Academic Medical Center, Amsterdam, the Netherlands, http://www.bioinformaticslaboratory.nl/

Abstract. High-throughput sequencing methods make it possible to perform in a few hours what previously took one year with the Sanger sequencing method. These methods allow for a wide range of new applications, such as large-scale mutation screening, sequence-based expression studies, ChIP-on-sequencing, etc. The large amount of data and the new types of analysis pose interesting challenges to the bioinformatics field. A single run on a new sequencing machine generates up to 2 GB of data, excluding the raw images. Existing programs for sequence analysis often analyse one sequence at a time and generate a volume of output that can no longer be examined manually; therefore, new ways to summarize and visualize the data are needed. Moreover, the data volume will present challenges for the ICT infrastructure, requiring larger storage and computing capacity, as well as advanced ways of managing, analysing and sharing data among researchers.

In 2007 a Roche FLX (454) system was installed at the sequencing facility of the Academic Medical Center of the University of Amsterdam. One sequence run of 7.5 hours generates 400,000 sequences of approximately 250 base pairs. We are collaborating with researchers at this facility to develop new analysis methods for these sequences. Perl applications have been developed that align these sequences against reference sequences for variation detection. Because the amount of data generated by the new sequencers is growing and the computation time is increasing, we set up a pilot as part of the BioAssist programme of the Netherlands Bioinformatics Centre (NBIC) to run the applications on the Dutch Life Science Grid, which is currently being implemented as part of the BIG GRID programme.
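As a rough back-of-the-envelope illustration of the run size quoted above, the following sketch multiplies out the figures from the text (400,000 reads, approximately 250 base pairs each, 7.5 hours per run). The function names are our own, and the read length is treated as an exact mean, which is an assumption; real runs will vary.

```python
# Figures taken from the abstract; the mean read length is approximate.
READS_PER_RUN = 400_000
MEAN_READ_LENGTH_BP = 250
RUN_HOURS = 7.5


def bases_per_run(reads: int, mean_len: int) -> int:
    """Total number of bases produced by one sequencing run."""
    return reads * mean_len


def bases_per_hour(reads: int, mean_len: int, hours: float) -> float:
    """Average number of bases produced per hour of run time."""
    return bases_per_run(reads, mean_len) / hours


total = bases_per_run(READS_PER_RUN, MEAN_READ_LENGTH_BP)
rate = bases_per_hour(READS_PER_RUN, MEAN_READ_LENGTH_BP, RUN_HOURS)
print(f"{total:,} bases per run")      # 100,000,000 bases per run
print(f"{rate:,.0f} bases per hour")   # 13,333,333 bases per hour
```

Under these assumptions one run yields on the order of 100 million bases, which gives a sense of why analysing reads one at a time on a single machine becomes a bottleneck.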

We used the software platform developed in the Virtual Laboratory for e-Sciences for medical imaging (functional MRI) and applied it to DNA sequence analysis. The data is located on grid storage resources, and the data analysis pipeline is implemented by workflows that combine services to perform data preparation (format conversion and pattern matching in the sequences) and sequence alignment. The workflows are prepared in the Taverna workbench, described using the Scufl language, and executed on the grid with the MOTEUR workflow engine. The user can start and monitor grid workflows from the graphical user interface of the Virtual Resource Browser (VBrowser). With this set-up we benefit from the large storage and computing capacity of the grid. Using workflows and packaging the applications as services allow tools to be shared and already built components to be reused more easily, which will facilitate building an extensive toolkit for next generation sequencing.
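The pipeline stages named above (format conversion, pattern matching, alignment against a reference) can be pictured as small composable steps. The sketch below is purely illustrative and is not the Perl applications, the Scufl workflows or the grid services described in this abstract: all function names, the tag pattern and the toy sequences are hypothetical, and the naive ungapped alignment is a stand-in for a real aligner.

```python
import re
from typing import Dict, Iterator, List, Optional, Tuple

Mismatch = Tuple[int, str, str]  # (reference position, reference base, read base)


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Format conversion step: yield (read_id, sequence) pairs from FASTA text."""
    read_id, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if read_id is not None:
                yield read_id, "".join(chunks)
            read_id, chunks = line[1:].split()[0], []
        else:
            chunks.append(line.upper())
    if read_id is not None:
        yield read_id, "".join(chunks)


def strip_tag(seq: str, tag_pattern: str) -> Optional[str]:
    """Pattern matching step: drop a leading tag, or return None if it is absent."""
    match = re.match(tag_pattern, seq)
    return seq[match.end():] if match else None


def align_ungapped(read: str, reference: str) -> Tuple[int, List[Mismatch]]:
    """Alignment step (naive): slide the read along the reference and keep
    the first offset with the fewest mismatches."""
    best = None
    for offset in range(len(reference) - len(read) + 1):
        mismatches = [(offset + i, reference[offset + i], base)
                      for i, base in enumerate(read)
                      if reference[offset + i] != base]
        if best is None or len(mismatches) < len(best[1]):
            best = (offset, mismatches)
    return best


def run_pipeline(fasta_text: str, reference: str,
                 tag_pattern: str) -> Dict[str, Tuple[int, List[Mismatch]]]:
    """Chain the three steps; reads without the tag or longer than the
    reference are skipped."""
    results = {}
    for read_id, seq in parse_fasta(fasta_text):
        trimmed = strip_tag(seq, tag_pattern)
        if not trimmed or len(trimmed) > len(reference):
            continue
        results[read_id] = align_ungapped(trimmed, reference)
    return results


# Toy data, invented for illustration only.
REFERENCE = "ACGTACGTTAGC"
READS = """>r1
ACACGTACGTTA
>r2
ACACGTACCTTA
>r3
GGGGACGT
"""
results = run_pipeline(READS, REFERENCE, tag_pattern="ACAC")
for read_id, (offset, mismatches) in results.items():
    print(read_id, offset, mismatches)
```

In this toy run, r1 aligns without mismatches, r2 shows one candidate variant, and r3 is dropped because it lacks the tag. Each function has a single input and output, which is roughly what makes such steps easy to wrap as independent services and run on many reads in parallel.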

References:
- NBIC, www.nbic.nl
- Big Grid, the Dutch e-science Grid, http://www.biggrid.nl/
- Virtual Laboratory for e-Sciences, http://www.vl-e.nl
- Virtual Laboratory for functional MRI, http://www.science.uva.nl/~silvia/vlfmri
- Taverna, http://taverna.sourceforge.net/
- VBrowser, http://www.science.uva.nl/~ptdeboer/vlet
- MOTEUR, http://rainbow.i3s.unice.fr/wiki/dokuwiki/doku.php?id=public_namespace:moteur


Andra Waagmeester 2009-01-22