Translational Bioinformatics

Translational bioinformatics is a recently coined term for the emerging field that applies bioinformatics techniques to healthcare, with the aim of translating genomics data produced at large scale into practical health benefits and clinical use.

The major challenges in translational bioinformatics include the analysis of huge amounts of data and the integration of heterogeneous, distributed data resources, ranging from disease and demographics databases on the clinical side to literature, sequence, structure, and pathway databases on the biological side.

 

The translational bioinformatics team of the Saudi Genome Project works closely with the biomedical and genomics teams to translate the next-generation sequencing (NGS) data produced into information and knowledge relevant to people's health. The team is responsible for the following:

·       Establishing and maintaining the necessary computational infrastructure,
·       Analysing the resulting NGS data, and
·       Interpreting the results and generating knowledge from them.

 

  Computational Infrastructure

Our infrastructure is characterized by a number of features that make its establishment a particularly interesting informatics challenge.

Distribution

·       Our infrastructure is distributed over several satellite sites, each with different infrastructure capabilities. Each satellite site runs certain primary analysis pipelines.

·       A central computational unit is located at KACST. All centres should be connected to it for backup and for secondary analysis tasks.

·       The type and rate of data produced differ between centres, and the connectivity among the sites varies in speed.

In other words, our computational infrastructure is distributed by nature, but it should work in a synchronized fashion to balance the computational load and achieve the best performance.
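As a rough illustration of the load-balancing idea, the Python sketch below assigns analysis jobs to the site that would end up least loaded relative to its capacity. It is not the project's scheduler: the site names, capacities, and job costs are hypothetical, and a real scheduler would also have to consider data locality and link speed between sites.

```python
def assign_jobs(site_capacity, job_costs):
    """Greedy balancing sketch (hypothetical units and names).

    site_capacity: dict of site name -> relative compute capacity
    job_costs:     dict of job name  -> estimated cost in the same units
    Returns a dict of site name -> list of job names assigned to it.
    """
    load = {site: 0.0 for site in site_capacity}
    assignment = {site: [] for site in site_capacity}

    # Place the most expensive jobs first; ties are broken by job name.
    for job, cost in sorted(job_costs.items(), key=lambda kv: (-kv[1], kv[0])):
        # Pick the site whose load/capacity ratio is lowest after adding the job.
        best = min(
            sorted(site_capacity),
            key=lambda s: (load[s] + cost) / site_capacity[s],
        )
        load[best] += cost
        assignment[best].append(job)
    return assignment


# Hypothetical sites and jobs, for illustration only.
sites = {"central": 4.0, "satellite_a": 1.0, "satellite_b": 2.0}
jobs = {"genome_1": 8.0, "genome_2": 6.0, "exome_1": 2.0,
        "exome_2": 2.0, "panel_1": 1.0}
plan = assign_jobs(sites, jobs)
for site, assigned in plan.items():
    print(site, assigned)
```

Normalizing load by capacity lets a small satellite site receive less work than the central unit instead of an equal share.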

Scalability

The infrastructure is designed to process the thousands of genomes generated each year. Its modular design allows hardware expansion with minimal effort. Furthermore, to handle abrupt increases in load and to avoid downtime in case of failure, the infrastructure can automatically and elastically scale up using cloud computing resources.
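The elastic scale-up decision can be sketched as a simple capacity calculation: estimate how many nodes are needed to clear the queue by a deadline, then request from the cloud whatever the healthy local nodes cannot cover. The function name, parameters, and throughput figures below are assumptions made for illustration, not values from the project.

```python
import math


def cloud_nodes_needed(queued_genomes, healthy_local_nodes,
                       genomes_per_node_per_day, deadline_days):
    """Return how many extra cloud nodes to request (hypothetical model).

    Assumes every node, local or cloud, processes the same number of
    genomes per day, which is a simplification.
    """
    per_node_capacity = genomes_per_node_per_day * deadline_days
    required_nodes = math.ceil(queued_genomes / per_node_capacity)
    # Never return a negative number when local capacity is sufficient.
    return max(0, required_nodes - healthy_local_nodes)


# Normal load: local nodes are enough, so no cloud nodes are requested.
normal = cloud_nodes_needed(40, 10, 2, 5)
# Abrupt increase in load: the queue exceeds local capacity.
burst = cloud_nodes_needed(120, 10, 2, 5)
# Partial failure: fewer healthy local nodes for the same queue.
failure = cloud_nodes_needed(120, 6, 2, 5)
print(normal, burst, failure)
```

The same calculation covers both cases named above: a larger queue raises the number of required nodes, while a failure lowers the number of healthy local nodes.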

  Analysis Pipelines

The current basic pipelines in use for the Saudi Genome Project include the following:

·       Whole genome analysis, where the resulting NGS reads are mapped to a reference human genome to identify point mutations as well as segmental mutations.

·       Exome and target gene(s) analysis, where the resulting NGS reads are mapped to a reference human genome to identify point mutations and (relatively small) segmental mutations.

·       Variant annotation pipeline, where the variants are annotated with information from public and proprietary databases that indicates their involvement in the disease process.
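To make the annotation step concrete, the minimal Python sketch below looks up each called variant in an annotation table keyed by chromosome, position, reference allele, and alternate allele. The table contents, gene names, and disease labels are invented placeholders; a production pipeline would query real public and proprietary databases and handle many more fields.

```python
def annotate(variants, knowledge_base):
    """Attach gene and disease information to each variant.

    variants:       list of dicts with keys chrom, pos, ref, alt
    knowledge_base: dict keyed by (chrom, pos, ref, alt) -> annotation dict
    Variants without a matching entry get None for the annotation fields.
    """
    annotated = []
    for v in variants:
        key = (v["chrom"], v["pos"], v["ref"], v["alt"])
        info = knowledge_base.get(key, {})
        annotated.append({
            **v,
            "gene": info.get("gene"),
            "disease": info.get("disease"),
        })
    return annotated


# Placeholder annotation table; entries are not real variants.
kb = {
    ("chr1", 1000, "A", "G"): {"gene": "GENE_A", "disease": "example disorder"},
}
calls = [
    {"chrom": "chr1", "pos": 1000, "ref": "A", "alt": "G"},
    {"chrom": "chr2", "pos": 2000, "ref": "C", "alt": "T"},
]
result = annotate(calls, kb)
for row in result:
    print(row)
```

Keying on the full (chromosome, position, reference, alternate) tuple avoids matching a different substitution at the same position.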