Summary  

As the Human Genome Program begins its final assault on determining the DNA sequence of all human chromosomes, an enormous, ever-increasing amount of genomic data is being generated. However, because this data comes from a large number of laboratories, it has become difficult for researchers to manage and use it. With this in mind, we began a project (eGenome) to bring together relevant data from many sources for a given chromosome and to present it in a manner that is easy to use and understand. eGenome creates what we call views of individual chromosomes: multidimensional representations that encompass genetic, physical, functional, cytogenetic, and clinical perspectives.


  What is eGenome?

eGenome is a method for compiling, analyzing, and presenting information about genomes. The result of this method is an integrated data set for each chromosome. This data set includes genomic elements; large-insert genomic clones; DNA sequences; DNA variations; cytogenetic, genetic, and physical localizations of elements on a chromosome; and information associated with each of these elements. The data set resides in a relational database (CompDB) and can be searched and viewed both textually and graphically. In addition to the essential data that eGenome itself contains for each genomic element, it presents a large number of element-specific links to additional information housed in other on-line databases. Thus, eGenome provides both an immediate summary of genes and other genomic elements and a comprehensive portal to additional information throughout the Internet. This Website serves as an intermediary between the database itself and the user.


  Procedures

The eGenome procedure consists of several linked methods: 1) compilation, 2) analysis, 3) integration, and 4) presentation.

1) COMPILATION

To calculate comprehensive views of chromosomes, eGenome first compiles genomic data from various existing sources. There is no new data generation, only new data analysis. The current sources of data that eGenome uses include:

Although several of the data sets listed above have been placed individually into separate resources, only eGenome integrates and delivers the totality of this information via a single interface. Note that these data sets are derived from several different experimental procedures, including radiation hybrid, genetic linkage, physical, cytogenetic, and sequence-based mapping techniques. As such, each represents a slightly different perspective of a given chromosome, with its own strengths and weaknesses. A major objective of eGenome is to integrate these perspectives smoothly and thus combine the strengths that each technique offers. Pooling the data sets also has several other benefits: it achieves greater marker and clone coverage, makes possible the creation of higher-resolution and better-supported maps, and allows for systematic management of genomic information (such as keeping track of marker names).

2) ANALYSIS

Another goal of eGenome is to place all genomic data relative to a single unifying scale, which can be defined by the draft human genomic sequence. We also triangulate localization data derived from other experimental techniques (genetic linkage, cytogenetic, and radiation hybrid analyses) relative to the sequence localizations. This procedure provides independent quality assurance that defined stretches of experimental or functional significance, such as a marker or gene, localize accurately within the genome. It also allows quick identification of elements whose independent localizations are discordant.
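
As a minimal sketch of how discordant elements might be flagged, assume each element carries a sequence position and a radiation hybrid (RH) map position. The `Localization` type, the element names, and the choice of a longest non-decreasing subsequence as the concordance test are illustrative assumptions, not eGenome's documented procedure.

```python
from typing import NamedTuple


class Localization(NamedTuple):
    name: str      # hypothetical element name
    seq_bp: int    # position in the draft sequence, in base pairs
    rh_pos: float  # position on an RH map, in arbitrary map units


def find_discordant(elements):
    """Return names of elements whose RH order disagrees with sequence order.

    Elements are sorted by sequence position. The longest subsequence whose
    RH positions are non-decreasing is treated as concordant; everything
    outside it is flagged.
    """
    ordered = sorted(elements, key=lambda e: e.seq_bp)
    n = len(ordered)
    if n == 0:
        return []
    best = [1] * n    # best[i]: length of the longest concordant run ending at i
    prev = [-1] * n   # prev[i]: predecessor of i in that run, or -1
    for i in range(n):
        for j in range(i):
            if ordered[j].rh_pos <= ordered[i].rh_pos and best[j] + 1 > best[i]:
                best[i] = best[j] + 1
                prev[i] = j
    # Walk back from the end of the longest run to collect its members.
    i = max(range(n), key=best.__getitem__)
    concordant = set()
    while i != -1:
        concordant.add(i)
        i = prev[i]
    return [e.name for k, e in enumerate(ordered) if k not in concordant]
```

With five elements whose RH positions, in sequence order, are 1, 2, 9, 3, and 4, only the third element is flagged, since the other four agree on a single order.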

In addition to identifying the best localizations possible, we took significant steps to clean up the data. To remove redundancies, all DNA markers were compared, and any markers with sequence identity were combined into groups called genomic elements. An element is defined in eGenome as a DNA segment which defines a unique location in the genome. Associated information for each element, such as primer sequences, lab of origin, and various names and database identifiers, was also collected. Any element sharing one or more markers also shares the information associated with each marker. This allows for convenient nomenclature management.
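
The grouping of markers into genomic elements can be sketched as follows. This is a minimal illustration that assumes the sequence comparison has already produced a list of marker pairs with sequence identity; the function name, the marker IDs, and the use of a union-find structure are assumptions made for the example, not a description of eGenome's actual code.

```python
from collections import defaultdict


def group_markers(markers, identical_pairs):
    """Merge markers with sequence identity into genomic elements.

    markers: dict mapping a marker ID to the set of names/identifiers known for it.
    identical_pairs: list of (marker_id, marker_id) pairs with sequence identity.
    Returns a list of elements, each pooling the names of all its markers.
    """
    parent = {m: m for m in markers}

    def find(m):
        # Follow parent links to the group representative, shortening the path.
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    for a, b in identical_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(list)
    for m in markers:
        groups[find(m)].append(m)

    elements = []
    for members in groups.values():
        aliases = set()
        for m in members:
            aliases |= markers[m]  # the element inherits every marker's names
        elements.append({"markers": sorted(members), "aliases": sorted(aliases)})
    return elements
```

Because identity is treated as transitive here, markers M1 and M3 end up in the same element whenever each matches M2, even if M1 and M3 were never compared directly.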

3) INTEGRATION

Once sequence-based localizations were identified for most genomic elements, we integrated these physical localizations with the genetic linkage (GL), radiation hybrid (RH), and cytogenetic positional information that existed for many of these elements. For each of these additional localization techniques, we created framework maps. A framework map consists of a subset of unique genomic markers whose linear order on the chromosome has been determined with high statistical probability, usually 1,000:1 odds for each adjacent pair. For both the RH and GL approaches, we used a method that ensures the framework contains as many markers as possible, which maximizes the resolution of the framework map and thus of the entire view. This approach builds a framework in successive rounds of mapping, placing additional markers on the framework in each round until no more can be placed with sufficient support. Cytogenetic frameworks were built by correlating existing RH marker positions with experimentally determined cytogenetic localizations. Cytogenetic bands were demarcated based upon the distribution of known markers; cytogenetic band assignments for all other markers were inferred from these predictions.

For each localization technique, the framework was then used to localize the remainder of the elements. For example, a polymorphic marker (X) could be placed between two adjacent framework markers (A and B). This would mean that marker X is located between markers A and B with odds greater than 1,000:1. In this way, a genomic element could have as many as four independently derived localizations, and many elements that could not be uniquely identified within the genomic sequence could still be localized.

Once the set of frameworks was established, large-insert clones and sequence contigs could be easily annotated onto the localization structure. Because BACs, PACs, and YACs have been identified both by using individual markers as probes and by determining their base pair positions in sequenced clones, these clones can be directly annotated to the appropriate elements. The same is true for SNPs and for EST clusters.
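
The placement of a non-framework marker such as X between framework markers A and B can be sketched as a support test. Odds of 1,000:1 correspond to a difference of 3 in log10 likelihood (a LOD of 3). The sketch below assumes that some mapping software has already supplied a log10 likelihood for placing the marker in each candidate interval; the function name, its inputs, and the comparison against the second-best interval are hypothetical, since the text does not state how support was computed.

```python
def place_in_framework(interval_log10_lik, min_lod=3.0):
    """Pick the framework interval for a marker, or None if support is too weak.

    interval_log10_lik: dict mapping (left_marker, right_marker) to the log10
        likelihood of the marker lying in that interval (hypothetical input).
    min_lod: required log10 likelihood difference between the best and the
        second-best interval; 3.0 corresponds to 1,000:1 odds.
    """
    if not interval_log10_lik:
        return None
    ranked = sorted(interval_log10_lik.items(), key=lambda kv: kv[1], reverse=True)
    best_interval, best_ll = ranked[0]
    # With a single candidate there is no alternative to compare against,
    # so this sketch accepts it; a stricter policy could return None instead.
    if len(ranked) == 1 or best_ll - ranked[1][1] >= min_lod:
        return best_interval
    return None
```

A marker that fails the test is left unplaced on that framework rather than assigned to its most likely interval, which keeps weakly supported positions out of the map.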

To represent larger genomic structures, such as genes, we searched the entire set of data for matches between different elements. Elements sharing marker names, database identifiers, or EST cluster sequences were then grouped into bundles. As an example, suppose markers A and B are derived from two different EST sequences (X and Y). If sequences X and Y were assembled into the same EST cluster by sequence homology, then markers A and B both belong to the same cluster, and presumably to the same gene.
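
The example of markers A and B above can be sketched as a search for connected groups of elements. The sketch assumes each element is described by a set of string keys (names, database identifiers, EST cluster IDs); the key format and the function name are hypothetical.

```python
from collections import defaultdict, deque


def build_bundles(elements):
    """Group elements that share any key, directly or through a chain of shares.

    elements: dict mapping an element ID to its set of keys, such as marker
        names, database identifiers, or EST cluster IDs.
    Returns a list of bundles, each a sorted list of element IDs.
    """
    # Inverted index: key -> every element that carries it.
    by_key = defaultdict(set)
    for elem, keys in elements.items():
        for key in keys:
            by_key[key].add(elem)

    seen = set()
    bundles = []
    for start in elements:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        bundle = []
        # Breadth-first walk over elements reachable through shared keys.
        while queue:
            elem = queue.popleft()
            bundle.append(elem)
            for key in elements[elem]:
                for other in by_key[key]:
                    if other not in seen:
                        seen.add(other)
                        queue.append(other)
        bundles.append(sorted(bundle))
    return bundles


example = {
    "A": {"est:X", "cluster:1"},
    "B": {"est:Y", "cluster:1"},
    "C": {"est:Z", "cluster:2"},
}
print(build_bundles(example))  # [['A', 'B'], ['C']]
```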

The essential points here are that: 1) all elements are placed at a precise sequence position if possible; 2) elements may also be localized by additional independent means, which provides quality control for positioning; and 3) associated data such as cytogenetic localizations, EST clusters, and large-insert clones are linked to the appropriate elements.

4) PRESENTATION

This Website provides a link between CompDB, which stores the information comprising a chromosome view, and you, the user. The user defines a set of search criteria, and the database returns the requested information in the form of text or graphics. The best way to understand this process is to have some concept of how the data is organized in the database. Each genomic element (an RH marker, polymorphism, transcript, or bundle) has its own record, which also includes associated information such as name, primer sequences, cytolocation, associated large-insert clone, lab source, etc.

For any search that finds only a single element, that element's record is translated into a web page which displays all of the information available for that element. This web page is further divided into a set of tabs, each of which displays a subset of the data pertaining to that element. The tabs correspond to specific data subcategories, such as "Position" and "Clones and Markers". These individual records also contain links to external databases, such as GenBank and UniGene, which lead directly to additional information about that specific element. For searches that identify more than one element, a summary table of all of the elements is instead shown. As with the individual record, the summary table is also divided into tabs that separate the elements by category. This summary table includes only some of the information available for each element as well as a link to each element's individual record.
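
The choice between an individual record and a summary table can be sketched as below. The `ElementRecord` class, the `TAB_FIELDS` mapping of fields to tabs, and the returned dictionary layout are all hypothetical; only the tab names "Position" and "Clones and Markers" come from the description above. The behaviour for a search with no hits is not described in the text, so the sketch simply returns an empty summary.

```python
from dataclasses import dataclass, field


@dataclass
class ElementRecord:
    name: str
    element_type: str  # e.g. "RH marker", "polymorphism", "transcript", "bundle"
    data: dict = field(default_factory=dict)  # associated information


# Hypothetical assignment of record fields to the tabs of an individual record.
TAB_FIELDS = {
    "Position": ("cytolocation",),
    "Clones and Markers": ("primer sequences", "large-insert clone"),
}


def render_search(results):
    """Return a full record for a single hit, otherwise a summary by category."""
    if len(results) == 1:
        rec = results[0]
        tabs = {
            tab: {k: rec.data[k] for k in keys if k in rec.data}
            for tab, keys in TAB_FIELDS.items()
        }
        return {"view": "record", "name": rec.name, "tabs": tabs}
    # Several hits (or none): one summary tab per element category.
    tabs = {}
    for rec in results:
        tabs.setdefault(rec.element_type, []).append(rec.name)
    return {"view": "summary", "tabs": tabs}
```

In a real interface each summary entry would also carry a link to the element's individual record; the sketch keeps only the names to stay short.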

Both of these searches are text-based. An alternative is a graphical result: a search defines a region, all or a subset of the element types within that region are found, and the data is translated into a graphical map of the region. These maps are viewed with the Java applet Chromoscape. The graphics themselves can be customized by the viewer, and by clicking on an individual element within the graphic, the user can view the individual record for that element.

Searches can be conducted through three different interfaces on this Website:

Simple text searches, where the user types information into a text box
Searches for chromosomal regions, where the user defines a region between two elements or map positions and specifies whether a text-based or graphical result is desired
Data repository, which provides access to compiled files of eGenome's database contents

  How to use eGenome

A description of eGenome's mission can be found in the Introduction section. A description of the computational process behind eGenome can be found in the Methods section.

The eGenome Website consists of four sections. A complete site index can be found here. Detailed instructions for how to use eGenome are available in the help section. Navigation of the site is best accomplished using the title bar at the top of each page.

Database searches: Interfaces for extracting data from eGenome and CompDB.
Quick Search | Region search | List whole chromosome | View whole chromosome

eGenome information: Description of the site and detailed explanations of the eGenome process. Introduction | Overview | Methods | About us | Contact us

Data: Access to the raw data, numerical summaries of the chromosome view, and analysis of transcript density and cytogenetic banding patterns. Data summary | Data repository

Index: Site-specific information and navigation. Site index | What's new | Acknowledgments

eGenome Help: Detailed explanations of various site features. Help

For more information, contact the eGenome staff.
Except as otherwise indicated, Copyright 2005, The Children's Hospital of Philadelphia