Data collection and management

The central integrating data sets used for eGenome are from the public DNA sequence assembly, UniSTS, the Genome Database (GDB), the Radiation Hybrid Database (RHdb), MAP-O-MAT, dbSNP, and UniGene. Raw data from the most recent public DNA sequence assembly (Lander et al., 2001; Kent and Haussler, 2001), RHdb version 16 (Lijnzaad et al., 1998), and a recent UniGene build (Boguski and Schuler, 1995) were imported into CompDB, a customized relational database (White et al., 1999). CompDB functions variously as a data repository, a data parser and organizer, an analytical tool, and a server front end.

  Sequence position assignments

Integration with the draft human genomic sequence was accomplished by determining the sequence positions of RH, STS (from UniSTS and GDB), genetic linkage (GL), and SNP elements. Primer sequences corresponding to all RH and most GL elements in CompDB were used to define each element. These primer sequences were used to search for sequence positions with the e-PCR algorithm (Schuler, 1997). Elements with perfect matches for both primer sequences, the proper orientation, and the correct sequence distance between primers (e-PCR parameter N00M1000W07) were considered to have sequence localizations. In some instances more than one match was found, indicating that the primer sequences were not unique. In these cases, all matches were reported. The sequence used for this determination was a recent NCBI sequence assembly, distributed by UCSC (see Overview for the current version being used). For SNPs, the sequence positions already identified by dbSNP were used instead. Sequence positions are listed in base pairs from the p terminus of the identified chromosome.
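The matching criteria described above can be illustrated with a small Python sketch. This is not the e-PCR program or the eGenome pipeline; it is a simplified exact-match search written for illustration. The function names, the coordinate convention (1-based, inclusive), and the reading of the parameter string as "no mismatches, 1000 bp size margin" are assumptions.

```python
def revcomp(seq):
    """Return the reverse complement of a DNA string."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def find_all(text, pattern):
    """Return every 0-based start index of an exact match, overlaps included."""
    hits, start = [], text.find(pattern)
    while start != -1:
        hits.append(start)
        start = text.find(pattern, start + 1)
    return hits


def primer_pair_hits(sequence, fwd, rev, expected_size, margin=1000):
    """Report (start, end, size, strand) for every perfect primer-pair match.

    Both primers must match exactly, face each other, and enclose a product
    whose size is within `margin` bp of `expected_size` (assumed meaning of
    the size-margin setting). All matches are returned, not just the first.
    """
    sequence = sequence.upper()
    hits = []
    # "+" product: fwd primer lies upstream; "-" product: rev primer lies upstream.
    for left, right, strand in ((fwd, rev, "+"), (rev, fwd, "-")):
        left = left.upper()
        right_rc = revcomp(right)
        for start in find_all(sequence, left):
            for r in find_all(sequence, right_rc):
                end = r + len(right_rc)  # exclusive 0-based end
                size = end - start
                long_enough = size >= len(left) + len(right_rc)
                if long_enough and abs(size - expected_size) <= margin:
                    hits.append((start + 1, end, size, strand))
    return hits
```

An element with exactly one hit has a unique sequence localization; an element with several hits corresponds to the non-unique primer case, where every match is reported.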


  RH framework construction

All RH markers used to score the GB4 RH panel were analyzed for primer sequence identity and assembled into a unique marker subset. In cases where a marker had more than one RH vector, the most informative RH vector was selected as the representative for the marker. The redundancy level in RHdb (22%) was determined in this manner. The non-redundant whole-genome marker set was grouped into chromosome-specific sets by linkage grouping, with markers assigned to a group only if they linked specifically to many other markers in the group with a high LOD score. Scoring data for marker subsets were then analyzed sequentially with MultiMap. For each chromosome, a small set (20-70) of Généthon microsatellite markers was used as an initial framework map. The order of markers on this map was supported with odds greater than 1000:1 and was in complete agreement with the genetic linkage-determined order (Dib et al., 1996). Markers were then analyzed against the initial framework using MultiMap to determine whether they could be added to a unique position on the initial framework with statistical support. An RH framework was first constructed by adding markers with an odds threshold of >10,000:1, and then with odds of >1000:1, in an iterative process that preferentially added polymorphic markers. Each marker on the resulting framework was then individually removed from the framework and re-mapped. Markers not placed at the same unique position with >1000:1 odds were removed from the framework. Final frameworks averaged approximately 1 marker per megabase.
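The redundancy-removal step at the start of this procedure can be sketched as follows. The text does not define "most informative"; this sketch assumes it means the vector with the fewest ambiguous scores, and assumes vectors are strings of per-hybrid scores in which "2" marks an ambiguous or untyped hybrid. The function name and record layout are hypothetical.

```python
from collections import defaultdict


def collapse_redundant(markers):
    """Group markers by identical primer pair and keep one vector per group.

    markers: list of (name, fwd_primer, rev_primer, rh_vector) tuples.
    Returns (representatives, redundancy), where representatives maps each
    primer pair to the (name, vector) with the fewest "2" scores (assumed
    definition of "most informative") and redundancy is the fraction of
    records that duplicate another record's primer pair.
    """
    groups = defaultdict(list)
    for name, fwd, rev, vector in markers:
        groups[(fwd.upper(), rev.upper())].append((name, vector))
    representatives = {
        primers: min(members, key=lambda m: m[1].count("2"))
        for primers, members in groups.items()
    }
    redundancy = 1 - len(groups) / len(markers)
    return representatives, redundancy
```

For reference when reading the thresholds above, odds of 1000:1 correspond to a LOD difference of 3 (log10 of 1000), and odds of 10,000:1 to a LOD difference of 4.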


  RH tier

We then calculated the 1000:1 likelihood intervals of all GB4 RH elements remaining in a linkage group relative to the framework for each chromosome. Generally, over 90% of all RH elements were placed in intervals relative to the RH framework, and approximately 70% of markers had absolute and unique matches to the draft genome sequence. The calculated location of each intervaled marker necessarily spanned three or more RH framework positions, because markers localized between two adjacent framework markers with >1000:1 likelihood would have been added directly to the framework. Elements with only G3 RH data were placed by DNA sequence position only, not by RH mapping.
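A minimal sketch of how a 1000:1 likelihood interval can be derived, assuming that a log10 likelihood is already available for placing the marker in each gap between adjacent framework markers (in practice these values would come from the mapping software). The function name and the input layout are assumptions for illustration.

```python
import math


def support_interval(log10_likelihoods, odds=1000.0):
    """Return the framework positions (left, right) flanking a marker.

    log10_likelihoods[i] is the log10 likelihood of placing the marker in the
    gap between framework positions i and i + 1. A gap is retained unless the
    best placement is favored over it by more than `odds`.
    """
    threshold = max(log10_likelihoods) - math.log10(odds)
    supported = [i for i, ll in enumerate(log10_likelihoods) if ll >= threshold]
    return min(supported), max(supported) + 1
```

With this convention, a result such as (1, 2) is a unique placement between two adjacent framework markers, while a result such as (1, 4) spans framework positions 1 through 4 and corresponds to an intervaled marker.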


  EST cluster integration

Next, we integrated our RH tier, which is largely composed of elements representing known genes and ESTs, with UniGene EST clusters (Boguski and Schuler, 1995). We compared the DNA sequence IDs associated with each RH element and EST cluster. EST clusters containing an EST sequence identical to one assigned to a mapped RH element were associated with that element. Establishing these links between the RH and transcript tiers simultaneously relates the physically based RH placements to the functional data associated with each EST cluster. Gene symbols and gene descriptions for UniGene clusters were also assigned to their corresponding eGenome elements.
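The ID comparison described here amounts to a join on shared sequence identifiers. The sketch below shows one way to express it; the function name, the dictionary layout, and the example identifiers are hypothetical and are not taken from CompDB.

```python
def link_clusters(element_accessions, cluster_accessions):
    """Associate each RH element with the EST clusters sharing a sequence ID.

    element_accessions: {element_id: set of sequence accessions}
    cluster_accessions: {cluster_id: set of sequence accessions}
    Returns {element_id: sorted list of matching cluster_ids}.
    """
    # Invert the cluster table once so each accession lookup is constant time.
    accession_to_clusters = {}
    for cluster, accessions in cluster_accessions.items():
        for accession in accessions:
            accession_to_clusters.setdefault(accession, set()).add(cluster)

    links = {}
    for element, accessions in element_accessions.items():
        matched = set()
        for accession in accessions:
            matched |= accession_to_clusters.get(accession, set())
        links[element] = sorted(matched)
    return links
```

An element that maps to an empty list has no transcript-tier link; an element that maps to more than one cluster shares sequences with several clusters.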


  Genetic linkage tier

The genetic linkage tier is the Rutgers Combined Linkage-Physical Map of the Human Genome (Kong et al., 2004). This map integrates polymorphic marker genotyping data with DNA sequence data. The meiotic data consist of 14,759 markers genotyped in the CEPH and/or deCODE pedigrees and include simple sequence repeats, RFLP/VNTRs, and SNPs. We believe this set represents the largest combined collection of genotype data. A unique physical position was identified from genome assembly Build 35 for 94% of PCR-based markers. The markers were carefully checked for Mendelian inconsistencies. Chromosome assignments were initially determined by physical data, when available, and otherwise by assignment on previously published linkage maps. Two-point linkage was used to confirm these assignments. Sequence-based positions were used to determine an initial map, and linkage analysis using the meiotic data was used to either confirm or reject the initial positions. Post-mapping error checks were performed, and those markers whose sequence-based positions were not supported by the meiotic data were then tested for addition to the map using the meiotic data only. The final map contains 11,990 markers whose physical map positions are corroborated by recombination-based mapping data, representing 95% of the markers for which sequence positions were available. The remaining markers (those for which sequence-based positions could not be determined) were added to the map using meiotic data. The average distance between markers is 246 kb, or 0.3 cM.
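As an illustration of what a Mendelian inconsistency check involves, the sketch below tests father-mother-child trios at an autosomal marker. It is not the procedure used to build the Rutgers map; the function names and the genotype representation (a pair of allele labels per individual) are assumptions.

```python
def mendelian_consistent(father, mother, child):
    """Return True if the child could have inherited one allele from each parent.

    Each genotype is a pair of allele labels, e.g. (1, 3). Autosomal markers
    only; missing genotypes are not handled in this sketch.
    """
    a, b = child
    return (a in father and b in mother) or (b in father and a in mother)


def inconsistent_markers(trios_by_marker):
    """List the markers with at least one inconsistent trio.

    trios_by_marker: {marker: [(father, mother, child), ...]}
    """
    return sorted(
        marker
        for marker, trios in trios_by_marker.items()
        if not all(mendelian_consistent(*trio) for trio in trios)
    )
```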


  SNP integration

SNPs were integrated as separate elements by importing all reference SNPs from dbSNP. This SNP set was localized to the genome by using the dbSNP-reported sequence positions within the same draft genomic assembly that eGenome currently uses. All SNP-related information reported in eGenome, including proximity to genic structures, was generated by dbSNP, with the exception of GL positions generated by eGenome for the small subset of SNPs used for GL mapping above. Note that each dbSNP reference SNP is a collection of one or more individually reported SNPs (ssSNPs) that have been computationally determined to represent the same variation. In eGenome, we report and provide external links to data for many of the ssSNPs, and these externally reported data are not necessarily fully representative of the respective reference SNP displayed in eGenome.


  Cytogenetic assignments and analysis

Adequate integration with cytogenetic coordinates is often neglected in genomic maps. We collected and used marker-based cytogenetic localizations from three sources: the Genome Database, the BAC Resource Consortium, and MPIMG (Letovsky et al., 1998; Wirth et al., 1999; Cheung et al., 2001). These assignments were matched to their associated markers. This cytogenetic set was then edited to exclude spurious assignments. The cytogenetic assignment of each marker was compared with its RH- and sequence-determined positions, and we excluded markers spanning more than one cytogenetic band, markers whose positions lay outside a confidence interval calculated for each band, and markers whose cytogenetic position did not agree with the sequence- and RH-derived positions (manuscript in preparation). This created a reduced set of approximately 3,000 cytogenetic landmarks that was sorted into chromosome-specific sets. Each chromosome set comprised a cytogenetic framework that could be ordered using the corresponding RH position of each marker. Using the band assignments for each of these cytogenetic framework markers, we then assigned band positions, either a single band or a range of bands, to each RH framework marker position. Once the RH framework markers were assigned band coordinates, cytolocations for the remaining intervaled RH and polymorphic markers could be inferred. Cytolocations for SNPs were determined from their genomic sequence position relative to the flanking genomic sequence positions determined for adjacent RH framework markers. In this way we inferred cytolocations for all elements in eGenome, with the majority of elements assigned to a single cytogenetic band. In addition, we report all cytogenetic localizations determined by any of the 9 groups contributing cytogenetic data for each element, if known.
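The band-inference step can be pictured as a lookup of the two cytogenetic landmarks that flank a given map position. The sketch below assumes that landmarks are already filtered and sorted by RH position; the function name, the tuple layout, and the rule of reporting the flanking bands as a range are illustrative assumptions rather than the published procedure.

```python
import bisect


def infer_band(position, landmarks):
    """Infer a (first_band, last_band) range for a map position.

    landmarks: list of (rh_position, band) tuples sorted by rh_position.
    A position that coincides with a landmark takes that landmark's band;
    otherwise the bands of the nearest flanking landmarks are returned.
    Positions beyond either end take the band of the terminal landmark.
    """
    positions = [p for p, _ in landmarks]
    i = bisect.bisect_left(positions, position)
    if i < len(landmarks) and positions[i] == position:
        band = landmarks[i][1]
        return band, band
    left_band = landmarks[max(i - 1, 0)][1]
    right_band = landmarks[min(i, len(landmarks) - 1)][1]
    return left_band, right_band
```

When both flanking landmarks lie in the same band the result is a single-band assignment; otherwise it is a range of bands.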


  DNA clone assignments

Integration with the physical map was achieved by identifying the large-insert clones containing each RH and polymorphic marker. For each element with an e-PCR-determined sequence position in the UCSC sequence assembly, the large-insert clone that was used to determine that portion of the sequence was identified. This was performed by first determining the beginning and end position of each clone sequence comprising a sequence contig, then determining where each sequence contig began and ended relative to the entire chromosome assembly, and finally determining which clone sequence(s) included the position for a specific element. This information was obtained from UCSC (Lander et al., 2001; McPherson et al., 2001). Additional information linking each BAC or PAC clone sequence to the clone's physical characteristics was added from the NCBI Clone Registry. We also included any large-insert clones that had been used for cytogenetic band localization by the data sets used for cytogenetic map integration. These clone integrations proceeded either by cross-referencing the DNA sequence positions of the clones with those of eGenome elements, or by directly linking markers known to be contained within the clones with the corresponding eGenome elements for the markers. In addition, YAC clones that had been identified by the WICGR in a whole-genome search using a subset of the RHdb markers were added (Hudson et al., 1995). YACs were integrated by comparing the primer sequences used by WICGR to identify each YAC with marker primers in our database.
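The three-step coordinate lookup described above (clone within contig, contig within chromosome, element within clone) can be sketched as follows. The input layout, the function name, and the 1-based inclusive coordinate convention are assumptions made for the example; the actual assembly files may be organized differently.

```python
def clones_containing(position, contigs):
    """Return the names of all clone sequences covering a chromosome position.

    contigs: list of (contig_start_on_chromosome, clones), where clones is a
    list of (clone_name, start_in_contig, end_in_contig). All coordinates are
    assumed to be 1-based and inclusive.
    """
    hits = []
    for contig_start, clones in contigs:
        for name, start, end in clones:
            # Translate contig-relative clone coordinates to the chromosome.
            chrom_start = contig_start + start - 1
            chrom_end = contig_start + end - 1
            if chrom_start <= position <= chrom_end:
                hits.append(name)
    return hits
```

Because adjacent clone sequences overlap, a single element position can legitimately return more than one clone.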


  Element bundling

So far, we had assigned positions to each unique element, with each of these elements localized by a sequence position, a specific framework position, or by a range of framework positions. Next, we defined bundles, which are groups of markers representing higher-order, usually transcribed, genomic structures. Each element in CompDB was annotated with a series of database identifiers (IDs) from the original RHdb records of all markers comprising the amplimer. Also recorded in each element record were UniGene cluster assignments and associated NCBI labels, corresponding gene names or gene symbols, and all GenBank accession IDs for transcript, EST, and STS sequences assigned to the UniGene cluster. A cumulative list of IDs was then compiled from all elements in CompDB. Elements sharing IDs (essentially sharing an identical name, DNA sequence fragment, or EST cluster) were grouped into bundles which presumably represented transcripts or other functional genomic elements. Our bundling procedure therefore identifies inter-element relatedness.
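Grouping elements that share any identifier is a connected-components problem. The sketch below uses a union-find structure to do this; it is one possible way to express the bundling idea, not the CompDB implementation, and the function name and example identifiers are hypothetical.

```python
def bundle_elements(element_ids):
    """Group elements that share at least one identifier, directly or transitively.

    element_ids: {element: set of IDs (names, accessions, cluster IDs)}
    Returns a sorted list of bundles, each a sorted list of element names.
    """
    parent = {element: element for element in element_ids}

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_owner = {}  # first element seen carrying each ID
    for element, ids in element_ids.items():
        for ident in ids:
            if ident in first_owner:
                parent[find(element)] = find(first_owner[ident])
            else:
                first_owner[ident] = element

    bundles = {}
    for element in element_ids:
        bundles.setdefault(find(element), set()).add(element)
    return sorted(sorted(members) for members in bundles.values())
```

Note that sharing is transitive in this sketch: two elements with no ID in common still land in the same bundle if a third element shares an ID with each of them.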


  Element and bundle name assignments

Many markers are associated with multiple names, and a single name is also very commonly associated with multiple markers. Sorting through redundant and often incompletely cross-referenced nomenclature for a given locus can be a difficult and tedious undertaking. When importing element IDs into CompDB, we were careful to minimize the number of ID sources so as to maintain integrity in data representation. Thus, ID-mediated data links were built from the original RHdb records, and links were established only through identical matches between both primer pair sequences for a given marker, or between shared GenBank accession numbers. We found that by implementing this high level of stringency for relating data records, we could prevent import of most incorrect data associations without significantly decreasing the overall number of associations. To select suitable element names, we devised an algorithm that sorts through all IDs associated with each marker comprising an element and selects the most appropriate name according to a predetermined name source hierarchy. This hierarchy uses approved HUGO nomenclature whenever possible (White et al., 1997). Bundles were named in a similar manner by selecting from the pool of names for all elements comprising each bundle.
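Name selection by source hierarchy can be sketched as a ranked lookup. Only the preference for approved HUGO nomenclature comes from the text; the remaining order in `SOURCE_PRIORITY`, the source labels, and the function name are hypothetical placeholders.

```python
# Hypothetical hierarchy: only "HUGO first" is stated in the text.
SOURCE_PRIORITY = ["HUGO", "GDB", "UniGene", "RHdb", "GenBank"]


def pick_name(ids, priority=SOURCE_PRIORITY):
    """Choose a display name from (source, name) pairs.

    The name from the highest-ranked source wins; ties within a source are
    broken alphabetically. Returns None if no source is recognized.
    """
    rank = {source: i for i, source in enumerate(priority)}
    candidates = [(rank[source], name) for source, name in ids if source in rank]
    if not candidates:
        return None
    return min(candidates)[1]
```

A bundle name could be chosen the same way by pooling the (source, name) pairs of all elements in the bundle before calling the function.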

For a more detailed description of eGenome, refer to our manuscript.


  References

1. Boguski MS, Schuler GD: ESTablishing a human transcript map. Nat Genet, 10: 369-371, 1995.
2. Cheung VG, Nowak N, Jang W, et al.: Integration of cytogenetic landmarks into the draft sequence of the human genome. Nature, 409: 953-958, 2001.
3. Dib C, Fauré S, Fizames C, et al.: A comprehensive genetic map of the human genome based on 5,264 microsatellites. Nature, 380: 152-154, 1996.
4. Hudson TJ, Stein LD, Gerety SS, et al.: An STS-based map of the human genome. Science, 270: 1945-1954, 1995.
5. Kent WJ, Haussler D: Assembly of the working draft of the human genome with GigAssembler. Genome Res, 11: 1541-1548, 2001.
6. Kong X, Murphy K, Raj T, He C, White PS, Matise TC: A combined linkage-physical map of the human genome. Am J Hum Genet, 75: 1143-1148, 2004.
7. Lander ES, Linton LM, Birren B, et al.: Initial sequencing and analysis of the human genome. Nature, 409: 860-921, 2001.
8. Letovsky SI, Cottingham RW, Porter CJ, Li PWD: GDB: the Human Genome Database. Nucleic Acids Res, 26: 94-99, 1998.
9. Lijnzaad P, Helgesen C, Rodriguez-Tomé P: The Radiation Hybrid Database. Nucleic Acids Res, 26: 102-105, 1998.
10. McPherson JD, Marra M, Hillier L, et al.: A physical map of the human genome. Nature, 409: 934-941, 2001.
11. Schuler GD: Sequence mapping by electronic PCR. Genome Res, 7: 541-550, 1997.
12. Wang DG, Fan J-B, Siao C-J, et al.: Large-scale identification, mapping, and genotyping of single-nucleotide polymorphisms in the human genome. Science, 280: 1077-1082, 1998.
13. White JA, McAlpine PJ, Antonarakis S, et al.: Nomenclature. Genomics, 45: 468-471, 1997.
14. White PS, Sulman EP, Porter CJ, Matise TC: A comprehensive view of human chromosome 1. Genome Res, 9: 978-988, 1999.
15. Wirth J, Nothwang HG, van der Maarel S, et al.: Systematic characterisation of disease associated balanced chromosome rearrangements by FISH: cytogenetically and genetically anchored YACs identify microdeletions and candidate regions for mental retardation genes. J Med Genet, 36: 271-278, 1999.

Except as otherwise indicated, Copyright 2005, The Children's Hospital of Philadelphia