|
The central integrating data sets used for eGenome are the public DNA sequence assembly, UniSTS, the Genome Database (GDB), the Radiation Hybrid Database (RHdb), MAP-O-MAT, dbSNP, and UniGene. Raw data from the most recent public DNA sequence assembly (Lander et al., 2001; Kent and Haussler, 2001), RHdb version 16 (Lijnzaad et al., 1998), and a recent UniGene build (Boguski and Schuler, 1995) were imported into CompDB, a customized relational database (White et al., 1999). CompDB is a versatile instrument that functions variously as a data repository, a data parser and organizer, an analytical tool, and a server front end (White et al., 1999).
|
|
Integration with the draft human genomic sequence was accomplished by determining the sequence positions of RH, STS (from UniSTS and GDB), genetic linkage (GL), and SNP elements. Primer sequences corresponding to all RH and most GL elements in CompDB were used to define each element, and the e-PCR algorithm was used to search for the sequence positions of these primers. Elements with perfect matches for both primer sequences, the proper orientation, and the correct sequence distance between primers (e-PCR parameter N00M1000W07) were considered to have sequence localizations. In some instances more than one match was found, indicating that the primer sequences were not unique; in these cases, all matches were reported. The sequence used for this determination was a recent NCBI sequence assembly distributed by UCSC (see Overview for the current version being used). For SNPs, the sequence positions already identified by dbSNP were used instead. Sequence positions are listed in base pairs from the p terminus of the identified chromosome.
|
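The primer-pair criteria described above (exact match of both primers, proper orientation, and an acceptable product size) can be illustrated with a minimal sketch. This is not the e-PCR program itself; the function names, the single-strand scan, and the reading of `M1000` as a product-size margin of 1000 bp are assumptions made for illustration.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_sts_hits(sequence, forward, reverse, expected_size, margin=1000):
    """Return 1-based, inclusive (start, end) positions of exact primer-pair hits.

    A hit requires the forward primer followed downstream by the reverse
    complement of the reverse primer, with a product size within `margin`
    bp of `expected_size`. Only the given strand is scanned (illustrative).
    """
    rev_rc = reverse_complement(reverse)
    hits = []
    start = sequence.find(forward)
    while start != -1:
        end_primer = sequence.find(rev_rc, start + len(forward))
        while end_primer != -1:
            product = end_primer + len(rev_rc) - start
            if abs(product - expected_size) <= margin:
                hits.append((start + 1, end_primer + len(rev_rc)))
            end_primer = sequence.find(rev_rc, end_primer + 1)
        start = sequence.find(forward, start + 1)
    # All hits are returned, so non-unique primer pairs yield several entries.
    return hits
```

Returning every hit rather than the first one mirrors the policy of reporting all matches when a primer pair is not unique.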
|
All RH markers used to score the GB4 RH panel were analyzed for primer sequence identity and assembled into a unique marker subset. Where a marker had more than one RH vector, the most informative vector was selected to represent the marker. The redundancy level in RHdb (22%) was determined in this manner. The non-redundant whole-genome marker set was grouped into chromosome-specific sets by linkage grouping, with markers assigned to a group only if they linked specifically to many other markers in the group with a high LOD score. Scoring data for marker subsets were then analyzed sequentially with MultiMap. For each chromosome, a small set (20-70) of Généthon microsatellite markers was used as an initial framework map. The order of markers on this map was supported with odds greater than 1000:1 and was in complete agreement with the genetic linkage-determined order (Dib et al., 1996). Markers were then analyzed against the initial framework using MultiMap to determine whether they could be added at a unique position with statistical support. An RH framework was first constructed by adding markers with an odds threshold of >10,000:1, and then with odds of >1000:1, in an iterative process that preferentially added polymorphic markers. Each marker on the resulting framework was then individually removed from the framework and re-mapped. Markers not placed at the same unique position with >1000:1 odds were removed from the framework. Final frameworks generally averaged about 1 marker per megabase. |
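The first step above, collapsing markers with identical primer pairs and keeping one representative RH vector, might look like the following sketch. The record layout is hypothetical, and two details are assumptions: "most informative" is taken to mean the vector with the fewest ambiguous scores (coded `2` here), and redundancy is computed as the fraction of records removed.

```python
from collections import defaultdict


def collapse_redundant_markers(markers):
    """Group markers by primer pair and keep one vector per group.

    markers: list of dicts with keys 'name', 'primers' (forward, reverse)
    and 'vector' (string of 0/1/2 scores, where '2' means ambiguous).
    Returns (unique_markers, redundancy_fraction).
    """
    groups = defaultdict(list)
    for marker in markers:
        groups[tuple(marker["primers"])].append(marker)
    # Assumption: the most informative vector has the fewest '2' scores.
    unique = [min(group, key=lambda m: m["vector"].count("2"))
              for group in groups.values()]
    redundancy = 1 - len(unique) / len(markers)
    return unique, redundancy
```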
|
We then calculated the 1000:1 likelihood intervals of all GB4 RH elements remaining in a linkage group relative to the framework for each chromosome. Generally, over 90% of all RH elements were placed in intervals relative to the RH framework, and approximately 70% of markers had absolute and unique matches to the draft genome sequence. The calculated location of each intervaled marker necessarily spanned three or more RH framework positions, as markers localized between two adjacent framework markers with >1,000:1 likelihood would have been added directly to the framework. Elements with only G3 RH data were placed by DNA sequence positioning alone, not by RH mapping. |
|
Next, we integrated our RH tier, which is largely composed of elements representing known genes and ESTs, with UniGene EST clusters (Boguski and Schuler, 1995). We compared the DNA sequence IDs associated with each RH element and EST cluster. EST clusters containing an EST sequence identical to one assigned to a mapped RH element were associated with that element. Establishing links between the RH and transcript tiers simultaneously creates a relationship between the physically based RH placements and the functional data associated with each EST cluster. Gene symbols and gene descriptions for UniGene clusters were also assigned to their corresponding eGenome elements. |
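The association of RH elements with EST clusters through shared sequence IDs can be sketched as an inverted-index lookup. The dictionary layout and all identifiers below are hypothetical; only the matching rule (an identical sequence ID in both records) comes from the description above.

```python
from collections import defaultdict


def link_elements_to_clusters(element_accessions, cluster_accessions):
    """Map each RH element to the EST clusters sharing a sequence ID with it.

    element_accessions: dict element name -> iterable of sequence IDs
    cluster_accessions: dict cluster ID -> iterable of sequence IDs
    Returns dict element name -> sorted list of cluster IDs.
    """
    # Build an index from sequence ID to the clusters that contain it.
    accession_to_clusters = defaultdict(set)
    for cluster_id, accessions in cluster_accessions.items():
        for accession in accessions:
            accession_to_clusters[accession].add(cluster_id)

    links = {}
    for element, accessions in element_accessions.items():
        hits = set()
        for accession in accessions:
            hits |= accession_to_clusters.get(accession, set())
        links[element] = sorted(hits)
    return links
```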
| The genetic linkage tier is the Rutgers Combined Linkage-Physical Map of the Human Genome (Kong et al., 2004). This map integrates polymorphic marker genotyping data with DNA sequence data. The meiotic data consist of 14,759 markers genotyped in the CEPH and/or deCODE pedigrees and include simple sequence repeats, RFLP/VNTRs, and SNPs. We believe this set represents the largest combined collection of genotype data. A unique physical position was identified from genome assembly Build 35 for 94% of PCR-based markers. The markers were carefully checked for Mendelian inconsistencies. Chromosome assignments were initially determined by physical data, when available, and otherwise by assignment on previously published linkage maps. Two-point linkage was used to confirm these assignments. Sequence-based positions were used to determine an initial map, and linkage analysis using the meiotic data was used to either confirm or reject the initial positions. Post-mapping error checks were performed, and those markers whose sequence-based positions were not supported by the meiotic data were then tested for addition to the map using the meiotic data only. The final map contains 11,990 markers whose physical map positions are corroborated by recombination-based mapping data, representing 95% of the markers for which sequence positions were available. The remaining markers (those for which sequence-based positions could not be determined) were added to the map using meiotic data. The average distance between markers is 246 kb, or 0.3 cM. |
|
SNPs were integrated as separate elements by importing all reference SNPs from dbSNP. This SNP set was localized to the genome by using the dbSNP-reported sequence positions within the same draft genomic assembly that eGenome currently utilizes. All SNP-related information reported in eGenome, including proximity to genic structures, was generated by dbSNP, with the exception of GL positions generated by eGenome for the small subset of SNPs used for GL mapping above. Note that each dbSNP reference SNP is a collection of one or more individually reported SNPs (ssSNPs) that have been computationally determined to represent the same variation. In eGenome, we report and provide external links to data for many of the ssSNPs, and these externally reported data are not necessarily completely representative of the respective reference SNP displayed in eGenome.
|
|
Adequate integration with cytogenetic coordinates is often neglected in genomic maps. We collected and utilized marker-based cytogenetic localizations from three sources: the Genome Database, the BAC Resource Consortium, and MPIMG (Letovsky et al., 1998; Wirth et al., 1999; Cheung et al., 2001). These assignments were matched to their associated markers. This cytogenetic set was then edited to exclude spurious assignments by comparing the cytogenetic assignment of each marker to its RH- and sequence-determined positions. Markers were excluded if they spanned more than one cytogenetic band, if their positions lay outside a confidence interval calculated for each band, or if their cytogenetic position did not agree with the sequence- and RH-derived positions (manuscript in preparation). This created a reduced set of approximately 3,000 cytogenetic landmarks that was sorted into chromosome-specific sets. Each chromosome set comprised a cytogenetic framework that could be ordered using the corresponding RH position of each marker. Using the band assignments for each of these cytogenetic framework markers, we then assigned band positions, either a single band or a range of bands, to each RH framework marker position. Once the RH framework markers were assigned band coordinates, cytolocations for the remaining intervaled RH and polymorphic markers could be inferred. Cytolocations for SNPs were determined from their genomic sequence position relative to the flanking genomic sequence positions determined for adjacent RH framework markers. In this way we inferred cytolocations for all elements in eGenome, with the majority of elements assigned to a single cytogenetic band. In addition, we report all cytogenetic localizations determined by any of the 9 groups contributing cytogenetic data for each element, if known. |
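One way to picture the band-assignment step is to walk an ordered RH framework and give every marker the bands of its nearest cytogenetic landmarks on either side. The sketch below is an illustration under assumptions, not the procedure actually used: the function name, the data layout, and the rule of falling back to the single available flank at chromosome ends are all hypothetical.

```python
def infer_band_ranges(framework, landmark_bands):
    """Assign a (first_band, last_band) range to every framework marker.

    framework: marker names in RH order.
    landmark_bands: dict marker name -> band, for cytogenetic landmarks only.
    A marker flanked by landmarks in the same band gets a single band
    (both tuple entries equal); otherwise it gets the flanking range.
    """
    n = len(framework)
    previous, following = [None] * n, [None] * n

    last = None
    for i, marker in enumerate(framework):          # nearest landmark at or before
        last = landmark_bands.get(marker, last)
        previous[i] = last

    last = None
    for i in range(n - 1, -1, -1):                  # nearest landmark at or after
        last = landmark_bands.get(framework[i], last)
        following[i] = last

    ranges = {}
    for i, marker in enumerate(framework):
        # Assumption: at chromosome ends, reuse the only available flank.
        ranges[marker] = (previous[i] or following[i], following[i] or previous[i])
    return ranges
```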
|
Integration with the physical map was achieved by identifying the large-insert clones containing each RH and polymorphic marker. For each element with an e-PCR-determined sequence position in the UCSC sequence assembly, the large-insert clone that was used to determine that portion of the sequence was identified. This was performed by first determining the beginning and end position of each clone sequence comprising a sequence contig, then determining where each sequence contig began and ended relative to the entire chromosome assembly, and finally determining which clone sequence(s) included the position for a specific element. This information was obtained from UCSC (Lander et al., 2001; McPherson et al., 2001). Additional information linking each BAC or PAC clone sequence to the clone's physical characteristics was added from the NCBI Clone Registry. We also included any large-insert clones that had been used for cytogenetic band localization by the data sets used for cytogenetic map integration. These clone integrations proceeded either by cross-referencing the DNA sequence positions of the clones with those of eGenome elements, or by directly linking markers known to be contained within the clones with the corresponding eGenome elements for the markers. In addition, YAC clones that had been identified by the WICGR in a whole-genome search using a subset of the RHdb markers were added (Hudson et al., 1995). YACs were integrated by comparing the primer sequences used by WICGR to identify each YAC with marker primers in our database. |
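The three-step coordinate lookup described above (clone extent within a contig, contig offset within the chromosome, then a containment test) can be sketched as follows. The data layout and names are hypothetical, and coordinates are assumed to be 1-based and inclusive.

```python
def clones_at_position(position, contigs):
    """Return names of clones whose sequence covers a chromosome position.

    contigs: list of dicts with keys
        'chrom_start' - 1-based chromosome coordinate of the contig's first base
        'clones'      - list of (name, start, end), 1-based within the contig
    """
    hits = []
    for contig in contigs:
        offset = contig["chrom_start"] - 1  # contig -> chromosome coordinates
        for name, start, end in contig["clones"]:
            if offset + start <= position <= offset + end:
                hits.append(name)
    return hits
```

Because adjacent clone sequences overlap in an assembly, a single element position can legitimately return more than one clone.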
|
So far, we had assigned positions to each unique element, with each of these elements localized by a sequence position, a specific framework position, or by a range of framework positions. Next, we defined bundles, which are groups of markers representing higher-order, usually transcribed, genomic structures. Each element in CompDB was annotated with a series of database identifiers (IDs) from the original RHdb records of all markers comprising the amplimer. Also recorded in each element record were UniGene cluster assignments and associated NCBI labels, corresponding gene names or gene symbols, and all GenBank accession IDs for transcript, EST, and STS sequences assigned to the UniGene cluster. A cumulative list of IDs was then compiled from all elements in CompDB. Elements sharing IDs (essentially sharing an identical name, DNA sequence fragment, or EST cluster) were grouped into bundles which presumably represented transcripts or other functional genomic elements. Our bundling procedure therefore identifies inter-element relatedness. |
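Grouping elements that share any identifier is a connected-components problem, and a union-find structure is one common way to solve it. The sketch below shows that general technique with hypothetical element and ID names; it is not a description of how CompDB implements bundling.

```python
def bundle_elements(element_ids):
    """Group elements that share at least one ID, directly or transitively.

    element_ids: dict element name -> set of IDs (names, accessions, clusters).
    Returns a sorted list of bundles, each a sorted list of element names.
    """
    parent = {element: element for element in element_ids}

    def find(x):
        # Follow parent links to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_owner = {}  # ID -> first element seen carrying it
    for element, ids in element_ids.items():
        for identifier in ids:
            if identifier in first_owner:
                parent[find(element)] = find(first_owner[identifier])
            else:
                first_owner[identifier] = element

    bundles = {}
    for element in element_ids:
        bundles.setdefault(find(element), []).append(element)
    return sorted(sorted(members) for members in bundles.values())
```

Note that sharing is transitive here: two elements with no ID in common still fall into one bundle if a third element links them.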
|
Many markers are associated with multiple names, and the reverse is also very common. Sorting through redundant and often incompletely cross-referenced nomenclature for a given locus can be a difficult and tedious undertaking. When importing element IDs into CompDB, we were careful to minimize the number of ID sources so as to maintain integrity in data representation. Thus, ID-mediated data links were built from the original RHdb records, and links were established only through identical matches between both primer pair sequences for a given marker, or between shared GenBank accession numbers. We found that by implementing this high level of stringency for relating data records, we could prevent import of most incorrect data associations without significantly decreasing the overall number of associations. To select suitable element names, we devised an algorithm that sorts through all IDs associated with each marker comprising an element and selects the most appropriate name according to a predetermined name source hierarchy. This hierarchy uses approved HUGO nomenclature whenever possible (White et al., 1997). Bundles were named in a similar manner by selecting from the pool of names for all elements comprising each bundle.
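Name selection by source hierarchy can be illustrated with a simple ranked lookup. Only the preference for approved HUGO nomenclature comes from the text; the remaining order in `NAME_SOURCE_PRIORITY`, the function name, and the alphabetical tie-break are assumptions.

```python
# Hypothetical priority order; only "HUGO first" is stated in the text.
NAME_SOURCE_PRIORITY = ["HUGO", "UniGene", "GDB", "RHdb"]


def select_name(candidates, priority=NAME_SOURCE_PRIORITY):
    """Pick the name from the highest-priority source.

    candidates: list of (source, name) pairs pooled from an element's markers
    (or, for a bundle, from all of its elements).
    Returns the chosen name, or None if no candidate has a known source.
    """
    rank = {source: i for i, source in enumerate(priority)}
    ranked = [(rank[source], name) for source, name in candidates if source in rank]
    # Ties within one source are broken alphabetically (an assumption).
    return min(ranked)[1] if ranked else None
```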
For a more detailed description of eGenome, refer to our manuscript. |
| 1. | Boguski MS, Schuler GD: ESTablishing a human transcript map. Nat Genet, 10: 369-371, 1995. |
| 2. | Cheung VG, Nowak N, Jang W, et al.: Integration of cytogenetic landmarks into the draft sequence of the human genome. Nature, 409: 953-958, 2001. |
| 3. | Dib C, Fauré S, Fizames C, et al.: A comprehensive genetic map of the human genome based on 5,264 microsatellites. Nature, 380: 152-154, 1996. |
| 4. | Hudson TJ, Stein LD, Gerety SS, et al.: An STS-based map of the human genome. Science, 270: 1945-1954, 1995. |
| 5. | Kent WJ, Haussler D: Assembly of the working draft of the human genome with GigAssembler. Genome Res, 11: 1541-1548, 2001. |
| 6. | Kong X, Murphy K, Raj T, He C, White PS, Matise TC: A combined linkage-physical map of the human genome. Am J Hum Genet, 75: 1143-1148, 2004. |
| 7. | Lander ES, Linton LM, Birren B, et al.: Initial sequencing and analysis of the human genome. Nature, 409: 860-921, 2001. |
| 8. | Letovsky SI, Cottingham RW, Porter CJ, Li PWD: GDB: the Human Genome Database. Nucleic Acids Res, 26: 94-99, 1998. |
| 9. | Lijnzaad P, Helgesen C, Rodriguez-Tomé P: The Radiation Hybrid Database. Nucleic Acids Res, 26: 102-105, 1998. |
| 10. | McPherson JD, Marra M, Hillier L, et al.: A physical map of the human genome. Nature, 409: 934-941, 2001. |
| 11. | Schuler GD: Sequence mapping by electronic PCR. Genome Res, 7: 541-550, 1997. |
| 12. | Wang DG, Fan J-B, Siao C-J, et al.: Large-scale identification, mapping, and genotyping of single-nucleotide polymorphisms in the human genome. Science, 280: 1077-1082, 1998. |
| 13. | White JA, McAlpine PJ, Antonarakis S, et al.: Nomenclature. Genomics, 45: 468-471, 1997. |
| 14. | White PS, Sulman EP, Porter CJ, Matise TC: A comprehensive view of human chromosome 1. Genome Res, 9: 978-988, 1999. |
| 15. | Wirth J, Nothwang HG, van der Maarel S, et al.: Systematic characterisation of disease associated balanced chromosome rearrangements by FISH: cytogenetically and genetically anchored YACs identify microdeletions and candidate regions for mental retardation genes. J Med Genet, 36: 271-278, 1999. |
|