ReadMe file for eGenome v1.00, 01-22-02

eGenome is a procedure that creates high-resolution, high-confidence "views" of human chromosomes. It combines genetic linkage, cytogenetic, structural, and expression-based data on a single platform. Version 1.00 covers the entire human genome. Details regarding the methodology can be found at: http://genome.chop.edu/access/pages/about/methods.html.

Data for version 1.00 is from RHdb version 16; UniGene build 146; UCSC sequence assembly 04-01; WICGR human map release 12 and SNP release 1; CEPHdb v9; UWDMB, NCI-CCAP, CHORI, and CSMC large-insert clone data sets as of 8-21-01; Genome Database, UWDMB, NCI-CCAP, RPCI, CSMC, and MPIMG cytogenetic data sets as of 11-9-00; and the 11-06-01 NCBI Clone Registry release.

Version 1.00 consists of eight file types, named elements, element_to_alias, element_to_bundle, element_to_clone, element_to_cyto, element_to_genetic, element_to_rh, and element_to_sequence. Each file type contains a subset of the entire eGenome data. The data sets are organized around the common set of genomic elements. For each file type, the first two columns list the eGenome genomic element internal identifier (the CVE ID, consisting of the prefix CVE followed by the element number) and the primary name chosen for the element (such as a D number or a gene symbol). The contents and format of each file type are described in detail below. Note that genomic element entries are not always unique within a single file; if an element has multiple entries for any column, each entry is listed on its own line, with the element ID and name repeated. For example, a marker with two cytogenetic localizations will have two adjacent lines listing the marker ID and name, but with different cytogenetic positions.
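Because an element can occupy several adjacent lines, a reader of any of the eight file types will usually want to collect lines by CVE ID first. The following is a minimal sketch of that step, not part of the eGenome distribution. The function name and the sample values are made up for illustration, and the sketch assumes the files carry no header line.

```python
import io
from collections import defaultdict


def group_rows_by_element(handle):
    """Group tab-delimited lines by CVE ID (column 1).

    Returns a dict mapping each CVE ID to the list of its rows,
    where each row is the list of column values on one line.
    """
    grouped = defaultdict(list)
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:  # skip blank lines
            continue
        row = line.split("\t")
        grouped[row[0]].append(row)
    return dict(grouped)


# Made-up sample: one marker with two cytogenetic localizations.
sample = io.StringIO(
    "CVE1\tD1S228\t1p36.1\n"
    "CVE1\tD1S228\t1p36.2\n"
    "CVE2\tGENE2\t7q31\n"
)
rows = group_rows_by_element(sample)
print(len(rows["CVE1"]))
```

Splitting on the tab character directly, rather than using a CSV parser, avoids any special handling of quote characters that may appear in description text.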

The eight file types have been generated on a whole genome basis, and these files can be found in the subdirectory allgenome/. In addition, a set of files has been generated for each specific chromosome and can be found in chrN/ (e.g. chr1/). Within the allgenome and specific chromosome directories are subdirectories for each of the eight file types. Within the file type subdirectories are .zip, .hqx, and .tgz compressed versions of the particular file type. For example, the PC, Macintosh, and Unix-compatible versions of the chromosome 7 element_to_genetic file can be found in chr7/element_to_genetic/ as element_to_genetic.zip, element_to_genetic.hqx, and element_to_genetic.tgz, respectively. Each file is tab-delimited. Each subdirectory also contains a text version of the file as .txt.
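The directory layout described above can be turned into a relative path with a few lines of code. This is only a sketch: the helper name is hypothetical, and it assumes the text version is named after the file type with a .txt extension (e.g. element_to_genetic.txt), by analogy with the compressed file names.

```python
from pathlib import PurePosixPath

# The eight file types of version 1.00.
FILE_TYPES = (
    "elements",
    "element_to_alias",
    "element_to_bundle",
    "element_to_clone",
    "element_to_cyto",
    "element_to_genetic",
    "element_to_rh",
    "element_to_sequence",
)


def text_file_path(file_type, chromosome=None):
    """Return the relative path of the .txt version of a file type.

    chromosome=None selects the whole-genome set in allgenome/;
    otherwise the chrN/ directory for that chromosome is used.
    """
    if file_type not in FILE_TYPES:
        raise ValueError(f"unknown eGenome file type: {file_type!r}")
    top = "allgenome" if chromosome is None else f"chr{chromosome}"
    # Assumption: the text file is named <file_type>.txt.
    return PurePosixPath(top) / file_type / f"{file_type}.txt"


print(text_file_path("element_to_genetic", 7))
```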

Description of files

elements
Tab-delimited file. Basic data associated with each genomic element. Column 1 (CVEID) is the eGenome genomic element internal identifier (the CVE ID, consisting of the prefix CVE followed by the element number). Column 2 (Primary_name) is the primary name chosen for the element (such as a D number or a gene symbol), selected by the eGenome naming algorithm. Column 3 (Description) is the descriptive line of text associated with elements that are known to be transcribed; the source is UniGene. Column 4 (Element_type) is the type of element: an RH marker, an RH framework marker, or a polymorphism. Column 5 (Expression_status) is the expression status of the element: transcribed, not transcribed, or unknown. Column 6 (EST_cluster) is the UniGene EST cluster ID (Hs.#####) to which the element has been assigned, if any. Column 7 (SNP) lists one or more single nucleotide polymorphisms associated with the element. Columns 8 and 9 (Primer1 and Primer2) list the forward and reverse primer sequences, respectively, used to PCR-amplify the element.
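The nine columns of the elements file map naturally onto a named record. The sketch below is one possible way to read a line. The class and function names are hypothetical, the sample values are invented, and the padding of short lines is an assumption about how empty trailing columns might appear.

```python
from typing import NamedTuple


class Element(NamedTuple):
    """One line of the elements file, in column order."""
    cveid: str
    primary_name: str
    description: str
    element_type: str
    expression_status: str
    est_cluster: str
    snp: str
    primer1: str
    primer2: str


def parse_element_line(line):
    """Split one tab-delimited line into an Element record."""
    fields = line.rstrip("\r\n").split("\t")
    width = len(Element._fields)
    # Assumption: a line may be short if trailing columns are empty.
    fields += [""] * (width - len(fields))
    return Element(*fields[:width])


# Made-up example line; not real eGenome data.
line = "CVE123\tD1S228\t\tRH marker\tunknown\t\t\tACGTACGT\tTTGCAAGC\n"
element = parse_element_line(line)
print(element.primary_name, element.element_type)
```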

element_to_alias
Tab-delimited file. List of all aliases and external identifiers collected for each element. Column 1 (CVEID) is the eGenome genomic element internal identifier (the CVE ID, consisting of the prefix CVE followed by the element number). Column 2 (Primary_name) is the primary name chosen for the element (such as a D number or a gene symbol), selected by the eGenome naming algorithm. Column 3 (Alias) lists an external identifier representing the element, in the form Datasource:ID (e.g. GDB:D1S228). Note that each identifier for an element is entered as a separate line in this table.
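An alias in the Datasource:ID form can be separated into its two parts as sketched below. The function name is hypothetical, and splitting at the first colon is an assumption, made so that any colon inside the identifier itself would be preserved.

```python
def split_alias(alias):
    """Split 'Datasource:ID' into (datasource, identifier).

    Only the first colon is treated as the separator.
    """
    source, separator, identifier = alias.partition(":")
    if not separator:
        raise ValueError(f"alias is not in Datasource:ID form: {alias!r}")
    return source, identifier


print(split_alias("GDB:D1S228"))
```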

element_to_bundle
Tab-delimited file. Description of the bundle assignments and RH positions of each element grouped into a bundle. Column 1 (CVEID) is the eGenome genomic element internal identifier (the CVE ID, consisting of the prefix CVE followed by the element number). Column 2 (Primary_name) is the primary name chosen for the element (such as a D number or a gene symbol), selected by the eGenome naming algorithm. Column 3 (BundleID) is the eGenome bundle internal identifier (the CVB ID, consisting of the prefix CVB followed by the bundle number). Column 4 (Bundle_name) is the primary name chosen for the bundle, selected by the eGenome naming algorithm. Columns 5 and 6 (Cvmax_left and Cvmax_right) specify the maximum position or interval span of bundles, recorded as left and right RH framework positions, and listed by framework marker CVE IDs. Columns 7 and 8 (Max_left and Max_right) specify the maximum position or interval span of bundles, recorded as left and right RH framework positions, and listed by framework marker primary names. Columns 9 and 10 (Max_cR_left and Max_cR_right) specify the maximum position or interval span of bundles, recorded as left and right RH framework positions, and listed by framework marker centiRay positions. Columns 11 and 12 (Max_cytolocation_left and Max_cytolocation_right) specify the maximum position or interval span of bundles, recorded as left and right RH framework positions, and listed by the cytogenetic positions of the framework markers. Columns 13 and 14 (Cvmin_left and Cvmin_right) specify the minimum overlapping interval shared by all markers within a bundle, recorded as left and right RH framework positions, and listed by framework marker CVE IDs. Columns 15 and 16 (Min_left and Min_right) specify the minimum overlapping interval shared by all markers within a bundle, recorded as left and right RH framework positions, and listed by framework marker primary names. Columns 17 and 18 (Min_cR_left and Min_cR_right) specify the minimum overlapping interval shared by all markers within a bundle, recorded as left and right RH framework positions, and listed by framework marker centiRay positions. Columns 19 and 20 (Min_cytolocation_left and Min_cytolocation_right) specify the minimum overlapping interval shared by all markers within a bundle, recorded as left and right RH framework positions, and listed by the cytogenetic positions of the framework markers. Note that some bundles have no single overlapping position and are shown only with their maximum positions.
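The maximum and minimum centiRay intervals of a bundle (columns 9 and 10, and 17 and 18) can be read as sketched below. This is an illustration under stated assumptions: the function names are hypothetical, centiRay values are assumed to be plain decimal numbers, and a bundle without a single overlapping position is assumed to have empty minimum columns. The sample row is invented.

```python
def parse_cr(value):
    """Convert a centiRay field to float, or None if it is empty."""
    value = value.strip()
    return float(value) if value else None


def bundle_cr_intervals(fields):
    """Return (max_interval, min_interval) in centiRays for one line.

    fields holds the 20 column values of an element_to_bundle line.
    min_interval is None when the minimum columns are empty.
    """
    max_interval = (parse_cr(fields[8]), parse_cr(fields[9]))      # columns 9, 10
    min_left, min_right = parse_cr(fields[16]), parse_cr(fields[17])  # columns 17, 18
    if min_left is None or min_right is None:
        return max_interval, None
    return max_interval, (min_left, min_right)


# Made-up row with all 20 columns filled in.
row = [
    "CVE10", "D7S1", "CVB3", "BUNDLE3",
    "CVE5", "CVE9", "FW_A", "FW_B", "12.5", "40.0", "7p22", "7p21",
    "CVE6", "CVE8", "FW_C", "FW_D", "20.0", "31.5", "7p22", "7p21",
]
print(bundle_cr_intervals(row))
```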

element_to_clone
Tab-delimited file. Listing of all large-insert clones associated with each element. Note that this file describes many-to-many relationships between clones and elements. Many of the elements listed have multiple clones associated with them. Therefore, each clone, clone type, clone source, clone sequence, and element sequence position assignment for an element is listed on a separate line with the identical element ID values in columns 1 and 2. For example, an element with 2 clones has two adjacent lines listing element "A" in the element ID column (column 1), the first with "clone 1" listed in column 3, and the second with "clone 2" in column 3. Column 1 (CVEID) is the eGenome genomic element internal identifier (the CVE ID, consisting of the prefix CVE followed by the element number). Column 2 (Primary_name) is the primary name chosen for the element (such as a D number or a gene symbol), selected by the eGenome naming algorithm. Column 3 (Clone_name) is the name of the large-insert clone(s) reported to contain this element. Column 4 (Clone_type) lists the type of large-insert clone, such as a BAC, PAC, or YAC. Column 5 (Clone_source) lists the primary laboratory or group from which the clone/element assignment was derived. Column 6 (Clone_sequence) lists the GenBank sequence accession numbers for those clones whose DNA sequences have been determined. Columns 7 and 8 (Sequence_position_in_clone_left and Sequence_position_in_clone_right) list the left and right base pair positions that the element matches in the clone sequence.
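Since the file describes many-to-many relationships, it is often convenient to index it in both directions. The following sketch builds the two lookups from rows that have already been split into columns. The function name and sample rows are hypothetical.

```python
from collections import defaultdict


def index_clones(rows):
    """Build element-to-clones and clone-to-elements lookups.

    rows is an iterable of column lists from element_to_clone,
    with the CVE ID in column 1 and the clone name in column 3.
    """
    clones_by_element = defaultdict(set)
    elements_by_clone = defaultdict(set)
    for row in rows:
        cveid, clone_name = row[0], row[2]
        if not clone_name:  # no clone recorded on this line
            continue
        clones_by_element[cveid].add(clone_name)
        elements_by_clone[clone_name].add(cveid)
    return dict(clones_by_element), dict(elements_by_clone)


# Made-up rows: element A has two clones; clone 1 holds two elements.
rows = [
    ["CVE1", "A", "clone 1", "BAC"],
    ["CVE1", "A", "clone 2", "PAC"],
    ["CVE2", "B", "clone 1", "BAC"],
]
by_element, by_clone = index_clones(rows)
print(sorted(by_element["CVE1"]), sorted(by_clone["clone 1"]))
```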

element_to_cyto
Tab-delimited file. Listing of all eGenome-determined and externally assigned cytogenetic localizations for elements in eGenome. Note that an element can have more than one external cytogenetic assignment. Therefore, each external source, clone, and cytogenetic position assignment for an element is listed on a separate line with identical element ID values in columns 1 and 2. For example, an element with two external cytogenetic positions has two adjacent lines listing element "A" in the element ID column (column 1), the first with "left position 1" listed in column 7, and the second with "left position 2" in column 7. Column 1 (CVEID) is the eGenome genomic element internal identifier (the CVE ID, consisting of the prefix CVE followed by the element number). Column 2 (Primary_name) is the primary name chosen for the element (such as a D number or a gene symbol), selected by the eGenome naming algorithm. Columns 3 and 4 (Cytolocation_left and Cytolocation_right) list the cytogenetic band or band range determined for each element by the eGenome cytogenetic band assignment algorithm. Column 5 (Other_cytolocation_source) lists the primary laboratory or group(s) from which an external cytogenetic/element assignment was derived. Column 6 (Other_cytolocation_clone) lists the large-insert clone(s) used for an external cytogenetic assignment. Columns 7 and 8 (Other_cytolocation_left and Other_cytolocation_right) list the cytogenetic band or band range(s) determined for each element from an external cytogenetic assignment.

element_to_genetic
Tab-delimited file. Describes the genetic linkage map positions of all eGenome polymorphic elements. Column 1 (CVEID) is the eGenome genomic element internal identifier (the CVE ID, consisting of the prefix CVE followed by the element number). Column 2 (Primary_name) is the primary name chosen for the element (such as a D number or a gene symbol), selected by the eGenome naming algorithm. Column 3 (Chromosome) lists the chromosome that eGenome has assigned the element to by linkage grouping. Columns 4 and 5 (Rh_cvposition_left and Rh_cvposition_right) specify the RH position or interval span of an element, recorded as left and right RH framework positions, and listed by framework marker CVE IDs. Columns 6 and 7 (Rh_position_left and Rh_position_right) specify the RH position or interval span of an element, recorded as left and right RH framework positions, and listed by framework marker primary names. Columns 8 and 9 (cR_position_left and cR_position_right) specify the RH position or interval span of an element, recorded as left and right RH framework positions, and listed by framework marker centiRay positions. Columns 10 and 11 (GL_cvposition_left and GL_cvposition_right) specify the genetic linkage position or interval span of an element, recorded as left and right genetic linkage framework positions, and listed by genetic linkage framework marker CVE IDs. Columns 12 and 13 (GL_position_left and GL_position_right) specify the genetic linkage position or interval span of an element, recorded as left and right genetic linkage framework positions, and listed by genetic linkage framework marker primary names. Columns 14 and 15 (cM_position_left and cM_position_right) specify the genetic linkage position or interval span of an element, recorded as left and right genetic linkage framework positions, and listed by framework marker centiMorgan positions.

element_to_rh
Tab-delimited file. Describes the radiation hybrid map positions and associated RH data of eGenome RH elements. Note that for chromosomes with complete sequences, the RH positions of only those elements that could not be identified in the genomic sequence have been calculated, and the element_to_rh files for these chromosomes will list only these elements. Column 1 (CVEID) is the eGenome genomic element internal identifier (the CVE ID, consisting of the prefix CVE followed by the element number). Column 2 (Primary_name) is the primary name chosen for the element (such as a D number or a gene symbol), selected by the eGenome naming algorithm. Column 3 (Rh_panel) is the radiation hybrid panel that the RH data for the element was generated from. Column 4 (RHdb_ID) is the Radiation Hybrid database record identifier number for the element RH typing. Column 5 (Rh_vector) is the RH typing data set (vector) used by eGenome for RH mapping. Column 6 (Chromosome) lists the chromosome that eGenome has assigned the element to by linkage grouping. Columns 7 and 8 (Rh_cvposition_left and Rh_cvposition_right) specify the RH position or interval span of an element, recorded as left and right RH framework positions, and listed by framework marker CVE IDs. Columns 9 and 10 (Rh_position_left and Rh_position_right) specify the RH position or interval span of an element, recorded as left and right RH framework positions, and listed by framework marker primary names. Columns 11 and 12 (cR_position_left and cR_position_right) specify the RH position or interval span of an element, recorded as left and right RH framework positions, and listed by framework marker centiRay positions.

element_to_sequence
Tab-delimited file. Contains relationships between eGenome elements, the sequences from which they were derived, and their positions in the human genomic sequence assemblies. Column 1 (CVEID) is the eGenome genomic element internal identifier (the CVE ID, consisting of the prefix CVE followed by the element number). Column 2 (Primary_name) is the primary name chosen for the element (such as a D number or a gene symbol), selected by the eGenome naming algorithm. Column 3 (Source_sequence) lists the GenBank sequence accession number(s) from which the element was created. Column 4 (Chromosome) lists the chromosome to which the element has been assigned based upon a genomic sequence assignment. Columns 5 and 6 (Sequence_position_left and Sequence_position_right) list the left and right base pair positions that the element matches in the UCSC sequence assembly.
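The sequence positions in columns 5 and 6 make it possible to select the elements that fall within a region of a chromosome. The sketch below shows one way to do so. The function name is hypothetical, positions are assumed to be integer base pair coordinates, the chromosome value is compared exactly as it appears in column 4, and the sample rows are invented.

```python
def elements_in_window(rows, chromosome, start, end):
    """Return elements whose sequence span overlaps [start, end].

    rows is an iterable of column lists from element_to_sequence.
    Each hit is (cveid, primary_name, left, right) with left <= right.
    """
    hits = []
    for row in rows:
        cveid, name, _source, chrom, left, right = row[:6]
        if chrom != chromosome or not left or not right:
            continue  # other chromosome, or no assembly position
        low, high = sorted((int(left), int(right)))
        if low <= end and high >= start:
            hits.append((cveid, name, low, high))
    return hits


# Made-up rows; accession numbers and positions are placeholders.
rows = [
    ["CVE1", "A", "ACC0001", "7", "1000", "1200"],
    ["CVE2", "B", "ACC0002", "7", "5000", "5300"],
    ["CVE3", "C", "ACC0003", "1", "1100", "1150"],
]
print(elements_in_window(rows, "7", 900, 1100))
```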