http://www.ncbi.nlm.nih.gov/About/Doc/hs_genomeintro.html
An Introduction to NCBI's Genome Resource

Building a Genomic Information Infrastructure at NCBI

A major challenge of the Human Genome Project is to organize, analyze, and interpret the flood of data emerging from sequencing projects worldwide. NCBI's Web site strives to offer an integrated, one-stop genomic information resource for data that promises to provide new insights into human biology and new approaches for combating disease.

The Human Genome Project

The Human Genome Project is a publicly financed international research effort whose goal is to decipher the human genetic code and to provide this data freely and rapidly to the public. On June 26, 2000, members of the Human Genome Project announced that they had succeeded in sequencing a "working draft" of the human genome. An article published in the February 15, 2001 issue of the journal Nature outlines the strategies and methodologies employed by this group to generate the draft sequence. Sequencing of the human genome represents a scientific milestone, and the data is of immediate use in many important ways. To help researchers further understand and use the information encoded in this "human blueprint," the National Center for Biotechnology Information (NCBI) provides access to this data worldwide through its public Web site (http://www.ncbi.nlm.nih.gov).

The Big Picture: Integrating Vast Quantities of Disparate Data

Sequencing of the human genome signifies the beginning of an exciting new era of science.
As an international leader in the fields of computational biology and bioinformatics, the NCBI is playing an active and collaborative role in further deciphering the human genome. NCBI investigators have designed and developed, and also manage and operate, a number of unique and powerful public databases essential to the Project. For example, GenBank is the NIH sequence database maintained by the NCBI that stores the sequence data generated by the centers involved in the Human Genome Project. GenBank is one of three databases that make up the International Nucleotide Sequence Database Collaboration. NCBI's partners in this effort include the European Bioinformatics Institute in the United Kingdom and the National Institute of Genetics in Japan. All three institutions work together to make the sequence data generated by the Human Genome Project rapidly and freely accessible to scientific communities worldwide.

NCBI investigators are also developing and enhancing software tools that will enable gene discovery. These tools--also freely accessible to the public--are being used by NCBI to assemble, annotate, and analyze the human genomic sequence, as well as the genomic sequences of other model organisms. These sophisticated tools allow researchers to store, organize, analyze, and integrate vast quantities of diverse data, such as DNA and protein sequences, gene and chromosome maps, and protein structures. Information derived from these studies has allowed researchers to make new connections between seemingly disparate data and to shape more biologically meaningful views of this data.

Assembling the Human Genome

Anyone with a computer and an Internet connection can now explore the draft sequence of the human genome. A companion site has been designed to jumpstart an individual who wants to make use of this information but is not sure where or how to start.

NCBI has released its first assembled view of the human genomic sequence.
This assembly is based not only on the finished and draft sequences deposited in GenBank by the Human Genome sequencing centers, but also on sequences contributed to GenBank by individual scientists from around the world. Hence, this resource is truly an "international public sequencing effort." Assembling the sequences is an ongoing process that involves many different steps before the data may be merged into segments of contiguous DNA. NCBI continues to improve the genome assembly by incorporating new data, filling in existing gaps, and increasing overall accuracy.

Annotating the Human Genome

A team of NCBI scientists is also engaged in annotating, or labeling, the biologically important areas of the genome. Annotation permits researchers to analyze the data in a systematic, comprehensive, and consistent manner. There are two tasks involved in annotation. The first is the correct placement of known genes into the proper genomic context, and the second is the prediction of previously unknown genes based on the assembled genomic sequence.

Aligning Known Genes

In the first task, messenger RNAs (mRNAs) from the NCBI RefSeq collection--a non-redundant set of reference sequences, including genomic contigs, mRNAs of known genes, and proteins--are placed on the genome primarily by sequence alignment using tools developed at NCBI. Computer modeling is used to compensate for and overcome various problems associated with aligning the genomic and mRNA sequences.

With Map Viewer, one may visualize genes and genomic markers within the context of additional data.

The human genome is also being annotated with additional biological features. Examples include markers for sequence variation such as SNPs, or single nucleotide polymorphisms, and genomic position landmarks such as sequence tagged sites (STSs).
These features may be viewed using the NCBI Map Viewer, an online tool that allows you to view an organism's complete genome, as well as integrated maps for each chromosome and sequence data for a region of particular interest.

Predicting Novel Genes

The whole genomes of over 800 organisms can now be found on NCBI's Entrez Genomes Web site, representing both completely sequenced organisms and organisms for which sequencing is in progress.

Various computational approaches are also being used by NCBI investigators to accomplish the second task--predicting novel genes. Alignment with small snippets of expressed genes called Expressed Sequence Tags (ESTs) identifies new genes to be placed on the DNA sequence and also provides information on alternative gene splicing. Use of protein similarity analyses and gene prediction programs developed at NCBI identifies additional predicted genes.

Comparative genomics, or the study of similar genes in different species, is another powerful tool for predicting and identifying new information. The genomic sequence of the mouse will be particularly helpful in this regard, as mammals share many basic biological functions. Gene sequences in the mouse and human often code for similar proteins that carry out comparable biological functions. Comparing the genomic sequences from other model organisms, such as the rat, zebrafish, fruit fly, and yeast, will also facilitate gene annotation.

Guides to Inherited Diseases

The Online Mendelian Inheritance in Man database, or OMIM, is a catalog of inherited human disorders and their causal mutations, authored and edited by Dr. Victor A. McKusick and developed for the Web by NCBI. OMIM entries are often linked to a reference mRNA sequence from RefSeq, facilitating the alignment of an mRNA to a gene sequence on the working draft. From OMIM, one can link to NCBI's Genes and Disease Web page, a site designed to introduce users to the relationship between genetic factors and human disease.
Genes and Disease provides information on more than 70 genetic diseases, with links to related databases and allied resources.

Literature Databases

To validate the findings generated through computer-based comparative analysis, it is essential to consider the results of wet-bench biology reported in the scientific literature. Therefore, the integration of scientific data with the literature is a necessary step for creating a unified information resource in the life sciences. To this end, users are provided with a direct link from numerous NCBI resources to PubMed, NCBI's literature retrieval system. PubMed provides Web-based access to over 11 million citations, abstracts, and indexing terms for journal articles in the biomedical sciences. It also includes links to full-text journals.

PubMed Central (PMC), a digital archive of life sciences journal literature, was launched in January 2001 and offers a new model for electronic scientific communication and data retrieval. The value of PubMed Central, in addition to its role as an archive, lies in what can be done when data from diverse sources is stored in a common format in a single repository. PMC currently provides free and unrestricted access to the full text of life sciences journals.

Model Organisms for Biomedical Research

The public mouse sequencing effort is also proceeding rapidly. The desire to accumulate mouse genome sequences builds on the completion in June 2000 of the working draft version of the human sequence. The ultimate goals of the project include the construction of a physical map and a high-quality, finished sequence of the mouse, as these data will provide an essential tool to identify and study the function of human genes. The mouse genome sequence will also increase the ability of scientists to use the mouse as a model system to study and understand human disease. All sequence data generated from this project are deposited rapidly in GenBank.
Data is available from NCBI's Trace Archive database. The mouse reads are currently being compared to the human genome, and homologous reads have been laid out along the human draft sequence. Mouse data is also being accumulated in both the RefSeq and LocusLink databases, and investigators have begun to assemble the dataset in order to generate larger contigs. The mouse reads are of immediate use for both human and mouse genetics, and there are already examples of mouse genes that have been cloned using the available public information.

The mapping and sequencing of the genomes of all model organisms are critical to the effort to characterize, sequence, and interpret the human genome. Therefore, NCBI is also working towards the development and expansion of resources to facilitate biomedical research using other model organisms, including the rat, S. cerevisiae (budding yeast), C. elegans (roundworm), D. melanogaster (fruit fly), and Arabidopsis thaliana (a small flowering plant).

Building an Information Infrastructure

The genomic information resources developed and disseminated by NCBI investigators have contributed significantly to the advancement of the basic sciences and serve as a wellspring of new methods and approaches for applied research activities. The value of these integrated resources will continue to grow, as NCBI has made a long-term commitment to meet the challenge of designing, developing, disseminating, and managing the tools and technologies enabling the gene discoveries that will significantly impact health in the 21st century.

Potential Applications and the Future of Medicine

SNPs are ideal elements for constructing a genomic map to aid in analyzing the human genome, especially as they have a significant influence on disease processes.
Analysis of the draft human genomic sequence has already led to the identification of genes for cystic fibrosis, breast cancer, hereditary deafness, hereditary skeletal disorders, and a form of diabetes--just to name a few. The draft sequence has also been used to identify an enormous number of SNPs, or single base variations in the genetic code that play a significant role in the disease process. These discoveries, as well as future discoveries, will have a profound impact on the future conduct of biomedical research.

The translation of basic science advances into the clinical arena promises to revolutionize the practice of medicine. In the coming years, clinicians will be able to help their patients in ways they never thought possible. Physicians will be able to rapidly diagnose existing genetic diseases; predetermine genetic risk for developing a disease; design novel therapeutic agents for the treatment and prevention of disease, rather than the treatment of the underlying symptoms; and prescribe a medical intervention based on a person's genetic information, reducing the chance of an allergic, or otherwise detrimental, drug reaction.

Our Interactive Web Site

NCBI's Human Genome site pulls together a suite of its key resources available for human genome research. Through this interactive Web site, researchers may:

Access the draft human genomic DNA sequences generated by the Sequencing Centers involved in the Project;

View and explore NCBI's assembled and annotated version of the human genome--either chromosome by chromosome or by searching for biologically important regions of the genomic sequence; and

Apply one of NCBI's many sophisticated software tools to further analyze a portion of the genomic sequence that may be of particular interest.

Revised January 29, 2002