Bioinformatics Introduction

Introduction
Bioinformatics is the new hot topic after software. In the coming days there will be huge demand for bioinformatics professionals in all sectors of biotechnology, pharmaceuticals, and the biomedical sciences. According to The Tribune, "Globally, the biotech computing sector is estimated to touch a whopping $30 billion by 2003 and $60 billion in 2005."
What is Bioinformatics?
Bioinformatics is the use of IT in biotechnology for data storage, data warehousing, and the analysis of DNA sequences. It draws on many branches of knowledge: biology, mathematics, computer science, the laws of physics and chemistry, and of course a sound knowledge of IT for analyzing biotech data. Bioinformatics is not limited to computing on data; it can be used to solve many biological problems and to find out how living things work.
Skills Required to Become a Successful Bioinformatician
As mentioned earlier, the bioinformatics profession requires a wide range of skills, and it is not possible to learn all of them. Here are the topics most essential for entering this profession.
- Molecular Biology
- Central Dogma of molecular biology
- Experience with one or more molecular biology software packages. Learn to use sequence analysis and molecular modeling software. Some of the molecular biology packages are GCG, BLAST, FASTA, etc.
- Learn Unix or Linux
Unix and Linux (free and open source) are used extensively in biotechnology for their robustness and for the tools and software available on these platforms, so it is very important to learn these operating systems.
- A bioinformatician should know computer programming languages such as C/C++, Perl or Python, Java, and HTML.
- Database Management Systems
Learn Oracle and MySQL (a free database server), which are used extensively to store gigabytes of biotech data for further analysis.
Bioinformatician's Job Profile
These days the jobs available in bioinformatics are mainly related to the design and implementation of software systems (bioinformatics systems) for data warehousing and for the analysis of DNA sequences, protein structures, and so on. A bioinformatics job may include data mining, database administration (DBA), and the development of systems for diagnostic kits, bioinformatics software, proteomics (the structure and function of proteins) and genomics (the expression and function of genes), as well as publishing biotechnological data and research papers on the web.
Bioinformatics courses can help IT professionals, scientists and managers involved in the implementation of large bioinformatics systems. Science students and graduates interested in biotechnology and genetic engineering can also take bioinformatics courses. Postgraduate degree courses in bioinformatics are highly rewarding, but diploma courses (online and academic) provided by institutes are equally valuable if you are good at, and highly productive in, your work.
Overview of Bioinformatics
Introduction
Biology is in the middle of a major paradigm shift driven by computing technology. Although it is already an informational science in many respects, the field has been rapidly becoming much more computational and analytical. Rapid progress in genetics and biochemistry research combined with the tools provided by modern biotechnology has generated massive volumes of genetic and protein sequence data.
Bioinformatics has been defined as a means for analysing, comparing, graphically displaying, modeling, storing, systemising, searching, and ultimately distributing biological information, which includes sequences, structures, function, and phylogeny. Thus bioinformatics may be defined as a discipline that generates computational tools, databases, and methods to support genomic and postgenomic research. It comprises the study of DNA structure and function, gene and protein expression, protein production, structure and function, genetic regulatory systems, and clinical applications. Bioinformatics needs expertise from computer science, mathematics, statistics, medicine, and biology.
Knowledge Base in Biology
In the last 10 years or so, numerous innovations have come to light, and the consequence is a new biological research paradigm, one that is information-heavy and computer-driven. As genetic information is gathered into computerized databases whose sizes are steadily growing, molecular biologists need effective and efficient computational tools to store and retrieve the associated bibliographic or biological information, to analyze the sequence patterns the databases contain, and to extract the biological knowledge held in the sequences. There is also a strong need for mathematical methods and computational techniques for challenging tasks such as predicting the three-dimensional structure of the molecules the sequences represent and constructing evolutionary trees from sequence data. These tools will also be used to learn basic facts about biology, such as which sequences of DNA code for proteins and which do not, and to gain a greater understanding of genes and how they influence diseases.
Biology employs a digital language to represent its information, using a four-letter alphabet (A, C, G, T). All the chromosomes in an organism's cells are represented and identified using these letters. The demanding challenge is to determine how this digital language of the chromosomes is converted into the three-dimensional, and sometimes four-dimensional, languages of living and breathing organisms.
Information Technology in Biology
Because performing all of the tasks mentioned above manually is nearly impossible, given the massive volumes of biological data and the precision the work demands, it became mandatory to use computers for these purposes. Bioinformatics therefore deals with designing and deploying efficient software tools that accomplish these tasks quickly and precisely. Bridging the gap between the real world of biology and the precise, logical nature of computers requires an interdisciplinary perspective.
Software and Hardware Advancements in Biology
The tools of computer science, statistics, and mathematics are critical for studying biology as an informational science.
Recent advances include improved DNA sequencing methods, new approaches to identifying protein structure, and revolutionary methods for monitoring the expression of many genes in parallel. The design of techniques able to deal with different sources of incomplete and noisy data has become another crucial goal for the bioinformatics community. In addition, there is a need to implement computational solutions based on theoretical frameworks that allow scientists to perform complex inferences about the phenomena under study.
Genomics in the recent past has triggered the development of high-throughput instrumentation for DNA sequencing, DNA arrays, genotyping, proteomics, etc. These instruments have catalyzed a new type of science for biology termed discovery science.
Human Genome Project - An Introduction
The Human Genome Project has encouraged a series of paradigm changes leading to the view that biology is an informational science. The draft of the human genome has given us a genetic parts list of what is necessary for building a human: approximately 35,000 genes, their regulatory regions, a lexicon of motifs that are the building-block components of proteins and genes, and access to the human variability that makes each of us different from one another.
Genomes - Discovering Methodology and Study
Discovery science defines all of the elements in a biological system: for example, the sequence of the genome, or the identification and quantitation of all of the mRNAs or proteins in a particular cell type (respectively the genome, the transcriptome, and the proteome). Discovery science creates databases of information, in contrast to the more classical hypothesis-driven science that formulates hypotheses and attempts to test them. The high-throughput tools both provide the means for discovery science and can assay how global information sets, for example transcriptomes or proteomes, change as systems are perturbed.
The genomes of model organisms such as yeast, worm, and fly have demonstrated the fundamental conservation of the basic informational pathways among all living organisms. Hence systems can be perturbed in model organisms to gain insight into their functioning, and these data will provide fundamental insights into human biology. From the genome, the information pathways and networks can be extracted to begin understanding their logic of life. Furthermore, different genomes can be compared to identify similarities and differences in the strategies for the logic of life, and these provide fundamental insights into development, physiology and evolution. The first eukaryotic genome to be fully sequenced and annotated was that of Saccharomyces cerevisiae. It is a great help in developing biological and computational tools for genomic and postgenomic research.
In the era of automated DNA sequencing and revolutionary advances in DNA sequence analysis, the attention of many researchers is shifting away from the study of single genes or small gene clusters to whole-genome analyses. Knowing the complete sequence of a genome is only the first step in understanding how the myriad pieces of information contained within the genes are transcribed and ultimately translated into functional proteins. In the post-genomic era, functional genomic and proteomic studies help to obtain an image of the dynamic cell.
Systems Biology
Biology is a highly informational science. There are mainly two types of biological information.
- The information of genes or proteins, which are the molecular machines of life
- The information of the regulatory networks that coordinate and specify the expression patterns of the genes and proteins.
All biological information is hierarchical. DNA is transcribed into mRNA, which in turn is translated into protein. Proteins take part in protein interactions, which create informational pathways. These pathways form informational networks, which in turn make up cells. Cells form networks of cells, and an individual is a collection of such cells. A host of individuals forms a population, and a variety of populations makes up an ecology. This hierarchy poses a primary challenge for researchers and scientists: to create tools and mechanisms that capture these different levels of biological information and integrate them to gain insight into how they function.
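The first two levels of this hierarchy, DNA to mRNA to protein, can be illustrated with a minimal Python sketch. The function names `transcribe` and `translate` are hypothetical and chosen only for this illustration. The sketch assumes that the input is the coding strand of the DNA, that reading starts at the first base, and that the standard genetic code applies; real genes involve introns, regulation, and alternative codes that are ignored here.

```python
# Standard genetic code, written compactly: codons are enumerated in
# U, C, A, G order and paired with one-letter amino acid codes
# ("*" marks a stop codon).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(coding_dna):
    """Return the mRNA for a coding-strand DNA sequence (T becomes U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """Translate mRNA codon by codon from position 0 until a stop codon.

    Codons containing unexpected letters are reported as "X";
    an incomplete trailing codon is ignored.
    """
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    mrna = transcribe("ATGTTTGGCTAA")
    print(mrna, translate(mrna))
```

The higher levels of the hierarchy (interactions, pathways, networks) have no comparably simple encoding, which is part of the integration challenge described above.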
All of these paradigm shifts lead to the view that the major challenge for biology and medicine in this new century will be the study of complex systems, together with the approach needed to study these biological complexities. One viable approach is as follows.
- Identify all elements in the system, such as the sequences of genomes, with currently available discovery tools
- Use current knowledge of the system to formulate a model predicting its behavior
- Perturb the system in a model organism using biological, genetic or environmental perturbations, capture information at all relevant levels, such as DNA, mRNA, protein, protein interactions, etc., and integrate the collected information
- Compare theoretical predictions and experimental data, carry out additional perturbations to bring theory and experiment into closer agreement, and integrate the new data into the model
- Iterate the previous two steps until the mathematical model can predict the structure of the system and its systems or emergent properties given particular perturbations.
Systems Biology - Challenges Ahead
- The integration of technology, biology, and computation.
- The integration of the various levels of biological information and their modeling.
- The proper annotation of biological information and its storage and integration in databases.
- The inclusion of other molecules, large and small, in the systems approach.
- The integration imperatives of systems biology present many challenges to industry and academia.
Conclusion
With the confluence of biology and computer science, the computer applications of molecular biology are drawing greater attention from life science researchers these days. As it becomes imperative for biologists to seek the help of information technology professionals to meet the ever-growing computational requirements of a host of exciting and pressing biological problems, the synergy between modern biology and computer science is set to blossom in the days to come. The research scope for mathematical techniques and algorithms, coupled with programming languages and software development and deployment tools, is therefore set to get a real boost. In addition, information technologies such as databases, middleware, graphical user interface (GUI) design, distributed object computing, storage area networks (SAN), data compression, networking and communication, and remote management are all set to play a critical role in advancing the goals for which the bioinformatics field came into existence.
Definition of Bioinformatics
About Bioinformatics
In February 2001, the human genome was finally deciphered! In other words, scientists succeeded in reading the chain of more than 3 billion base pairs that constitute human DNA; this process is called sequencing. That daunting task required new analytical methods created by bioinformatics. The challenge was broad: identify all the genes and associate them with specific functions (the field of genomics), predict the structure of the proteins for which they code (the field of proteomics), and compare the roles of certain genes with those of other species in the living world (using biochips, for example).
The Definition of Bioinformatics
Bioinformatics is the analysis of biological information using computers and statistical techniques; the science of developing and utilizing computer databases and algorithms to accelerate and enhance biological research. Bioinformatics is more a set of tools than a discipline: the tools for the analysis of biological data.
The National Center for Biotechnology Information (NCBI 2001) defines bioinformatics as:
"Bioinformatics is the field of science in which biology, computer science, and information technology merge into a single discipline. There are three important sub-disciplines within bioinformatics: the development of new algorithms and statistics with which to assess relationships among members of large data sets; the analysis and interpretation of various types of data including nucleotide and amino acid sequences, protein domains, and protein structures; and the development and implementation of tools that enable efficient access and management of different types of information."
From Webopedia:
The application of computer technology to the management of biological information. Specifically, it is the science of developing computer databases and algorithms to facilitate and expedite biological research. Bioinformatics is being used largely in the field of human genome research by the Human Genome Project that has been determining the sequence of the entire human genome (about 3 billion base pairs) and is essential in using genomic information to understand diseases. It is also used largely for the identification of new molecular targets for drug discovery.
The three terms bioinformatics, computational biology and bioinformation infrastructure are often used interchangeably. They may be defined as follows:
- bioinformatics refers to database-like activities, involving persistent sets of data that are maintained in a consistent state over essentially indefinite periods of time;
- computational biology encompasses the use of algorithmic tools to facilitate biological analyses; while
- bioinformation infrastructure comprises the entire collective of information management systems, analysis tools and communication networks supporting biology. Thus, the latter may be viewed as a computational scaffold of the former two.
Path to the Bioinformatics
- First Learn Biology.
- Decide and pick a problem that interests you for experiment.
- Find and learn about the Bioinformatics tools.
- Learn the Computer Programming Languages.
- Experiment on your computer and learn different programming techniques.
The computer has become an essential tool for the biologist, just like the microscope. Eventually bioinformatics will become an integral part of biology.
History of Bioinformatics
Modern bioinformatics can be classified into two broad categories, biological science and computational science. Here is a record of historical events for both biology and computer science.
Introduction:
The history of biology in general, from B.C. up to the discovery of genetic inheritance by G. Mendel in 1865, is extremely sketchy and inaccurate. Mendel's discovery was the start of bioinformatics history. Gregor Mendel is known as the "Father of Genetics". He experimented with the cross-fertilization of differently colored varieties of the same species, and he carefully recorded and analyzed the data. Mendel illustrated that the inheritance of traits could be more easily explained if it was controlled by factors passed down from generation to generation.
The understanding of genetics has advanced remarkably in the last thirty years. In 1972, Paul Berg made the first recombinant DNA molecule using ligase. In that same year, Stanley Cohen, Annie Chang and Herbert Boyer produced the first recombinant DNA organism. In 1973, two important things happened in the field of genomics. The advancement of computing in the 1960s and 1970s resulted in the basic methodology of bioinformatics. However, it was in the 1990s, when the Internet arrived, that the full-fledged bioinformatics field was born.
BioInformatics Tools
Bioinformatics tools are software programs for storing, retrieving and analyzing biological data and for extracting information from it.
Factors that must be taken into consideration when designing these tools are:
- The end user (the biologist) may not be a frequent user of computer technology, so the tools should be very user friendly.
- These software tools must be made available over the internet given the global distribution of the scientific research community.
Bioinformatics tools may be grouped into the following categories:
- Homology and Similarity Tools
- Protein Function Analysis
- Structural Analysis
- Sequence Analysis
Homology and Similarity Tools
The term homology implies a common evolutionary relationship between two traits, whether they are DNA sequences or bristle patterns on a fly's nose. Homologous sequences are sequences that are related by divergence from a common ancestor. Thus the degree of similarity between two sequences can be measured, while their homology is a case of being either true or false. This set of tools can be used to identify similarities between novel query sequences of unknown structure and function and database sequences whose structure and function have been elucidated.
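To make the distinction concrete: similarity is a number that can be computed, for example as percent identity over an existing alignment. The sketch below assumes the two sequences have already been aligned to the same length with "-" as the gap character; the function name `percent_identity` and the choice to count gapped columns as mismatches are assumptions made for illustration, and other tools define the denominator differently.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of alignment columns in which both sequences carry the same residue.

    Columns where both sequences have a gap are skipped; columns where
    only one sequence has a gap count as compared but not identical.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    matches = 0
    compared = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == gap and y == gap:
            continue  # column carries no information
        compared += 1
        if x == y:
            matches += 1

    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


if __name__ == "__main__":
    print(round(percent_identity("ACGT-A", "ACCTTA"), 1))
```

A high score suggests, but does not by itself establish, homology; that inference still rests on a statistical or evolutionary argument.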
Protein Function Analysis
Function analysis is the identification and mapping of all functional elements (both coding and non-coding) in a genome. This group of programs allows you to compare your protein sequence to the secondary (or derived) protein databases that contain information on motifs, signatures and protein domains. Highly significant hits against these different pattern databases allow you to approximate the biochemical function of your query protein.
Structural Analysis
This set of tools allows you to compare structures with the known structure databases. The function of a protein is more directly a consequence of its structure than of its sequence, with structural homologs tending to share functions. The determination of a protein's 2D/3D structure is crucial in the study of its function.
Sequence Analysis
This set of tools allows you to carry out further, more detailed analysis on your query sequence including evolutionary analysis, identification of mutations, hydropathy regions, CpG islands and compositional biases. The identification of these and other biological properties are all clues that aid the search to elucidate the specific function of your sequence.
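Two of the properties mentioned here, compositional bias and CpG content, reduce to simple counting. The following sketch computes the GC fraction of a sequence and the observed/expected CpG ratio, taken as (number of CG dinucleotides x sequence length) / (number of C x number of G). The function names are hypothetical, and the choice of window size and thresholds for calling a CpG island is deliberately left out.

```python
def gc_content(seq):
    """Fraction of bases in seq that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_observed_expected(seq):
    """Observed/expected CpG ratio for one strand of DNA.

    Expected CG dinucleotides are estimated from the C and G counts;
    the ratio is reported as 0.0 when the sequence has no C or no G.
    """
    seq = seq.upper()
    c_count = seq.count("C")
    g_count = seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    cpg_count = seq.count("CG")
    return cpg_count * len(seq) / (c_count * g_count)


if __name__ == "__main__":
    example = "CGCGCGAT"
    print(gc_content(example), cpg_observed_expected(example))
```

In practice these measures are evaluated over a sliding window along the sequence rather than over the whole sequence at once.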
Bioinformatics Tools
BLAST:
The Basic Local Alignment Search Tool (BLAST), for comparing gene and protein sequences against others in public databases, now comes in several types, including PSI-BLAST, PHI-BLAST, and BLAST 2 Sequences. Specialized BLASTs are also available for human, microbial, malaria, and other genomes, as well as for vector contamination, immunoglobulins, and tentative human consensus sequences.
FASTA
A database search tool used to compare a nucleotide or peptide sequence to a sequence database. The program is based on the rapid sequence algorithm described by Lipman and Pearson. It was the first widely used algorithm for database similarity searching. The program looks for optimal local alignments by scanning the sequence for small matches called "words". Initially, the scores of segments in which there are multiple word hits are calculated ("init1"). Later the scores of several segments may be summed to generate an "initn" score. An optimized alignment that includes gaps is shown in the output as "opt". The sensitivity and speed of the search are inversely related and controlled by the "k-tup" variable which specifies the size of a "word".
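The word-matching idea can be sketched in a few lines of Python. This is not the FASTA program itself, only an illustration of its first step under simplifying assumptions: exact word matches, no substitution matrix, and no joining or gapped optimization of segments. The names `word_index`, `diagonal_hits` and `best_diagonal` are hypothetical. Each shared word of length `ktup` is recorded on the diagonal (subject position minus query position) where it occurs, so a diagonal with many hits marks a candidate region of similarity.

```python
from collections import Counter, defaultdict


def word_index(seq, ktup):
    """Map every word of length ktup in seq to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - ktup + 1):
        index[seq[i:i + ktup]].append(i)
    return index


def diagonal_hits(query, subject, ktup=2):
    """Count exact word matches per diagonal (subject start minus query start)."""
    index = word_index(query, ktup)
    hits = Counter()
    for j in range(len(subject) - ktup + 1):
        # .get avoids adding empty entries to the defaultdict
        for i in index.get(subject[j:j + ktup], ()):
            hits[j - i] += 1
    return hits


def best_diagonal(query, subject, ktup=2):
    """Return (diagonal, hit_count) for the richest diagonal, or None if no hits."""
    hits = diagonal_hits(query, subject, ktup)
    if not hits:
        return None
    return hits.most_common(1)[0]


if __name__ == "__main__":
    print(best_diagonal("ACGTAC", "TTACGTAC", ktup=3))
```

Raising `ktup` shrinks the number of word hits that must be examined, which speeds up the search at the cost of missing weaker similarities; that is the speed and sensitivity trade-off described above.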
EMBOSS
EMBOSS (The European Molecular Biology Open Software Suite) is a new, free open source software analysis package specially developed for the needs of the molecular biology user community. Within EMBOSS you will find around 100 programs (applications) for sequence alignment, database searching with sequence patterns, protein motif identification and domain analysis, nucleotide sequence pattern analysis, codon usage analysis for small genomes, and much more.
ClustalW
ClustalW is a general purpose multiple sequence alignment program for DNA or proteins. It produces biologically meaningful multiple sequence alignments of divergent sequences, calculates the best match for the selected sequences, and lines them up so that the identities, similarities and differences can be seen.
RasMol
It is a powerful research tool to display the structure of DNA, proteins, and smaller molecules. Protein Explorer, a derivative of RasMol, is an easier to use program.
Application Programs
JAVA in Bioinformatics:
Owing to its platform independence, Java is emerging as a key player in bioinformatics. Physiome Sciences' computer-based biological simulation technologies and Bioinformatics Solutions' PatternHunter are two examples of the growing adoption of Java in bioinformatics.
Perl in Bioinformatics:
Perl is also used for processing biological data. One example of a Perl project is the BioPerl project.
Bioinformatics Projects:
BioJava:
The BioJava project provides Java tools for processing biological data.
BioPerl:
The BioPerl project provides many modules for biological data processing.
BioXML:
A part of the BioPerl project, this is a resource to gather XML documentation, DTDs and XML aware tools for biology in one location.
Application of Bioinformatics in various Fields
Bioinformatics is the use of IT in biotechnology for data storage, data warehousing, and the analysis of DNA sequences. It draws on many branches of knowledge: biology, mathematics, computer science, the laws of physics and chemistry, and of course a sound knowledge of IT for analyzing biotech data. Bioinformatics is not limited to computing on data; it can be used to solve many biological problems and to find out how living things work.
It is the comprehensive application of mathematics (e.g., probability and statistics), science (e.g., biochemistry), and a core set of problem-solving methods (e.g., computer algorithms) to the understanding of living systems.
Bioinformatics is being used in the following fields:
- Molecular medicine
- Personalised medicine
- Preventative medicine
- Gene therapy
- Drug development
- Microbial genome applications
- Waste cleanup
- Climate change Studies
- Alternative energy sources
- Biotechnology
- Antibiotic resistance
- Forensic analysis of microbes
- Bio-weapon creation
- Evolutionary studies
- Crop improvement
- Insect resistance
- Improve nutritional quality
- Development of drought-resistant varieties
- Veterinary science
Bioinformatics Resources on the Web
Here are some of the bioinformatics resources on the Internet.
- Search Databases
different searches against different databases
- General Nucleotide Sequence Databases
Some general nucleotide sequence databases
- Specific Human Genome Databases
Collection of human genome databases
- Specific Genome Databases of all Other Species
Collection of genome databases of all other species
- Online Tools and Protocols
Online Tools and Protocols links
- Bio-Journals -- a big collection
This is a combination of Pedro's Collection, Springer, Oxford, and APNet, updated by us.
- NCBI - Established in 1988 as a national resource for molecular biology information, NCBI creates public databases, conducts research in computational biology, develops software tools for analyzing genome data, and disseminates biomedical information - all for the better understanding of molecular processes affecting human health and disease.
- EBI - The European Bioinformatics Institute (EBI) is a non-profit academic organisation that forms part of the European Molecular Biology Laboratory (EMBL).
- DDBJ - DDBJ (DNA Data Bank of Japan) began DNA data bank activities in earnest in 1986 at the National Institute of Genetics (NIG).
DDBJ has been functioning as the international nucleotide sequence database in collaboration with EBI/EMBL and NCBI/GenBank.
DNA sequence records organismic evolution more directly than other biological materials and thus is invaluable not only for research in life sciences but also human welfare in general. The databases are, so to speak, a common treasure of human beings. With this in mind, we make the databases online accessible to anyone in the world.
- Feature Table Definition - the format of entries in these databases. DNA Data Bank of Japan, Mishima, Japan. EMBL Nucleotide Sequence Database, Cambridge, UK. GenBank, NCBI, Bethesda, MD, USA.






