Datamining to solve pediatric brain tumors
SPSS and Children’s Memorial Hospital (Chicago)
Chicago, Illinois
United States

Year: 2004
Status: Finalist
Category: Medicine
Nominating Company: Morgan Stanley

A scientist at a major children’s hospital uses data mining technology to detect patterns in the large amount of data generated from a child brain tumor sample, advancing research in this important field.
Each year, nearly 3,000 children in the U.S. are diagnosed with brain tumors. Almost half will die within five years, making it the most fatal cancer among children. If a child does survive a brain tumor, the long-term effects can be significant, and can include neurological disabilities, retardation and psychological problems. Beyond surgery, successful treatments for pediatric brain tumors are rare.
Dr. Eric Bremer, director of brain tumor research at Children’s Memorial Hospital in Chicago, is one of the leading scientists searching for a better way to treat pediatric brain tumors. One of Dr. Bremer’s main goals is to build a gene expression database for pediatric brain tumors, and to then correlate this with both past and ongoing research on effective treatments. As a result of the mapping of the human genome, researchers have gained new tools to study these genetic variations, but the work can quickly produce an overwhelming amount of data. His challenge is to make sense of the 7,000 to 30,000 data points for each brain tumor sample.
To do this, Dr. Bremer uses Clementine®, a data mining workbench from SPSS Inc., which enables him to quickly analyze this voluminous amount of data in different ways, and identify patterns and relationships. As an example of Clementine’s use, Dr. Bremer combined his own data with that of a publicly available data set resulting in a total of 133 tumor samples from the six major pediatric brain tumor types. Clementine classified these tumors with greater than 95 percent accuracy. He then uses SPSS’ LexiQuest™ Mine, a text mining technology, to sift through mountains of scientific literature to extract patterns that, for example, when combined with genetic patterns identified from his brain tumor database with the help of Clementine, can be used to help him evaluate prime drug targets that would form the basis for a cancer cure – Dr. Bremer’s ultimate goal.
Background: The emerging area of bioinformatics, which is the use of computers and information technology to identify and solve biology problems, has greatly benefited medical research. Bioinformatics first attracted notice when it was used in the Human Genome Project to isolate valuable gene sequences from the DNA with no value. One of the main goals of bioinformatics today is the important task of assigning functions to the 40,000 protein-encoding genes and determining how they interact with each other. Microarray analysis provides a snapshot of gene activity for the entire genome by telling the investigators which genes are expressed for the condition under study. A single microarray analysis experiment that examines 40,000 genes from ten different samples, under 20 different conditions, produces at least eight million pieces of information. Because of the vast amounts of data being generated through microarray analysis, scientists like Dr. Bremer, who have mastered data manipulation, will move their critical research forward.
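
The eight-million figure follows from multiplying the three quantities quoted in the paragraph above. A quick check, with variable names chosen here purely for illustration:

```python
# Figures are taken from the text; the variable names are illustrative.
genes = 40_000
samples = 10
conditions = 20

# One measurement per gene, per sample, per condition.
data_points = genes * samples * conditions
print(f"{data_points:,}")  # 8,000,000
```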

In the case of Dr. Bremer’s research on pediatric brain tumors, one of the first requisites for the pediatric brain cancer treatment database is accurately classifying the tumor. The vast majority of pediatric brain tumors can be categorized as gliomas, ependymomas and medulloblastomas; altogether there are 12 or so tumor types, including subtypes. Classification is very subjective because pediatric brain tumors are often difficult to distinguish by appearance, and there are few objective markers such as those found for other childhood cancers like leukemia.

In addition to tumor type, it is important to classify a tumor’s stage or grade. Cancers are generally stratified into four stages, from the most benign (stage I) to the most malignant (stage IV). Neuropathologists often find it difficult to distinguish between intermediate stages. However, treatment between two stages can be drastically different, and incorrect staging can have dramatic consequences for the patient. For example, if a child with a stage II tumor is misdiagnosed with a more advanced grade, the child would unnecessarily receive more aggressive treatment. This not only results in needless pain but could also lead to long-term damage.


A key to accurate classification of pediatric brain tumors lies at the molecular level. Just as a skin cell and liver cell vary in their gene expression patterns, the same is true for different tumors or tumor grades. Dr. Bremer captures these differences with gene expression microarray experiments. Each tumor sample may generate 7,000 to 30,000 data points, and he must turn this huge amount of raw data into information he can use. In order to do this, Dr. Bremer turned to data mining, a technology that discovers the meaningful patterns and relationships in large volumes of complex data.


Using Clementine, a data mining workbench from SPSS Inc., Dr. Bremer combined his own data with that of the publicly available data set, resulting in a total of 133 tumor samples from the six major pediatric brain tumor types. Clementine, which offers a pre-built stream (a flow of data mining steps) specifically for microarray analysis, successfully classified these tumors with greater than 95 percent accuracy. While these samples were well described pathologically, they served as a test case that bodes well for future pediatric brain tumor classification, especially difficult-to-classify tumors.

However, classification using data obtained from gene expression microarray experiments is just part of a bigger picture. Expression microarrays measure certain ribonucleic acid (RNA), a chemical similar to a single strand of DNA, and not the final gene products. Dr. Bremer’s database will eventually incorporate clinical, pathological and biochemical information to provide as complete and accurate a picture as possible. In addition, patient outcome information will be added so that the database may reveal which treatments work best with brain tumors sharing genetic and pathological characteristics. This is the point at which bioinformatics transitions to "biomedical informatics."

Additionally, Dr. Bremer augments his microarray analysis findings with other existing research. Rather than spend his valuable time poring over thousands of scientific articles and documents to find the information relevant to his research, he uses another analytic tool, text mining, to quickly extract and analyze important concepts contained within the documents. To do this, Dr. Bremer selected LexiQuest, SPSS Inc.’s linguistics-based text mining technology, which is capable of reading 250,000 pages of text per hour. LexiQuest’s extraction engine has been integrated with Clementine so the text and the data tables can be used together to present a more complete view of the research.

Dr. Bremer realizes the critical need to share data and knowledge. Clementine enables Dr. Bremer’s data to be combined with data from other researchers. To this end, he plans to deploy his data mining results via a Web server so that any pediatric brain tumor researcher can submit files from their own microarray experiments for analysis, and then receive predictive responses. In turn, the added data can be used to update and add value to the original data set; the more samples his database contains, the more accurate it will be and the more lives that will be saved.

As a result of Dr. Bremer’s and other researchers’ work, types of pediatric brain tumors can be more accurately diagnosed, and the life expectancy for children with brain tumors has grown from five months to 39 months.

Dr. Bremer discovered Clementine and LexiQuest Mine at the Biological Sciences and Data Mining conference in Washington DC in September 2002. Initially, SPSS Inc. consultants worked with Dr. Bremer and his staff to analyze his microarray data using Clementine, SPSS Inc.’s data mining workbench. Pre-built streams (a flow of data mining steps) called Microarray Clementine Application Templates (CAT) comprise a basic outline for analyzing gene expression data, and represent the best analytical practices in the field.

The appeal of Clementine was due largely to the many ways in which it allowed Dr. Bremer to analyze his research data. Also, it allowed him to more easily communicate his findings to his team of 10 researchers, all based in Chicago.

Dr. Bremer used two of Clementine’s predictive models, artificial neural networks and decision trees, to analyze and classify his data. Information from one model complemented the other. While the neural network resulted in a more accurate classification, it didn’t show how it actually accomplished the classification. The decision tree, however, showed precisely how the tumors were classified and revealed potential gene markers that characterized certain cancers. Once these markers are validated, labs without microarray technology could use this information to develop antibodies against the gene product (protein) as an alternate means of diagnosis.
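
The interpretability advantage of a decision tree can be illustrated with a one-level tree (a "stump") that searches for the single gene and threshold that best separate two tumor classes. This is only a minimal sketch and not Clementine's algorithm: the function name, gene names and expression values below are all invented for illustration.

```python
def best_split(samples, labels):
    """Find the single (gene, threshold) rule that misclassifies the fewest samples.

    samples: list of dicts mapping gene name -> expression value
    labels:  list of class names, one per sample (exactly two classes assumed)
    Returns (errors, gene, threshold, class_if_low, class_if_high).
    """
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("this sketch handles exactly two classes")
    best = None
    for gene in samples[0]:
        values = sorted({s[gene] for s in samples})
        # Candidate thresholds: midpoints between adjacent distinct values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            # Try both ways of assigning the two classes to the two sides.
            for low_cls, high_cls in (classes, classes[::-1]):
                errors = sum(
                    (low_cls if s[gene] <= threshold else high_cls) != y
                    for s, y in zip(samples, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, gene, threshold, low_cls, high_cls)
    return best


# Hypothetical toy data: two genes, four tumor samples.
samples = [
    {"GENE_A": 2.1, "GENE_B": 7.9},
    {"GENE_A": 2.4, "GENE_B": 3.1},
    {"GENE_A": 8.7, "GENE_B": 7.5},
    {"GENE_A": 9.2, "GENE_B": 2.8},
]
labels = ["ependymoma", "ependymoma", "medulloblastoma", "medulloblastoma"]

errors, gene, threshold, low_cls, high_cls = best_split(samples, labels)
print(f"if {gene} <= {threshold:.2f}: {low_cls}, else: {high_cls} ({errors} errors)")
```

The printed rule names the gene doing the work, which is the kind of candidate marker the paragraph above describes. A neural network trained on the same data would offer no comparable human-readable rule.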

Just as important as the model is the data that the model is based on. Microarray data analysis presents a number of challenges given the small number of samples and large number of genes. Dr. Bremer’s data set had 133 samples, but nearly 7,000 variables. The microarray Clementine Application Template is based on real-world experiences to overcome these challenges.
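
The source does not say how the template handles the imbalance between samples and variables. One common, generic way to shrink such a data set, shown here only as an assumed illustration and not as the template's method, is to keep the genes whose expression varies most across samples. The function name and toy values are hypothetical.

```python
from statistics import pvariance


def top_variance_genes(matrix, gene_names, k):
    """Keep the k genes whose expression varies most across samples.

    matrix: list of samples, each a list of values in gene_names order
    """
    columns = list(zip(*matrix))  # one tuple of values per gene
    ranked = sorted(
        range(len(gene_names)),
        key=lambda j: pvariance(columns[j]),
        reverse=True,
    )
    return [gene_names[j] for j in ranked[:k]]


# Hypothetical toy matrix: 4 samples x 3 genes (real data: 133 x ~7,000).
gene_names = ["GENE_A", "GENE_B", "GENE_C"]
matrix = [
    [5.0, 1.0, 3.0],
    [5.1, 9.0, 3.2],
    [4.9, 1.2, 6.5],
    [5.0, 8.8, 6.4],
]

# GENE_A barely changes between samples, so it is dropped.
print(top_variance_genes(matrix, gene_names, 2))
```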

It is also important that differences in gene expression values genuinely reflect biological variation and not artificial differences introduced during sample preparation. Clementine helps assure this is the case by processing the data quality parameters along with the samples. Clementine also simplifies the task of feeding data into the model. Prior to Clementine, Dr. Bremer had to organize the expression values in a specific format dependent on the requirements of the analysis package. Dr. Bremer now skips this step because Clementine streams can automatically prepare and accept gene expression data directly from his database - a huge time saver.

According to Dr. Bremer, “There is no one right way to look at the data. Clementine provided me with the ability to examine data from a number of different perspectives. I can select a viewing method, save it, and reuse it later if the method proves valuable. What’s so great about Clementine is that it’s a workbench and we can keep track of exactly what we have done. It’s easy to try different methodologies to find a method that works best.”

Dr. Bremer selected SPSS Inc.’s data mining and text mining technology because of the ability to combine the analyses from both these analytical technologies to provide him with a more complete view.

Clementine provides for a full range of data mining activities. As a workbench, Clementine allows for access to multiple data stores and types, provides a variety of data preprocessing tools, multiple predictive modeling techniques, interactive visualization, methods for interpreting results and an external module interface for the addition of external tools. Clementine's Solution Publisher enables the deployment of the entire analysis (not just the model) and makes for easy updating of solutions as well.

LexiQuest Mine, which is linguistics-based and is capable of reading 250,000 pages of text per hour, is used for extracting biologically relevant concepts from large collections of text-based data, such as MedLine articles and patent documents. Additionally, LexiQuest Mine's linguistic extraction engine has been integrated with Clementine, so the text and the data tables can be used together in the total data mining process.

Dr. Bremer uses both data mining and text mining to classify genes using artificial neural network modeling that determines associations between a given tumor’s genes. Additionally, he is inspecting this data by creating a decision matrix, which can identify what genes may be related to a particular type of cancer.

With the recent high-profile successes resulting from the Human Genome Project, the attention being paid to bioinformatics is perhaps surpassed only by the voluminous amounts of data being generated by the Project and associated research. And, it is because of these two factors -- the rapidly increasing interest in bioinformatics and the large volumes of data -- that research facilities and makers of advanced analytical software are forming pioneering relationships.

Bioinformatics is still an emerging market and only recently have predictive analytics, such as data mining and text mining, been introduced as microanalysis tools. Speed and accuracy are critical to the success of medical research. The sooner a scientist can come to a solution or finding that will lead to improving the quality or longevity of lives, the better society will be.

In the case of Dr. Bremer’s research at Children’s Memorial Hospital, the opportunity to save the lives of children suffering from brain tumors or improve their quality of life is an exceptional one. With the use of data mining, text mining and other technologies, Dr. Bremer has saved precious time and increased the accuracy and quality of his research. Dr. Bremer combined his own data with that of the publicly available data set, resulting in a total of 133 tumor samples from the six major pediatric brain tumor types. These tumors were successfully classified with greater than 95 percent accuracy. These classifications will help physicians more precisely diagnose pediatric brain tumors and prescribe the most effective treatment.

Technology has also enabled Dr. Bremer to combine his data with data from other researchers. To this end, he plans to deploy his data mining results via a Web server so that any pediatric brain tumor researcher can submit files from their own microarray experiments for analysis, and then receive predictive responses. In turn, the added data can be used to update and add value to the original data set; the more samples his database contains, the more accurate it will be and the more lives that will be saved.

The original goal of this project was to develop a molecular classification scheme for the major subtypes of pediatric brain tumors. In that sense Dr. Bremer and his team at Children’s Memorial Hospital achieved their goals, and it is operational. They have a research tool that can predict tumor class based on gene expression.

In addition, Dr. Bremer believes that the project exceeded their goals in a couple of key areas. They are now using multiple neural networks to predict class, and each network gets one vote. Currently the final prediction is based on 264 individual artificial neural networks (ANNs). ANN is Dr. Bremer's preferred machine learning algorithm on the Clementine workbench, instead of the classification and regression trees or rule induction analytic methods many in biology and chemistry prefer. Having this many models is like having multiple neuropathologists examine the tumor. The multiple ANNs enable Dr. Bremer and his team to assign a confidence value to their class prediction. Having both optimized and non-optimal neural nets voting also makes the predictor sensitive to changes during the course of an individual patient’s disease. For example, one patient they analyzed had four repeat surgeries. The first three showed a very classic ependymoma tumor. The tumor removed in the fourth surgery was still classified as an ependymoma, but with a lower confidence: some of the ANNs were voting for more aggressive glial tumors. After this, the regrowth of this patient’s tumor behaved more like an aggressive glial tumor. The research team exceeded expectations by being able to follow disease progression in an individual patient.
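
The one-vote-per-network scheme can be sketched as a simple majority vote in which the confidence value is the share of networks that agree with the winning class. This is an assumed reading of how the confidence is derived, since the source does not give the formula. The vote tallies below are invented, and only the total of 264 networks comes from the text.

```python
from collections import Counter


def vote(predictions):
    """Return (winning_class, confidence), where confidence is the winner's vote share."""
    if not predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(predictions)
    winner, n_votes = counts.most_common(1)[0]
    return winner, n_votes / len(predictions)


# Hypothetical tallies for 264 networks, one vote each.
early_surgery = ["ependymoma"] * 258 + ["glioma"] * 6
later_surgery = ["ependymoma"] * 170 + ["glioma"] * 94

for name, votes in [("early", early_surgery), ("later", later_surgery)]:
    tumor_class, confidence = vote(votes)
    print(f"{name}: {tumor_class} (confidence {confidence:.2f})")
```

In both cases the predicted class is the same, but the falling vote share flags the kind of shift described for the fourth surgery.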

They have also exceeded expectations in the biological information that they were able to get from the analysis. The Children's Memorial research team was able to identify a signal transduction pathway that appears to function very differently in medulloblastoma and ependymoma tumors. They are starting to be able to identify patients that have chromosomal abnormalities in their tumors (a hallmark of aggressive tumors). The team has identified a new therapeutic target for high-grade gliomas and ependymomas, and is testing this in pre-clinical studies right now. They even hope to bring this to the Food and Drug Administration (FDA) as an investigational new drug in the next couple of years.

Dr. Bremer and his team also plan to develop their classifier as an FDA-approved clinical diagnostic. They are hoping to move forward with the analysis to predict treatment choices so that the best course of treatment for the child can be started right away. They would also like to expand their analysis to adult brain tumors and eventually to other tumors and disease states.

As in any research project of this scale, there were difficulties. First, the original data sets and models were built on an older-generation gene chip (Affy – HuFL6800), which is vastly different from the current chip (Affy – U133a). Dr. Bremer and his team faced the question of whether to start over on the newer generation of chips or see if their neural net predictive models could be translated to other chips. Fortunately, they were able to translate. Dr. Bremer feels that this ability to translate speaks to the robustness of the Clementine data mining model, since it can be independent of the chip type used.

Dr. Bremer is now hoping to make the predictive process “platform” independent and thus make it accessible to a wider group of researchers (currently it is very hard to compare results from different microarray platforms).

A second problem Dr. Bremer’s team ran into relates to text mining as a way to give biological meaning to the gene lists generated. They wanted to use full text articles because a significant amount of potential information describing relationships between genes is found in the discussion, results and introduction sections of papers and is not present in abstracts or gene ontologies. However, full text articles are very time consuming to assemble. To address this particular challenge, they used software to automate the downloading of full text articles from the journals to which they subscribed or had access.

Another potential problem was the quality control of the gene array data. There are many experimental factors that can influence gene expression studies and that can lead to errors in prediction. Dr. Bremer instituted strict quality control measures so that all chips could be compared on an even basis and the differences seen were more likely due to the biological differences between tumors.