

Bioinformatics Challenge Project


The Bioinformatics Challenge Project is a two-semester team project course in which all first-year Bioinformatics PhD students conduct research with high-throughput data on problems proposed by biology and medical school faculty members. Students work with limited supervision to generate their own ideas and approaches, make functional predictions, develop computational tools, write progress reports, design validation experiments, and present their findings at our Systems Biology seminar. This year’s three Challenge Projects are described below. Copies of two posters from the projects are included as part of this summary. We consider the Challenge Project to be one of the outstanding learning opportunities for interdisciplinary research that has resulted from the IGERT grant.

Project 1: Rapid and accurate identification of pathogens in human tissue samples is a necessity, as disease-causing pathogens remain one of the greatest public health burdens worldwide. High-throughput sequencing makes it possible to investigate the microbiome of a given clinical sample. However, these samples contain a mixture of genomic sequences from various sources, which complicates the identification of pathogens. The team developed Clinical Pathoscope, a pipeline to rapidly and accurately remove host contamination, isolate viral reads, and deliver a diagnosis. To optimize the Clinical Pathoscope pipeline, data were simulated from human, bacterial, and viral genomes to create biologically realistic clinical samples that represent a diverse variety of host-pathogen landscapes. These data were then used to evaluate the accuracy, usability, and speed of multiple alignment algorithms and filtration methods. The optimal alignment algorithm and filtration method were implemented in the Clinical Pathoscope pipeline. The filtered reads were then mapped against a viral database and assigned to their genomes of origin. Unlike other methods, Clinical Pathoscope can rapidly identify multiple pathogens from mixed samples and distinguish between very closely related species with very little coverage of the genome and without the need for genome assembly. The team demonstrated its approach using sequenced nasopharyngeal aspirate samples from children with respiratory tract infections.
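The filter-then-assign idea in Project 1 can be illustrated with a minimal sketch. It is not the Clinical Pathoscope implementation: the function name `assign_reads`, the `"human"` host label, the input layout, and the best-score assignment rule are all assumptions made for illustration, and the alignment hits are invented toy values.

```python
from collections import Counter


def assign_reads(hits, host_label="human"):
    """Count reads per non-host reference genome.

    hits maps a read ID to a list of (reference_name, alignment_score)
    pairs. This layout is hypothetical, not a real aligner output format.
    """
    counts = Counter()
    for read_id, alignments in hits.items():
        # Reads with no alignment at all cannot be assigned.
        if not alignments:
            continue
        # Host filtration: drop any read that aligns to the host genome.
        if any(ref == host_label for ref, _ in alignments):
            continue
        # Assign the read to its best-scoring reference (simplifying assumption).
        best_ref, _ = max(alignments, key=lambda pair: pair[1])
        counts[best_ref] += 1
    return counts


# Invented toy hits for five reads.
toy_hits = {
    "r1": [("human", 60)],
    "r2": [("virusA", 50), ("virusB", 42)],
    "r3": [("virusA", 55)],
    "r4": [("human", 30), ("virusB", 58)],
    "r5": [],
}
print(assign_reads(toy_hits))
```

In this toy input, reads r1 and r4 are discarded as host contamination, r5 is unaligned, and the remaining two reads are counted toward the hypothetical reference `virusA`.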

Project 2: (Video included.) Diffuse large B-cell lymphoma (DLBCL) is the most common non-Hodgkin lymphoma in the United States. Forty percent of patients with DLBCL succumb to the disease, and new therapeutic approaches are needed. One such therapy is currently in clinical trials; however, the detailed biological mechanisms governing the response to this treatment in DLBCL are not well understood. Characterizing the transcriptional response to treatment is essential to understanding a drug's biological mechanisms of action. The project focused on the analysis of a large gene expression dataset consisting of a panel of DLBCL cell lines profiled at five time points after treatment. The team developed a novel time series analysis approach to quantify the dynamic evolution of gene expression, and applied it to the dataset to characterize the response to the pharmacological perturbation. The time series analysis identifies differentially expressed genes and enrichment of biologically relevant gene sets and pathways. A custom visualization tool was created to explore the various dimensions of the results at multiple levels of biological detail. The combination of the time series analysis pipeline and the visualization tool identified both novel and previously known mechanisms of action of the therapeutic treatment on DLBCL cell lines.

Project 3: The simultaneous simplicity and functional diversity of its genome make the sea urchin a valuable model organism for studying embryonic development. While previous genomic and transcriptomic studies have focused on the urchin S. purpuratus, the team’s efforts focus on another conventional urchin, L. variegatus (Lv). The team performed two major differential expression analyses of Lv. The first analyzed paired-end RNA sequencing (RNA-seq) data from Lv embryos in the late gastrula stage treated with two chemicals known to disrupt skeletal patterning, specifically searching for transcripts that responded similarly to both treatments. The second analyzed RNA-seq data from eleven developmental time points to characterize changes in the transcriptome during embryonic development, specifically searching for transcripts that were uniquely expressed in one of these stages or expressed in all stages. K-means clustering was used to find other transcripts and patterns of interest. The results of these analyses, as well as original raw sequencing and annotation data, will be made publicly available in an L. variegatus Embryonic Development Gene Expression Database (LvEDGEdb) that the team developed.
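As a rough illustration of how k-means clustering can group transcripts by expression pattern, the sketch below runs Lloyd's algorithm on toy expression profiles. The summary does not describe the team's actual features, software, or number of clusters, so the function name `kmeans`, the deterministic initialization, and the data are assumptions for illustration only.

```python
import math


def kmeans(profiles, k, iterations=20):
    """Cluster equal-length expression profiles with Lloyd's algorithm.

    The first k profiles serve as initial centroids, a deterministic
    choice made only to keep this toy example reproducible.
    """
    centroids = [list(p) for p in profiles[:k]]
    labels = [0] * len(profiles)
    for _ in range(iterations):
        # Assignment step: attach each profile to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in profiles
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Invented expression values for four hypothetical transcripts
# measured at three time points.
toy_profiles = [
    [0, 0, 10],
    [10, 0, 0],
    [0, 1, 9],
    [9, 1, 0],
]
labels, centroids = kmeans(toy_profiles, k=2)
print(labels, centroids)
```

Here the two transcripts whose expression peaks at the last time point fall into one cluster and the two that peak at the first time point fall into the other. A real analysis would typically normalize expression values and use a randomized or k-means++ initialization.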

Address Goals

This activity primarily addresses the goal of cultivating an outstanding scientific workforce. It emphasizes the goals of developing strong interdisciplinary research collaborations among our students and faculty and encourages creativity, independence, and quality in student research. With respect to the secondary goal of advancing the frontier of knowledge, the work started by the students in the Challenge Projects will be carried forward and ultimately advance our knowledge in these diverse areas.