| Clemson Community Access to CCMS Resources |
| Active Projects |
None at this time. Aris may be powered down when not in use.
| Upcoming Project Requests |
None at this time.
| Completed Projects |
Title: Solving Large Quadratic Assignment Problems
Applicant: Matthew Saltzman
Email: mjs@clemson.edu
Principal Investigator: Matthew Saltzman
Department: Mathematical Sciences
Address: Martin Hall, Box 340975
Phone:656-3185
PI Email: mjs@clemson.edu
Request Class: Research and Discovery
Request Type: CONTINUATION
Request Submit Date: 2009/05/29
Desired Begin Date: 2009-06-05
Desired End Date: 2009-08-15
Resource: CCMS
Node Hours: 1500
Nodes Required: 1
Storage: 200 RAM
Other User IDs: Peter Hahn, University of Pennsylvania
STATUS
Description: The quadratic assignment problem (QAP) is a well-known optimization problem with applications in areas such as facility layout, VLSI chip design, routing, and others. The problem is a member of the broader class of nonlinear integer programming problems, and is notorious for its difficulty--the largest instance in a well-known test set that has been solved to date involves matching only 30 objects. An important approach to solving integer programs is to construct continuous linear relaxations and solve these to get bounds for an implicit enumeration of the solution space. One particular strategy for the construction of linear relaxations constructs a hierarchy of successively tighter representations, culminating in a description of the convex hull of feasible integer solutions. Solving this problem as a linear program would yield the optimal integer solution. Unfortunately, the size of such a representation grows exponentially with the number of objects to be assigned, so cannot be solved in practice. A compromise approach is to generate a lower-level representation and use the solution as above to bound a search of the solution space. Higher-order relaxations produce progressively tighter bounds, but require progressively more memory. The proposed project involves generating and solving higher-order relaxations than have been used previously and using the results to solve larger instances of QAP than have been solved previously. Our current software is a serial FORTRAN code that solves the specially structured linear program for this level of relaxation for QAP. We project that it will require 300 CPU hours and 200 GBytes of RAM (not disk) to solve the next problem in the standard test set. Future work involves re-implementing the code to exploit multicore parallelism in the solution of the structured linear program. That project will require a single node with multiple cores and similarly large amounts of RAM. In the longer term, the methods described here can be applied to more general discrete optimization problems.
Prior Work: The previous round of support for this project resulted in a proof of optimality for a well known 26x26 QAP "challenge problem". The results have been submitted for publication. Results of a previous request for reserve time on palmetto were inconclusive, but related work using general palmetto resources to support development and testing of the CHiPPS discrete optimization code are included in work accepted for publication in a special issue of INFORMS Journal on Computing on high-throughput computing.
Title: Tree Genomes: Translating Laboratory Data into Productive Genomic Resources
Applicant: Meg Staton
Email: staton2@clemson.edu
Principal Investigator: Albert Abbott
Department: Clemson University Genomics lnstitute
Address: 51 New Cherry Str, 304C BRC
Phone:864-656-4643
PI Email: aalbert@clemson.edu
Request Class: Research and Discovery
Request Type: NEW
Request Submit Date: 2009/05/29
Desired Begin Date: 2009-06-01
Desired End Date: not specified
Resource: CCMS
Node Hours: 1000
Nodes Required: 1
Storage: 200 GB at least
Other User IDs: Stephen Ficklin (ficklin) CUGI Chun-Huai Cheng (ccheng) CUGI Christopher Saski (saski) CUGI
STATUS
Description:
I. DNA Sequence Assembly DNA sequencing, like computer hardware, is experiencing exponential growth in the speed and quantity of data as well as plummeting prices for the production of such data. An individual large sequence center such as the Broad Institute can produce in excess of 50 billion bases of DNA sequence per year, much of which is placed into the public domain. The explosion of these “next generation” sequences into the genomics world has left bioinformaticists reeling from the need to download, store, and, most difficultly, analyze the resulting terabytes of data. Next generation sequencing produces thousands or millions of sequences in a single machine run, but the sequences range from 50 base pairs to 400 base pairs of average length. Overlap and error within these individual reads must be statistically analyzed, and then the individual reads are overlapped into longer consensus sequences representing individual transcripts (a few thousand bases) up to whole chromosomes (millions of bases). This process, known as “assembling” is computationally intensive and getting ever more so with the growing number of reads that must be simultaneously analyzed. II. CCIT/CUGI collaborations The Clemson University Genomics Institute (CUGI) purchased 9 palmetto cluster nodes and 2TB of storage in 2008 from the CCIT group and has an ongoing plan to deprecate our current computational equipment and begin buying into CCIT infrastructure over the next 4 years. Also, CUGI scientists have collaborated with Jill Gemmill on multiple grant proposals in the past. A grant currently being considered would result in the acquisition of next generation sequencing technology at CUGI. If funded, this grant includes $36,000 for computers and storage to be purchased via CCIT and utilized for bioinformatics support. II. Research Plan The CUGI bioinformatics group already has many scientists inquiring as to our ability to process large scale sequencing data. Current efforts to perform these analyses have been hampered by the fact that neither the assembly software nor the algorithms have been optimized for parallel computing. The most trusted software (for example, Arachne) runs on a single processor but requires a large quantity of RAM. We hope to utilize CCIT’s existing resources to explore the capabilities we will need to perform this type of assembly on a regular basis. We have two current and two planned projects that could benefit from computational time and potentially lead to peer-reviewed papers and future funding opportunities. Additionally, running these four jobs on a high performance machine would allow us to benchmark the amount of memory, space, and time needed for various sizes and types of projects. III. Projects a. Chestnut Transcriptome Assembly (Current) The ongoing NSF-funded project “Genomic Tool Development for the Fagaceae” has produced over 1.5 million reads of an average length of 400 base pairs covering the transcriptome of the Chestnut tree. The long-term goal of this project is to understand the underlying genetic and genomic disease resistance to the Chestnut blight in Chinese chestnut trees, and to integrate this knowledge into a breeding program to restore a resistant American chestnut to the North American ecosystem. The data is being processed and annotated through the Clemson University PI Dr. Albert Abbott and the bioinformatics team at the Clemson University Genomics Institute. The grant, written in 2006, did not anticipate the advances in DNA sequencing technology and resulting increased computational needs for processing the millions of reads generated. The grant budget allowed for a recent server purchase to fulfill the needs of the project. The Dell Quad Core Xeon 3.16 GHz, 32 Gb RAM server has proved sufficient for operating the web site for the project and processing small- and mid-sized jobs. However, the attempts to use this machine for full-scale assemblies of tree transcriptomes and genomes has resulted in an unacceptable slowing of the website due to swap memory overhead and weeks of waiting without results. b. Peach Genome Assembly (Current) Multiple Clemson researchers, including Dr. Bert Abbott, Dr. Ksenija Gasic, and Dr. Doug Bielenberg, are researching the peach tree and are very exciting to utilize the peach genome to further their molecular research. Over 2.5 million reads have been deposited into the NCBI trace archive, representing 4-5X coverage of the peach genome. However, no assembly of these reads into a true genomic sequence has been made publicly available. To further their current research and open avenues for future grant writing, all three researchers have expressed interest in obtaining an assembled version of the genome. c. Chestnut Genome Assembly (Planned) The Forest Health Initiative is planning to fund a full genome sequencing effort for the chestnut tree genome. Because their first choice assembly group (DOE’s JGI) has a 2-year waiting list, CUGI has been asked to consider trying to assemble and annotate this genome. Despite the lack of assembly experience, CUGI’s previous accomplishments in bioinformatic and genomic tool development in tree species has led to this great opportunity. With a genome of over 700Mb and many millions of reads expected to start pouring in over the next 6 months, the bioinformaticists and tree biologists involved in the project are eager to find a method for completing the project and putting Clemson University on the map as a whole genome assembly leader. d. Aquilegia Genome Assembly (Planned) A previous CUGI grant to create genomic resources in Aquilegia has led to increasing interest in this species in Clemson and across the world. Similarly to the peach effort, millions of trace files for the whole genome sequence have been submitted to the public domain, but no consensus sequence has appeared. We hope to give Clemson researchers such as Dr. Huong Luo and Dr. Chris Saski a head start on writing grants and getting involved in this exciting new genome by offering them access to the whole genome sequences.
Prior Work: CITI has supported the development and hosting of the site marinegenomics.org. Collaborations resulting from this interaction led to a software design and implementation, Tripal, integrating Drupal and Chado. It enables cyberinfrastructure for genomics/bioinformatics data and scientists. It was also developed with CCIT support and is going live with an open source license by the end of June. Due to our recent purchase of nodes and terabytes from CCIT in 2008, we have yet to publish on resources explicitly utilizing that equipment. However, the chestnut project has utilized the software and the nodes from CCIT. Several publications covering these three areas (Marine Genomics, Tripal, Chestnut) are being written. Conference talks are listed below. Margaret Staton SC Bioinformatics Symposium Columbia, SC; April 14th-15th, 2009 Presented "Tripal, Software for Genomic Cyberinfrastructure Development" Fagaceae Annual Meeting, 2009 Raleigh, NC; Apr 19th-20th Presented "Tripal and the Fagaceae Genomics Web" Plant and Animal Genome, 2009 San Diego, CA; Jan 9th – 14th Presented “Genomic Tool Development for the Fagaceae” Supercomputing 2008 Austin, TX; Nov 17th -20th Presented “Cyberinfrastructure Solutions for Genomic Web Resources” Representing South Carolina Computing Consortium Virtual Organization Meeting Clemson, SC: August 8th, 2008 Presented "CUGI Bioinformatics & Cyberinfrastructure" Stephen Ficklin GMOD Annual Meeting San Diego, CA; January 15-16th, 2009 Presented "Our Drupal/Chado Integration Efforts" Plant and Animal Genome Meeting, 2009 San Diego, CA; Jan 9th -14th, 2009 Presented "A Review Of The Marine Genomics Project, A Web-Based Genomic And Transcriptional Database"
Other Comments: Regarding the resource information section above, we are very flexible and open to ideas from CCIT's knowledgeable staff. Our best guess is that no project will run for more than two weeks if it is not using swap memory. A worst case scenario is 336 hours for each of 4 projects, so ~1344 hours. However, we may need to do some stoping and restarting along the way as we run into problems and tweaking to get the right parameters. Assuming the resource we use has access to the data in /projsmall, we can keep all files within our previously purchased storage. If we need temporary local storage, our current tests have shown an expansion of the data from a few gigabytes to 100 Gb within two or three days of starting the assembly. We expect a max of no more than 300Gb to be produced for each project.
Title: Solving Large Quadratic Assignment Problems #2
Applicant: mjs
Email: mjs@clemson.edu
Principal Investigator: Matthew Saltzman
Department: Mathematical Sciences
Address: Martin Hall, Box 340975
Phone: 656-3185
PI Email: mjs@clemson.edu
Request Class: Research and Discovery
Request Type: CONTINUATION
Request Submit Date: 2009/03/27
Desired Begin Date: 2009-04-01
Desired End Date: 2009-04-30
Resource: CCMS
Node Hours: 720
Nodes Required: 1
Storage: 200 RAM
Other User IDs: Peter Hahn, University of Pennsylvania (userid phahn2127)
STATUS
Note: Data that needs to be stored off aris
[phahn2127@ccmslogin ~/qapprob]$ du level3_root
15883330 level3_root/size25/nug25_with_SA
449 level3_root/size25/test
15883452 level3_root/size25/nug25
14 level3_root/size25/chr25a/with_sim_anneal
19 level3_root/size25/chr25a/wo_sim_aneal
40 level3_root/size25/chr25a
15883409 level3_root/size25/tai25a
183802258 level3_root/size25
102 level3_root/size26
10 level3_root/size12/makedir
260 level3_root/size12
183802622 level3_root
Description: The first allocation achieved a substantial closing of the gap between the
best known lower bound and the best known upper bound. However, the computation did not complete.
At the current rate of closure, we estimate that an additional 30 CPU days will be needed to
achieve the best possible lower bound.
Prior Work: See above for results of part 1 of this request.
Title: Solving Large Quadratic Assignment Problems #1
Applicant: mjs
Email: mjs@clemson.edu
Principal Investigator: Matthew Saltzman
Department: Mathematical Sciences
Address: Martin Hall, Box 340975
Phone: 656-3185
PI Email: mjs@clemson.edu
Request Class: Research and Discovery
Request Type: NEW
Request Submit Date: 2009/02/03
Desired Begin Date: 2009/02/05
Desired End Date: 2009/06/30
Resource: CCMS
Node Hours: 300
Nodes Required: 1
Storage: 200 RAM
Other User IDs: Peter Hahn, University of Pennsylvania, hahn@seas.upenn.edu
STATUS
Description: The quadratic assignment problem (QAP) is a well-known optimization problem with applications in areas such as facility layout, VLSI chip design, routing, and others. The problem is a member of the broader class of nonlinear integer programming problems, and is notorious for its difficulty--the largest instance in a well-known test set that has been solved to date involves matching only 30 objects. An important approach to solving integer programs is to construct continuous linear relaxations and solve these to get bounds for an implicit enumeration of the solution space. One particular strategy for the construction of linear relaxations constructs a hierarchy of successively tighter representations, culminating in a description of the convex hull of feasible integer solutions. Solving this problem as a linear program would yield the optimal integer solution. Unfortunately, the size of such a representation grows exponentially with the number of objects to be assigned, so cannot be solved in practice. A compromise approach is to generate a lower-level representation and use the solution as above to bound a search of the solution space. Higher-order relaxations produce progressively tighter bounds, but require progressively more memory. The proposed project involves generating and solving higher-order relaxations than have been used previously and using the results to solve larger instances of QAP than have been solved previously. Our current software is a serial FORTRAN code that solves the specially structured linear program for this level of relaxation for QAP. We project that it will require 300 CPU hours and 200 GBytes of RAM (not disk) to solve the next problem in the standard test set. Future work involves re-implementing the code to exploit multicore parallelism in the solution of the structured linear program. That project will require a single node with multiple cores and similarly large amounts of RAM. In the longer term, the methods described here can be applied to more general discrete optimization problems.