| Kestrel Navigation |
|---|
| Background | Papers | Kestrel Images | People | Home |
Architects and industry have been searching for the next durable computational model, the next step beyond the standard CPU. Graphics co-processors, though ubiquitous and powerful, can only be effectively used on a limited range of stream-based applications. The UCSC Kestrel parallel processor is part of a continuum of parallel processing architectures, stretching from the application-specific through the application-specialized to the application-unspecific. Kestrel combines an ALU, multiplier, and local memory, with Systolic Shared Registers for seamless merging of communication and computation, and an innovative condition stack for rapid conditionals. The result has been a readily programmable and efficient co-processor for a wide range of applications, including biological sequence analysis, image processing, and irregular problems. Experience with Kestrel indicates that programmable systolic processing, and its natural combination with the Single Instruction-Multiple Data (SIMD) parallel architecture, is the most powerful, flexible, and power-efficient computational model available for large group of applications. Unlike other approaches that try to displace or replace the standard serial processor, our model recognizes that the expansion in the application landscape and performance requirements simply imply that the most efficient solution is the combination of more than one type of processor. We propose a model in which the CPU and the GPU are complemented by ``the third big chip'', a massively-parallel SIMD processor.
The UCSC Kestrel parallel processor is part of an evolution from application-specific to specialized to applicationunspecific processing. Kestrel combines an ALU, multiplier, and local memory, with Systolic Shared Registers for seamless merging of communication and computation, and an innovative condition stack for rapid conditionals. The result has been a readily programmable and efficient co-processor for many applications. Experience with Kestrel indicates that programmable systolic processing, and its natural combination with the Single Instruction-Multiple Data (SIMD) parallel architecture, will be an effective design choice for years to come.
Background: Viral capsids, bacterial toxins, alkaline solutions, and other toxins lyse cells by membrane poration driven colloid-osmotic lysis, which has been studied in detail by averaging over cell populations. However, no quantitative data is available for the changes in the shape and size of individual cells during membrane poration driven colloid-osmotic lysis, or for the relationship between cell shape and fragility.
Methods: Hydroxide, a porating agent, was generated in a microfluidic enclosure containing red blood cells (RBCs) in suspension. Automatic cell recognition, tracking and morphometric measurements were done using a custom image analysis program. Cell area and circular shape factor (CSF) were measured over time for individual cells. Implementations were developed both in MATLAB and on Kestrel, a parallel computer which affords higher speed, approaching real time processing.
Results: The average CSF goes through a first period of fast increase, corresponding to the conversion of discocytes to spherocytes under internal osmotic pressure, followed by another period of slow increase until the fast lysis event. For individual cells, the minimum CSF is shown to be inversely correlated with cell lifetime (linear regression factor R = 0.44), with discocytes surviving longer than spherocytes. The inflated cell volume to surface area ratio is also positively correlated with lifetime (R = 0.43), while it is uncorrelated with the CSF. Lifetime is best correlated to the ratio of cell inflation volume (Vfinal - Vinitial) to surface area (R = 0.65).
Conclusions: RBCs are shown to inflate at a rate proportional to their surface area, in agreement with a constant flux model, and lyse after attaining a spherical morphology. Spherical RBCs display increased alkaline hemolysis fragility (shorter lifetimes), providing an explanation for the increased osmotic fragility of RBCs from patients suffering from spherocytosis.
Hopfield neural networks are often used to solve difficult combinatorial optimization problems. Multiple restarts versions find better solutions but are slow on serial computers. Here, we study two parallel implementations on SIMD computers of multiple restarts Hopfield networks for solving the maximum clique problem.
The first one is a fine-grained implementation on the Kestrel Parallel Processor, a linear SIMD array designed and built the University of California, Santa Cruz. The second one is an implementation on the MasPar MP-2 according to the ``SIMD Phase Programming Model'', a new method to solve asynchronous, irregular problems on SIMD machines. We find that the neural networks map well to the parallel architectures and afford substantial speedups with respect to the serial program, without sacrificing solution quality.
The architectural landscape of high-performance computing stretches from superscalar uniprocessor to explicitly parallel systems to dedicated hardware implementations of algorithms. Single purpose hardware can achieve the highest performance and uniprocessors can be the most programmable. Between these extremes, programmable and reconfigurable architectures provide a wide range of choice in flexibility, programmability, computational density, and performance.
The UCSC Kestrel parallel processor strives to attain single-purpose performance while maintaining user programmability. Kestrel is a single-instruction stream, multiple-data stream (SIMD) parallel processor with a 512-element linear array of 8-bit processing elements. The system design focusses on efficient high-throughput DNA and protein sequence analysis, but its programmability enables high performance on computational chemistry, image processing, machine learning, and other applications.
The Kestrel system has had unexpected longevity in its utility due to a careful design and analysis process. Experience with the system leads to the conclusion that programmable SIMD architectures can excel in both programmability and performance. This paper presents the architecture, implementation, applications, and observations of the Kestrel project at the University of California at Santa Cruz.
This paper presents a novel heuristic for graph coloring that works on a range of colors and iteratively tries to make this range more compact. This range-compaction heuristic also has a ``pressure'' component and an annealing schedule for it. The value of this component is empirically quantified. This algorithm is evaluated on a wide range of DIMACS benchmark graphs, and found to be competitive with state-of-the-art algorithms in terms of solution quality and run time.
We describe an approach to optimization based on a multiple-restart quasi-Hopfield network where the only problem-specific knowledge is embedded in the energy function that the algorithm tries to minimize. We apply this method to three different variants of the graph coloring problem: the minimum coloring problem, the spanning subgraph k-coloring problem, and the induced subgraph k-coloring problem.
Though Hopfield networks have been applied in the past to the minimum coloring problem, our encoding is more natural and compact than almost all previous ones. In particular we use k-state neurons while almost all previous approaches use binary neurons. This reduces the number of connections in the network from $(Nk)^2$ to $N^2$ asymptotically and also circumvents a problem in earlier approaches, that of multiple colors being assigned to a single vertex. Experimental results show that our approach compares favorably with other algorithms, even non-neural ones specifically developed for the graph coloring problem.
Computer aided sequence analysis is a critical aspect of current biological research. Sequence information from the genome sequencing projects fills databases so quickly that humans cannot examine it all. Hence there is a heavy reliance on computer algorithms to point out the few important nuggets for human examination. Sequence search algorithms range from simple to complex, as does the representation of the biological data. Typically though, simple algorithms are used on the simplest of data representations because of the large computational demands of anything more complex. This leads to missed hits because the simple search techniques are often not sufficiently sensitive.
Here we describe the implementation of several sensitive sequence analysis algorithms on the Kestrel parallel processor, a single-instruction multiple-data (SIMD) processor developed and built at UCSC. Performance of the Smith-Waterman and Hidden Markov Model algorithms, with both Viterbi and Expectation Maximization methods ranges from 6 to 20 times faster than standard computers.
In combinatorial library design and use, the conformation space of molecules can be represented using three-dimensional (3-D) pharmacophores. For large libraries of flexible molecules, the calculation of these 3-D pharmacophoric fingerprints can require examination of trillions of pharmacophores, presenting a significant practical challenge. Here we describe the mapping of this problem to the UCSC Kestrel parallel processor, a single-instruction multiple-data (SIMD) processor. Data parallelism is achieved by simultaneous processing of multiple conformations and by careful representation of the fingerprint structure in the array. The resulting application achieved a 35+ speedup over an SGI 2000 processor on the prototype Kestrel board.
Hopfield neural networks are often used to solve difficult combinatorial optimization problems. Multiple restarts versions find better solutions but are slow on serial computers. Here, we study two parallel implementations on SIMD computers of multiple restarts Hopfield networks for solving the maximum clique problem. The first one is a fine-grained implementation on the Kestrel Parallel Processor, a linear SIMD array at the University of California, Santa Cruz. The second one is an implementation on the MasPar MP-2 according to the ``SIMD Phase Programming Model'', a new method to solve asynchronous, irregular problems on SIMD machines. We find that the neural networks map well to the parallel architectures and afford substantial speedups with respect to the serial program, without sacrificing solution quality.
This paper presents the "SIMD Phase Programming Model", a simple approach to solving asynchronous, irregular problems on massively parallel SIMD computers. The novelty of this model consists of a simple, clear method on how to turn a general serial program into an explicitly parallel one for a SIMD machine, transferring a portion of the flow control into the single PEs. Three case studies (the Mandelbrot Set, the N-Queen problem, and a Hopfield neural network that approximates the maximum clique in a graph) will be presented, implemented on two different SIMD computers (the UCSC Kestrel and the MasPar MP-2). Our results so far show good performance with respect to conventional serial CPU computing time and in terms of the high parallel speedup and efficiency achieved.
Dynamic programming is the core algorithm of sequence comparison, alignment and linear hidden Markov model (HMM) training. For a pair of sequence lengths m and n, the problem can be solved readily in O(mn) time and O(mn) space. The checkpoint algorithm introduced by Grice, Hughey, and Speck (1997) runs in O(Lmn) time and O(Lmn^(1/L)) space, where L is a positive integer determined by m, n, and the amount of available workspace. The algorithm is appropriate for many string comparison problems, including all-paths and single-best-path hidden Markov model training, and is readily parallelizable. The checkpoint algorithm has a diagonal version that can solve the single-best-path alignment problem in O(mn) time and O(m+n) space.
In this work, we improve performance by analyzing optimal checkpoint placement. The improved row checkpoint algorithm performs up to one half the computation of the original algorithm. The improved diagonal checkpoint algorithm performs up to 35\% fewer computational steps than the original. We modified the SAM hidden Markov modeling package to use the improved row checkpoint algorithm. For a fixed sequence length, the new version is up to 33% faster for all-paths and 56% faster for single-best-path HMM training, depending on sequence length and allocated memory. Over a typical set of protein sequence lengths, the improvement is 10%.
Kestrel is a massively parallel coprocessor system designed to accelerate biological sequence analysis while providing general-purpose computing capabilities. The system, implemented on a PCI card, consists of 512 SIMD, 8-bit processing elements (PEs) connected in a linear array that communicate through shared-register banks. The PE array is implemented as eight 64-PE, 1.4 million transistor chips implemented in the HP $0.5\,\mu$m technology. An array controller provides global instruction sequencing and manages data movement between the array and host machine, and the host machine's software controls board operations and provides system access to users at remote workstations. Many people contributed to the design and implementation of the Kestrel system, so this document focuses on the author's contributions while including enough information to provide relatively complete hardware documentation. The author's contributions include the physical layout design of most of the processing element, layout simulations, Kestrel board and array controller debugging, the runtime environment's program-execution core, and integration of all the various hardware and software components to form a working system.
The UCSC Kestrel Project has built a single-board programmable parallel processor with 512 processing elements (PEs). The fully operational system features a dense 1.4 million transistor chip composed of 64 byte-wide PEs, each with its own memory. The system architecture was designed specifically to support a large variety of computational biology algorithms, for which programmability was essential, but large memories and arbitrary communication networks were not. The Kestrel system performs Smith \& Waterman and hidden Markov model biological sequence searches 20 times faster than a 433-MHz DEC Alpha, and is 35 times faster than one processor of an SGI Origin 2000 on a computational chemistry application.
Kestrel is a high-performance programmable parallel co-processor. Its design is the result of examination and reexamination of algorithmic, architectural, packaging, and silicon design issues, and the interrelations between them. The final system features a linear array of 8-bit processing elements, each with local memory, an arithmetic logic unit (ALU), a multiplier, and other functional units. Sixty-four Kestrel processing elements fit in a 1.4 million transistor, 60mm^2, 0.5 micron CMOS chip with just 84 pins. The planned single-board, 8-chip system will, for some applications, provide supercomputer performance at a fraction of the cost. This paper surveys four of our applications (sequence analysis, neural networks, image compression, and floating-point arithmetic), and discusses the philosophy behind many of the design decisions. We present the processing element and system architectures, emphasizing the ALU and comparator's compact instruction encoding and design, the architecture's facility with nested conditionals, and the multiplier's flexibility in performing multiprecision operations. Finally, we discuss the implementation and performance of the Kestrel test chips.
This paper presents a study of multiprecision division on processors containing word-by-word multipliers. It compares common division algorithms by optimizing each algorithm for the software environment and then comparing their performance on simple machine models. While the study was originally motivated by floating-point division in the small-word environment, the results are extended to multiprecision floating-point and integer division in general. Two algorithms are found to be best for multiprecision division. A hybrid of the Newton-Raphson and Byte Division algorithms is optimal when the divisor is small and (to a lessor degree) for many other division problems. Restoring Division is quite competitive or optimal in other cases and is particularly easy to implement. The asymptotic costs for floating-point division of each of these algorithms is the same as that of multiprecision multiplication.
This paper significantly expands upon the results of Multiprecision division on an 8-bit processor, 13th IEEE Symposium on Computer Arithmetic (Arith-13), IEEE CS Press, pp. 74-81, 1997.
Kestrel is a programmable linear systolic array processor designed for sequence analysis. Among other features, Kestrel includes an 8-bit word, a single-cycle add-and-minimize instruction, and efficient communication using Systolic Shared Registers. This paper describes Kestrel's functional units in detail, and examines each of their effects on system performance. With prototypes currently in the works, we expect to complete a full Kestrel array, with between 512 and 1024 processing elements, in 1997.
Sequence alignment is the problem of finding the optimal character-by-character correspondence between two sequences. It can be readily solved in O(n^2) time and O(n^2) space on a serial machine, or in O(n) time with O(n) space per P=O(n) processing elements on a parallel machine. Hirschberg's divide-and-conquer approach for finding the single best path reduces space use by a factor of n while inducing only a small constant slowdown to the serial version. This paper presents a family of methods for computing sequence alignments with reduced memory that are well-suited to serial or parallel implementation. Unlike the divide and conquer approach, they can be used in the forward-backward (Baum-Welch) training of linear hidden Markov models, and they avoid data-dependent repartitioning, making them easier to parallelize. A single best path member of this algorithm family matches the quadratic time and linear space of the divide-and-conquer algorithm. Experimentally, the O(n^{1.5})-space member of the family is 15--40\% faster than the O(n)-space divide-and-conquer algorithm.
Sequence comparison, a vital research tool in computational biology, is based on a simple O(n^2) algorithm that easily maps to a linear array of processors. This paper reviews and compares commercial and academic high-performance sequence analysis on general-purpose supercomputers and single-purpose, reconfigurable, and programmable co-processors. The difficulty of comparing hardware from published performance figures is also noted.
Sequence comparison with affine gap costs is a problem that is readily parallelizable on simple single-instruction, multiple-data stream (SIMD) parallel processors using only constant space per processing element. Unfortunately, the twin problem of sequence alignment, finding the optimal character-by-character correspondence between two sequences, is more complicated. While the innovative O(n^2)-time and O(n)-space serial algorithm has been parallelized for multiple-instruction, multiple-data stream (MIMD) computers with only a communication-time slowdown, typically O(log n), it is not suitable for hardware-efficient SIMD parallel processors with only local communication. This paper proposes several methods of computing sequence alignments with limited memory per processing element. The algorithms are also well-suited to serial implementation. The simpler algorithms feature, for an arbitrary integer L, a factor of L slowdown in exchange for reducing space requirements from O(n) to O(L^(1/n)) per processing element. Using this result, we describe an O(n log n) parallel time algorithm that requires O(log n) space per processing element on O(n) SIMD processing elements with only a mesh or linear interconnection network.
Sequence comparison, a vital research tool in computational biology, is based on a simple O(n^2) algorithm that easily maps to a linear array of processors. This paper reviews and compares high-performance sequence analysis on general-purpose supercomputers and single-purpose, reconfigurable, and programmable co-processors. The difficulty of comparing hardware from published performance figures is also noted.
This paper presents an architecture for programmable systolic arrays that provides simple and efficient systolic communication. The Brown Systolic Array is a linear implementation of this Systolic Shared Register architecture; a working 470-processor prototype system performs 108 MOPS. A 32-chip, 1504-processor implementation could provide 5 GOPS of systolic co-processing power on a single board.
This paper presents the New Systolic Language as a general solution to the problem systolic programming. The language provides a simple programming interface for systolic algorithms suitable for different hardware platforms and software simulators. The New Systolic Language hides the details and potential hazards of inter-processor communication, allowing data flow only via abstract systolic data streams. Data flows and systolic cell programs for the co-processor are integrated with host functions, enabling a single file to specify a complete systolic program.
| Kestrel Home | Computational Biology | School of Engineering | UCSC |