This first item is for those of you who lust after a Lisp workstation. You know who you are. Others can skip ahead to this month's excursion into superlinear speedup in parallel algorithms or my radical proposal regarding the teaching of programming.
Early this year, Apple acquired the assets of Coral Software, a Cambridge, Mass., software company specializing in programming languages and artificial intelligence tools. Five Coral engineers have been installed as the core of a Cambridge-based research lab for Larry Tesler's Advanced Technology Group.
Apart from the questions that this acquisition raises or puts to rest regarding the kinds of software that Apple finds it appropriate to sell, it focuses attention on the Coral's product line, and particularly on its Allegro Common Lisp. Coral has been selling a Lisp for the Macintosh since early on, and currently has three language products: Allegro Common Lisp, the entry-level Pearl Lisp, and Object Logo. Although Logo is often regarded as a language for children, Object Logo reputedly offers nearly complete access to the toolbox and is surprisingly powerful. But it is Allegro Common Lisp that Apple is initially distributing through the recently-ingested APDA.
Franz Inc. contributed to the original development of Allegro Common Lisp, and will continue to offer a version of the product for the Mac running under A/UX. Because Franz is the company supplying the Lisp implementation being bundled with the NeXT computer, one expects smooth portage of Lisp between these environments. Because Mathematica is also available on both machines, it looks as though certain kinds of scientific programming could live comfortably on either machine. Common Lisp itself was designed for portability, as Guy Steele points out in The Book, Common Lisp: The Language.
I have discussed superlinearity in this column before (August 1988). A recent paper by V. Nageshwara Rao and Vipin Kumar casts more light on the phenomenon.
In a parallel architecture, you hope to get greater processing speed from adding more processors, distributing a problem across them so that they can attack it in parallel. The more processors, the greater the speed, it is fair to expect. If you can reduce sequential running time by a factor of n through using n processors, you are getting a linear speedup, and that's very good.
Linearity is, in fact, a kind of logical limit on the speedup you can expect from parallelizing a problem. It's clear why this should be so: If a parallel algorithm can solve a problem in n seconds using m processors, there is a sequential algorithm that can solve the problem in m*n seconds, namely, the algorithm that consists of the m components of the parallel algorithm, executed sequentially by the single processor. In practice, full linearity is rarely achieved, because there is always some overhead involved in distributing the problem to the processors and combining their outputs.
According to this logic, superlinearity ought not to be possible. If you can speed up an algorithm by a factor greater than m by parallelizing it across m processors, it would seem that you just didn't have the fastest sequential algorithm to begin with. And it would seem that it would be trivial to construct a better sequential algorithm from the parallel one, as described in the preceding paragraph.
That's why the existence of superlinear speedup in a number of parallel algorithms is surprising.
Typically, when apparent superlinearity is encountered in the parallelization of an algorithm, one of two situations exists. In one situation, the original sequential algorithm can be shown to have been improved in parallelizing it, so that resequentializing it yields a better sequential algorithm and eliminates the superlinearity. In the other situation, the speedup is superlinear in isolated cases, but still linear or sublinear on the average. The former is useful if it really leads to improving the sequential algorithm, and this has occurred.
The latter is not uncommon, and is real superlinearity, in a sense. This kind of superlinearity is not so surprising once you examine it. After all, it's like a local decrease in entropy: it's more than offset by increases elsewhere in the system. And while this isolated-case superlinearity may be useful (especially if you have some knowledge of the kinds of cases you are likely to encounter in practice), it is more than offset in the long run by the sublinear cases, resulting in a sublinear average speedup, which is what we are led to expect by the logical argument against superlinearity.
Unfortunately for the argument, there are cases in which it is possible to achieve an average speedup that is superlinear.
Rao and Kumar have found, for state-space search problems, superlinearity on the order of a 17-fold speedup with nine processors, on the average.
The kind of problem to which Rao and Kumar address themselves fits a simple model of depth-first search of a complete binary tree of m leaf nodes, with an equal static partitioning of the tree among n processors.
Depth-first search is a fundamental AI algorithm, used when the problem can be cast as "the search for a path in a directed graph from an initial node to a goal node.... [It] is used quite frequently if good problem-specific evaluation functions are not available." (I'm quoting Stuart Shapiro from page 1004 of his Encyclopedia of Artificial Intelligence, Volume 2, John Wiley & Sons, 1987.) The goal is to find a solution node in a tree, and the basic algorithm, adapted from Shapiro, looks like this:
That is, the algorithm searches all the way down the leftmost branch, then backs up to the first unsearched branch and searches it, until it finds a solution node or runs out of tree.
Solution nodes, in Rao and Kumar's model, are uniformly distributed in a randomly located region, possibly over the whole tree (solutions live at the leaf nodes). The assumption regarding the distribution of solutions is a good one for problems for which no good heuristic exists. (If there is a good heuristic, you won't even be able to get linearity, because the heuristic will reorder the branches, driving solutions to the left part of the tree, where a sequential depth-first algorithm will find them quickly.) The algorithm (sequential or parallel) stops when it finds the first solution.
The parallel algorithm Rao and Kumar use slices the tree vertically into n subtrees, and runs the sequential algorithm on each, in parallel on n processors.
In this model, parallel depth-first search is essentially linear if there is only one solution to find, and also linear if there are many solutions distributed uniformly over the entire tree (more precisely, over the entire set of leaf nodes).
Superlinearity enters the picture when there are several solutions and the (unknown) region in which they lie is smaller than the entire tree. Maximal superlinearity occurs when the region in which solutions lie is equal to the size of the partition of the search space allocated to a processor; i.e., m/n leaf nodes. If s represents the number of solutions, the maximum efficiency, or degree of superlinearity, is given by:
or for a large number of processors, essentially
Theoretically, that can be a very large number, but even the more modest 17/9 speedup the authors actually found is impressive.
This superlinearity seems to be the result of the imbalance in the solution density over the entire solution space, which lets one of the processors get a sizable chunk of the region in which the solutions lie. This gives that processor an excellent chance to find a solution, and, because the first solution found ends the search, this gives the parallel algorithm a superlinear edge over the identical sequential algorithm.
Rao and Kumar think that there ought to be many real-world cases in which this kind of superlinearity can be expected. There are many problems in which a "good" non-goal node often leads to many solutions, while "bad" non-goal nodes lead to no solutions. The solutions headed by a "good" node will lie close together, leading to the kind of restricted solution region the authors have modelled. They cite the Eight Queens problem as an example, and their experiments focused on the 15 Puzzle and the Hacker's problem.
Personally, I still think superlinearity looks like magic. I can't help a touch of skepticism about these results. The superlinearity effect Rao and Kumar describe rests on certain information about the distribution of solutions, namely that the solutions are clustered in a region smaller than the entire search space. Because I don't see that information being used by the sequential depth-first algorithm, I can't help wondering if the deck isn't stacked against it, and if there isn't some sequential algorithm that incorporates this information and brings depth-first search back to linear reality. I dunno.
I'd like to advance a radical proposal: That programming paradigms be used as an entry into the study of computer science and programming. I'd like to see programming paradigms studied in interdisciplinary courses that serve as the student's first and only course in computer science. I'd also like to see the study of paradigms introduced into the core computer science curriculum at the very beginning of the computer science student's coursework.
I'm not saying that I don't believe there are any prerequisites for the study of computer science. The student needs some elementary mathematical knowledge that any eighth grader ought to have, but that colleges often have to provide. And a computer science program within an EE department will naturally have elementary electronics and computer hardware prerequisites. For such a program, that's certainly the right way to begin. But there is always a first course in programming, and that's the course I have in mind. That course marks the student's first practical experience with controlling the machine.
I don't want to overstate the case. Those of us who learned Fortran or Basic during that impressionable phase were probably not ruined forever by the experience. But when we learned to program in Fortran or Basic, we were also learning that what we were doing was programming. Our later experiences with such alternative paradigms as functional programming, object-oriented programming, or logic programming were confusing and frustrating, and at first we found ourselves resisting the experience. We were unlearning what programming is, and learning a new, broader idea of programming that encompassed the new paradigm. We thought that programming involved implementing good algorithms to solve problems, but discovered that sometimes it's selecting representative classes to model the system. We thought that programming was all about writing code to manipulate data, but we had to accept that sometimes the distinction between code and data just got in the way. Well, I don't want ever to find myself resisting learning something new; I want to be open to new ideas. I believe that learning about programming in the usual way leads to such resistance, by implanting fixed notions about what programming is, notions that are just plain wrong.
What would I cover in this imaginary Computer Science 101? I can only list some elements that seem appropriate; I don't have a syllabus. I'd present many example programs for the students to read and study; I'd ask the students to interpret the code, explaining what they think it does. There would be no keying and running of programs. I'd present alternative paradigms, identified as such, and get the students to see the differences in approach of the different paradigms. Ideally, by the end of the course, the students would have some ability to criticize code in several languages from the point of view of clarity and readability (though not efficiency). I could see the course being titled Computer Programming for Reading Knowledge.
There are several obvious arguments against using such a course in programming paradigms as the introduction to programming.
One runs like this: 1. The differences among programming paradigms only make sense after you've acquired one paradigm. Students needs at least one concrete example of what programming is before they can deal with abstractions about what programming can be. Programming is an abstract activity anyway, and it will confuse students to introduce it at an abstract level, offering alternative visions of the activity from the start.
Or there's this argument: 2. A course such as I describe will leave the student still unable to write a computer program. What I propose is tantamount to requiring a course in linguistics as prerequisite to learning a foreign language. Learning a new language --natural or computer --is difficult, and the student ought to have a chance to come out of the first encounter with a feeling of success.
Or this: 3. Students of programming may need help learning the first language, but after that, they can teach themselves. We need programmers. Many students will take no course after the first. In the traditional scheme, these people will be equipped to go forth and program, despite the fact that they'll have much to learn on the job. Students who have been subjected to my proposed Computer Science 101 will not be so equipped.
I have an answer for each.
Weeks of student time in introductory programming courses is taken up in identifying and correcting syntax errors, honing typing skills, letting the problems that happen to arise dictate what the student happens to learn. These things have to be dealt with eventually if the student is to learn to write correct programs, but they are getting in the way of getting the idea. Those who remember that far back should know that first programming courses don't give a sense of success; they give a sense of accomplishment, acquired from finally breaking through the brick wall you'd been beating your head against. I propose sending the student to the wall equipped with an intellectual hammer.
V. Nageshwara Rao and Vipin Kumar, "Superlinear Speedup in Parallel State-Space Search," in Foundations of Software Technology and Theoretical Computer Science, Nori, K.V. and Kumar, S, eds., Springer-Verlag, 1988.
Shapiro, Stuart, Encyclopedia of Artificial Intelligence, Volume 2, John Wiley & Sons, 1987.
Steele, Guy, Jr., Common Lisp: The Language, Digital Press, 1984.
Swaine, Michael, "Programming Paradigms," Dr. Dobb's Journal, (August, 1988).
Copyright © 1989, Dr. Dobb's JournalSuperlinearity Revisited
Put the initial node on a list called OPEN.
While OPEN is nonempty:
Remove the first node from OPEN and call it n.
If n is a solution, return n and quit.
Generate the immediate successors of n.
Put them at the beginning of OPEN.
End while.
(s+1)/2-(s-1)/(2*n),
(s+1)/2.
Paradigms for Beginners
A Bad Introduction May Mark the Student For Life
1. I don't buy this. I've never had a college physics course, yet I have read and understood, for recreation, dozens of books on relativity, cosmology, and quantum physics. I think I have a sense of the issues, a feeling for what it is like to work in the fields, and if I were to go back to school to study physics, I think I could choose a field of study more wisely for having read these books. While there are difficult abstractions in programming, the basic concepts of computer programming itself are nowhere near as ethereal as those of quantum physics. People who know nothing about programming think that it means writing instructions to make the computer do what they want it to do, and that's a workable first approximation.
2. I guess I don't hold the ability to write syntactically-correct Basic code to be at college-level skill. I also don't think that it is a particularly important skill for the great majority of people who have some interest in computer science, but who will never become professional software developers. The ability to read a block of code is a very different skill, is useful to more people, and is a skill that professional programmers need but are never explicitly taught. I do believe that it is possible to teach a basic reading knowledge of several programming languages, exemplifying radically different paradigms within the course of a college term. I don't mean proofreading or finding errors in code; I mean discerning what a correct and well-written page of code does. My CS 101 would not be like a linguistics course; it would be a course in what computer programming is. There's no analog in natural languages because we don't need a course telling us what a natural language is.
3. Students who are introduced to programming in the way I propose will have to take programming courses or deal with some frustration in learning the mechanics of programming by themselves. I think that a basic reading knowledge of several programming languages will seriously reduce the frustration of learning where to put the semicolons, but in any case, the students will be coming to the process of writing code with a broader understanding. It's possible that the approach I propose would reduce the number of poorly-educated people passing themselves off as programmers. I guess that's a drawback if you think that bad software is better than no software at all.
References