Dennis, a professor of computer science at New York University, is the author of The Puzzling Adventures of Dr. Ecco (Dover, 1998); Codes, Puzzles, and Conspiracy (W.H. Freeman & Co., 1992); Database Tuning: A Principled Approach (Prentice Hall, 1992), (coauthored with Jason Wang and Bruce Shapiro) Pattern Discovery in Biomolecular Data: Tools, Techniques, and Applications (Oxford University Press, 1999); and (coauthored with Cathy Lazere) Out of Their Minds: The Lives and Discoveries of 15 Great Computer Scientists (Springer Verlag, 1998). He can be contacted at [email protected].
He meant to look like a professorial colleague, but something about his sense of urgency suggested he was really a spy.
"We have to find the sequence of this microvirus extremely fast, Dr. Ecco," the bearded biochemist told us, never bothering to adjust the asymmetric placement of the glasses on his face. He introduced himself as Bill Smith and said he worked for 'the government.'
"I thought there were machines for that," Ecco replied.
"Well, yes, but this sequence comes to us from an agent in a certain, well, not so friendly, developing country," Smith replied. "We have only the data I'm going to give you, not the sequence itself. This virus can kill and appears to be contagious, partly because of its small size. Our agent reports many deaths at the factory producing the virus. If released, it could cause an untold number of deaths."
"Sounds bad, What do you know?" Ecco said.
"Well, we know about two of its sequences," said Smith, clearly in a rush. "To analyze them, our agent used three restriction enzymes. He then separated them using agarose gel electrophoresis..."
Smith paused and looked at our blank faces. Liane had taken to cutting a spiral in a returned homework paper.
"I see that you're not following," he said. "Let me start more slowly. It's just that there is so little time. You know that DNA can be viewed as a sequence of As, Cs, Ts, and Gs. It exists most commonly in the form of a double-helix, but because one side of the helix determines the order of the other side, we will pay attention only to one side and pretend we have just a single strand consisting of half of the helix. The sequence is read left to right from what is known as the 5' (read 5 prime) end towards the 3' end.
"Certain kinds of proteins called 'restriction enzymes' cut the sequence at particular points. In our case, we have three, which we'll call r1, r2, and r3. Restriction enzyme r1 will cut a sequence at CT and split C from T. Similarly, r2 will cut at AG and r3 will cut at TT. After one or more restriction enzymes perform cuts, the ministrands that result can be ordered by size. When they are small enough, the ministrands can be sequenced directly. Given the sizes after each cut and the sequences of the final ministrands, we must figure out the sequence of the whole strand.
"Let me give you an example. Suppose we have the sequence CATTCTGTATA. If we cut it with r1 alone, we will split the sequence into CATTC and TGTATA; sizes 5 and 6. If we cut with r2 alone, no split will occur. If we cut with r3 alone, we will get CAT and TCTGTATA, of sizes 3 and 8. If we cut with both r1 and r3, we will get: CAT, TC, TGTATA.
"Of course, if the agent's data told us the sequences and their orders, our problem would be over. Unfortunately, all we know is their sizes and the contents of the final arrangement. So, revisiting the example, imagine that I told you only the following: Splitting with r1 gave sizes 5, 6; splitting with r2 gives no split splitting with r3 gives sizes 3 and 8 splitting with r1, r2, and r3 gives three cuts (which I'm deliberately writing in order of size, because that's the order in which all these results are given): TC, CAT, TGTATA. What would you do?"
Liane had stopped cutting spirals and was listening intently.
Smith continued, "To infer the order, note that none of our enzymes cuts at AT, AC, or CC junctions. So, TC can bind only to TGTATA (which is also consistent with the fact that r3 splits the sequence into lengths 3 and 8). Further, TGTATA must be on the right because it cannot bind to anything. This gives us a single possible ordering CAT, TC, TGTATA."
"Okay, that's clear," 12-year old Liane said, eyes sparkling. "Tell us what we need to know for your super-dangerous microvirus."
"Here is the data about the smaller sequence," Smith said. "The restriction enzymes cut as stated before (r1 cuts at CT, r2 cuts at AG, and r3 cuts at TT).
What Sizes of Cuts
Cuts (in order of size)
r1 2 2 2 7 9 11 17
r2 2 4 4 11 29
r3 1 49
r1 r2 2 2 2 2 2 2 4 7 7 9 11
r1 r3 1 2 2 2 6 9 11 17
r2 r3 1 2 4 4 10 29
"Finally, the last thing we know is the collection of sequences in order of size after being cut by all three restriction enzymes:
T
TC
TC
TC
TA
GA
GC
GATA
TCCATC
GTCCGTC
TCACACGGC
TCGCACACGGA
"Our question to you now is, 'What is the entire sequence in its natural order?' If you can't be completely precise, we understand, but the more you can tell us, the easier it will be for us to find an anti-body."
"What do the repeats, like the ones for TC, mean?" Liane asked.
"According to our agent, they mean that TC appears three separate times in the sequence," Smith replied.
Reader: Give it a try. Liane started out using scissors to cut out the subsequences after being cut by all three restriction enzymes.
"That wasn't too hard," Liane said after 20 minutes, handing her solution to Smith. "Give us the other."
"I had heard that you were clever," Smith said after verifying the result. "Now I know it's true.
"The second sequence is quite a bit longer:"
What Sizes of
Cuts Cuts
r1 6 12 21 29 39
r2 3 4 6 11 11 12 13 19 28
r3 2 5 6 6 7 10 13 16 19 23
r1 r2 2 2 3 4 5 6 7 11 11 12 12 13 19
r1 r3 1 1 2 5 5 6 6 7 9 9 13 14 14 15
r2 r3 2 2 2 2 3 3 3 4 4 6 6 7 8 8 10 11 12 14
"The result of all three restriction enzymes gives cuts that look like this:
T
T
GT
GT
GT
GC
TA
TT
GGA
GGT
TTA
TACA
TGGAA
TGGTGT
GGCGGA
GGACGTC
TCGTATA
TATGCGAA
TTACATGTC
TATCGCGAC
TACGGCCCCGA
GCCGCGATCCAT
Reader: Try to find this longer sequence. If there are ambiguities in the order, say so.
Last Month's Solution
Imagine that each person maintains 100 buckets, one for each number of centimillions between $100 million and $10 billion. He or she will receive numbers from other people and put them into buckets. Here's how.
Each person gives 100 numbers between 0 and 20 inclusive to each other person. The values n1ij, n2ij, ... , n100ij are the 100 numbers i gives to j. n1ij goes into bucket 1 at j's site, n2ij goes into bucket 2 at j's site, and so on. Each person receives 100 numbers from himself too. If person i has M centimillion dollars, he will make the sum over j of nMij be 1 mod 21. Thus, the contribution that i makes to bucket M at all sites sums to 1 mod 21. For all L unequal to M, i will make the sum over j of nLij be 0 mod 21. Each person j will then sum the numbers in bucket 1 mod 21, sum the numbers in bucket 2 and then take the modulus of 21, and so on. A further sum will be made for all the totals of bucket 1 at all sites and again a modulus of 21 operation will be applied to the result. Similarly for buckets 2 through 100. This second sum can all be done at a single site, assuming honesty.
Imagine that the total for a bucket q is 1. That means that exactly one person has q centimillion dollars, since all other contributions will cancel out. If the total is 8, say, then 8 people have q centimillion dollars.
There is no possibility of collusion unless everyone gangs up on you, because that is what it takes for them to infer the contribution you have made to each bucket.
Reader Notes
For a game with such "Simple" (DDJ, March 2000) rules, only a few hardy readers attempted a solution. The two-player game was this: The first player places Xs and the second player places Os on an infinite board. The goal is to get four in a row either vertically or horizontally (diagonally doesn't count). The first player has a clear advantage. The question is: If we make it a rule that the second player wins if he can prevent the first one from winning in 10 moves, then does either side have a winning strategy?
Greg Smith was the first to show that the first player could win and Tomas G. Rokicki was the first to show that eight moves was required. As of this writing it is open whether the first player can win if the first player must always place an X next to an existing X or existing O, except on the very first move.
Ted Alper showed that there is no winning strategy for five in a row. He used a clever tessellation:
A B C C
A B D D
E E G H
F F G H
"That is, two adjacent squares are adjoined to form a 'domino'...and two adjacent dominos share the same orientation (whether horizontal or vertical), but the next two dominos in any direction from them will have the OTHER orientation. Any winning position of length 5 must have 2 squares from the same domino. When the first player takes any square, the O player takes the remaining square," thus ensuring eternal frustration to the X player. Other innovative solutions came from Peter Ivanyi and Robert Morrison.
DDJ