Simon is a freelance programmer and author, whose titles include Beginning Perl (Wrox Press, 2000) and Extending and Embedding Perl (Manning Publications, 2002). He's the creator of over 30 CPAN modules and a former Parrot pumpking. Simon can be reached at [email protected].
It so often happens: you make the interesting discoveries when you're trying to get something else done. I was planning to write about a fantastic thing I'd coded up which proxied Google, presenting the results a bit more nicely, keeping a record of which results people clicked on for particular searches, doing various domain-specific disambiguation to determine whether "Jaguar" was a car, an animal or an operating system, and so on.
Unfortunately, I never got it finished before the vacation, where I don't really have enough bandwidth to get it tested. So to try to alleviate the problem, I wrote another kind of proxy: one to store every kind of request and response it sees, and then play back the responses to the same requests again later. This means that whenever I am connected to the Internet, I can throw out a few requests, collect the results in a database, and bring them back to the machine I'm developing on.
Now this might sound like what the browser's cache does, or something we could use Squid to achieve, and it is a little like that, but it has three useful properties: first, it's rude. It doesn't care about cache directives which ask the cache to fetch the page again if it has expired. Once it's stored a page, it'll give it to you again if you request the same page, no matter how old the cached version or how dynamic the page ought to be. This means it works nicely for completely disconnected operation.
Second, it stores every kind of request and response. Browser caches typically don't cache any pages where there are POST requests sending data to the server; my proxy does. Finally, it's portable. I can move a single database file around between different machines, and I have my snapshot of the Internet in my pocket.
Of course this is not only useful for the kind of development that I'm doing, but it's also useful for module testing. For instance, if you're writing a module which accesses something on the web, you might find it useful to ship a database of known-good data to test from, both so that your module can be tested in situations where the end-user is currently disconnected from the Internet and also so that, in situations such as testing an interface to a search engine, the tests can be protected from the highly dynamic and changing nature of likely result sets.
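For instance, a test script might be pointed at a local replay proxy before it talks to the outside world. The following is only a sketch of the idea, not code from this article: the proxy address and the search URL are made-up examples, and it assumes a replay proxy loaded with known-good responses is already running.

    use strict;
    use warnings;
    use Test::More tests => 1;
    use LWP::UserAgent;

    # Assumption: a replay proxy with canned responses is listening here.
    my $ua = LWP::UserAgent->new;
    $ua->proxy( http => "http://localhost:8080/" );

    my $response = $ua->get("http://www.google.com/search?q=jaguar");
    ok( $response->is_success, "got the canned search result back" );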
A Proxy Primer
Let's remind ourselves how proxies work in general, before we pick up the Perl tools to help us write our storage and replay proxies.
In the normal state of affairs, a web browser puts together a HTTP request and sends it to a web server. The server responds with a HTTP response. Both messages have headers (for instance, saying when the page was generated, what type of data it is, and so on) and a body: the contents of the web page, or any POSTed form data.
When a proxy gets involved, the browser sends the request to the proxy instead of to the remote server; the proxy might decide to respond to it itself, or it might pass on the request to the web server as before. The proxy will rewrite some of the headers, and may choose to mess with the body if it wants to. The proxy then receives the response from the server, modifies it if it wants to, and finally passes it back to the client. That might sound like a lot of work, but we have CPAN!
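To make the shape of those two messages concrete, here's a small illustration using the HTTP::Request and HTTP::Response classes that LWP (and HTTP::Proxy) build on; the URL, headers, and bodies are invented for the example.

    use HTTP::Request;
    use HTTP::Response;

    # A request: a method, a URL, some headers and (for a POST) a body.
    my $request = HTTP::Request->new(
        POST => "http://www.example.com/search",
        [ "Content-Type" => "application/x-www-form-urlencoded" ],
        "q=jaguar",
    );
    print $request->as_string;

    # A response: a status code, headers, and the page itself as the body.
    my $response = HTTP::Response->new(
        200, "OK",
        [ "Content-Type" => "text/html" ],
        "<html>...</html>",
    );
    print $response->as_string;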
There are two major ways to write web proxies in Perl using CPAN modules: first, we can use the HTTP::Proxy module, which basically does everything for us, or, if we're writing more complicated proxies, we can spin our own proxy together using POE and the POE::Component::Server::HTTP and POE::Component::Client::HTTP modules. HTTP::Proxy is much simpler, so we'll begin with that.
A Dummy Proxy
The simplest proxy is one which does nothing at all to interfere with the request/response cycle. It just passes on the request to the server, and passes the response back to the client. Such a simple proxy can be useful if, for instance, you have a network of computers which is disconnected from the Internet apart from one gateway machine. You don't want to allow complete Internet access, but you do want the computers on the network to access the web. The solution is to get the gateway to act as a web proxy. The computers on the private network connect to the gateway, and the gateway can connect to the outside world.
Here's how to write such a gateway proxy in HTTP::Proxy:
    use HTTP::Proxy;
    HTTP::Proxy->new( host => "10.0.0.2" )->start;
This will start a HTTP proxy on port 8080 on the internal IP address 10.0.0.2. It will forward HTTP connections to the relevant server on the outside world, and then pass back the response to the client.
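Both of those defaults can be overridden through the constructor; for example, to listen on a different port (3128 here is just an arbitrary choice):

    use HTTP::Proxy;

    # host is the address to listen on; port overrides the default of 8080.
    HTTP::Proxy->new( host => "10.0.0.2", port => 3128 )->start;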
HTTP::Proxy also allows us to attach filters onto the stages of operation of this basic proxy: to mess with the headers and body sent to the remote server, and to mess with the headers and body of the response from it. For instance, here's an example from the HTTP::Proxy documentation which removes various headers which might give away information about the browser:
    $proxy->push_filter(
        mime     => undef,
        request  => HTTP::Proxy::HeaderFilter::simple->new(
            sub { $_[0]->remove_header(qw( User-Agent From Referer Cookie )) },
        ),
        response => HTTP::Proxy::HeaderFilter::simple->new(
            sub { $_[0]->remove_header(qw( Set-Cookie )); },
        )
    );
This says that we want to filter all MIME types, rather than the default text/*, and that we should construct a filter to go onto the request side of proxying which removes the User-Agent, From, Referer and Cookie headers before the request goes on to the remote server, and that responses coming back from the server should have the Set-Cookie header stripped.
The Store Proxy
For the first of our matched pair of proxies, we don't want to change the request or the response, but we do want to store away the response when we see it. HTTP::Proxy generally sends data to the body filters in chunks as it arrives, but we want to wait until the full response has been received before doing anything. We do this by pushing the HTTP::Proxy::BodyFilter::complete filter onto the response stack:
    use HTTP::Proxy::BodyFilter::complete;

    $proxy->push_filter(
        mime     => undef,
        response => HTTP::Proxy::BodyFilter::complete->new,
    );
The next filter we're going to push on will serialize the request and the response.
    use DB_File;
    use Storable qw(dclone freeze);
    use HTTP::Proxy::BodyFilter::simple;

    $proxy->push_filter(
        mime     => undef,
        response => HTTP::Proxy::BodyFilter::simple->new(sub {
            return unless $proxy->response;

            # Copy the request and strip the headers that vary between
            # otherwise-identical requests.
            my $request = dclone($proxy->request);
            $request->headers->remove_header($_)
                for qw/user-agent accept accept-language accept-charset
                       x-forwarded-for via/;

            # Key the frozen request to the frozen response in a Berkeley DB file.
            tie my %clicked, "DB_File", "cache.db";
            $clicked{ freeze($request) } = freeze( $proxy->response );
            untie %clicked;
        })
    );
This begins by making a copy of the request, and removing some of the headers which are incidental to the request. We want any other requests we make to the same URL with the same data in the body to look the same as the current request object, so we get rid of all the headers which would make it distinctive. This means that when we freeze the request with Storable::freeze we can use it as a hash key, and freezing another request like it will come to the same hash key. Similarly, we freeze the response object so that we can retrieve it later; using a Berkeley DB means that we have a file we can move between machines easily.
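Pieced together, the whole store proxy is only a screenful of code. What follows is my own assembly of the fragments above rather than a listing from this article, so treat it as a sketch: the port and filename are just examples, and the $Storable::canonical line is an extra precaution, not in the original, which makes freeze sort hash keys so that identical requests serialize identically across separate runs.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DB_File;
    use Storable qw(dclone freeze);
    use HTTP::Proxy;
    use HTTP::Proxy::BodyFilter::complete;
    use HTTP::Proxy::BodyFilter::simple;

    # Extra precaution (see above): deterministic hash-key order in freeze().
    $Storable::canonical = 1;

    my $proxy = HTTP::Proxy->new( port => 8080 );

    # Wait for the complete response body before the storage filter runs.
    $proxy->push_filter(
        mime     => undef,
        response => HTTP::Proxy::BodyFilter::complete->new,
    );

    # Store each (normalized request => response) pair in a Berkeley DB file.
    $proxy->push_filter(
        mime     => undef,
        response => HTTP::Proxy::BodyFilter::simple->new(sub {
            return unless $proxy->response;
            my $request = dclone($proxy->request);
            $request->headers->remove_header($_)
                for qw/user-agent accept accept-language accept-charset
                       x-forwarded-for via/;
            tie my %clicked, "DB_File", "cache.db";
            $clicked{ freeze($request) } = freeze( $proxy->response );
            untie %clicked;
        })
    );

    $proxy->start;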
The Replay Proxy
The replay proxy is very similar. We need to use the same Berkeley database:
    use DB_File;
    tie my %clicked, "DB_File", "cache.db";
We need to be able to both freeze and thaw objects: to freeze the request into a hash key, and to thaw the response from out of the hash again.
use Storable qw(freeze thaw dclone);
So when the request comes in to the proxy, we want to look at it and see
if we've seen it before. This will be a body filter, because we want to wait until the whole request is available:
    $proxy->push_filter(
        mime    => undef,
        request => HTTP::Proxy::BodyFilter::simple->new( sub {
Our filter needs to do the same thing to the request as it did in the store filter:
            my $request = dclone($proxy->request);
            $request->headers->remove_header($_)
                for qw/user-agent accept accept-language accept-charset
                       x-forwarded-for via/;
And now, if we've seen this request before, we can retrieve the stored response and return it immediately:
            return unless my $response = $clicked{ freeze($request) };
            $proxy->response( thaw($response) );
        })
    );
If we don't set a response in a filter (that is, if we don't find the request in the database), then the request carries on to the remote server as normal. Of course, when we're disconnected, this will return an error, but it does enable us to intercept particular requests, which is what we wanted all along.
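For completeness, here's the replay proxy assembled the same way. Again, this is my own sketch of the pieces above rather than a verbatim listing; the port is just an example, and $Storable::canonical is my addition, set only so that lookups freeze requests the same way the store proxy did.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DB_File;
    use Storable qw(freeze thaw dclone);
    use HTTP::Proxy;
    use HTTP::Proxy::BodyFilter::simple;

    # Match the store proxy's serialization (my addition, see above).
    $Storable::canonical = 1;

    tie my %clicked, "DB_File", "cache.db";

    my $proxy = HTTP::Proxy->new( port => 8080 );

    # Look each normalized request up in the cache; if it's there, answer
    # from the database instead of going out to the network.
    $proxy->push_filter(
        mime    => undef,
        request => HTTP::Proxy::BodyFilter::simple->new( sub {
            my $request = dclone($proxy->request);
            $request->headers->remove_header($_)
                for qw/user-agent accept accept-language accept-charset
                       x-forwarded-for via/;
            return unless my $response = $clicked{ freeze($request) };
            $proxy->response( thaw($response) );
        })
    );

    $proxy->start;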
Conclusion
With this pair of proxies in place, we can run the store proxy on a machine which is directly connected to the Internet, store all our test data into a database, and then take the database home to an unconnected machine. From there we can do our development, hitting the same sites and getting the same responses as though we were connected, and ensure that our module gives the results we want. This works just as well for testing web-based modules.
Next time I'll be taking the technological temperature of the Perl community by reporting what's hot and popular at YAPC::Europe in Braga, and then (I hope) we'll look at using Perl to make Google a bit smarter.
TPJ