Pipelines in Python

Generators (PEP 255 "Simple Generators") and Coroutines (PEP 342 "Coroutines via Enhanced Generators") are the cleanest way I've come across so far to implement the concept of a "pipeline" in Python.

First approximation

A pipeline is made of:

a Producer, that generates data;
many /Stage/s, that receive data from the previous stage and send it to the next;
a Consumer, that receives data from the last stage.

The producer is a coroutine that only send/s data, generated internally from some initial state. /Stage/s are coroutines that both receive and send messages. The /consumer only receives data. Chaining is done in function pipeline: each argument but the last is instantiated with an instance of the next stage. The full pipeline is started by issuing a next (or send(None)) to the Producer.

In the following example, a stream of integers is produced and pushed down the pipeline: each stage adds 1 and finally the result is printed in the consumer.

Wrapping it up

A pattern emerges, so we'd better wrap it up in a class. Moreover, let's split the "architecture" of the pipeline from the behavior of each stage.

More useful example

As a more interesting application, here is how to use a pipeline to implement a simple crawler, to download links from http://news.ycombinator.com/ and find all the posts where the word "Python" is mentioned.

Cleaning things up

Things are still far from clean and bulletproof. One step in the right direction is to follow the suggestions found in David Beazley's presentation on coroutines.

The previous examples is by no means "production ready", but maybe someone will find some good idea to apply to real world problems.