Friday, June 20, 2008

a dependency analysis framework for Cabal

My GSoC project consists in the realization of a dependency analysis framework for Cabal so that it can handle more complex packages like gtk2hs and in general provide features and fix bugs that require more knowledge about the sources than what Cabal has.

The framework, given a set of make-like rules and targets, must be able to build the targets performing the minimal set of actions, but giving the same result as if everything was rebuilt from scratch, and exploit parallelism if requested (like make -j).

To be able to reuse values from previous builds, thus avoiding unnecessary actions, we need to make sure that they are not invalidated from external modifications, so we track every variable parameter that can influence the outcome of an action as a dependency, if those haven't changed we can skip running the action and reuse the cached value.

This information is represented with a dependency graph, which is a DAG where there are edges to each node from its dependencies, unfortunately we can't create it statically and then perform the actions, since many dependencies can be extracted only after some intermediate values are computed, or, put in another way, the existence of an edge can depend on the value of a node, which could be represented as an edge between the two, giving an impredicative definition for the set of edges, and now you know where the title of this blog comes from.[1]

An example of this behaviour is easy to find: producing A.hi from A.hs requires reading the interface files of the modules imported by A, so you must first read A.hs o know which they are and add them as dependecies for A.hi. And it's not always a matter of a simple read, in the presence of preprocessors you may have to generate A.hs by running the action of another rule.


As of today i've implemented the core algorithm that interleaves building the graph from the rules and a top-down traversal that follows the various indipendent subtrees in parallel.
This way at each step we have multiple actions to run indipendently if there are cpu-cores to exploit and then we use the results to expand the graph before recursing.
At this point we can't yet share the generated graph between builds, only the values produced, but we can already perform some tests in a fake IO monad that logs the side-effects and everything seems to work fine. The trace is quite long so you can find it here.

A thing to note is that the algorithm uses an abstract notion of target so everything can be included as a dependency, from the existence of files to the result of "cabal configure" or the flags passed to it. For example we could use this to make "cabal install" skip the configure and build step if they aren't necessary. You only need to write enough detailed rules, and one of the aims is to make this easier by designing a set of combinators if not a fully EDSL, since the representation of the rules used by the graph is not exactly high level.

All of the work is done on this branch of the cabal repo:
darcs get --partial http://code.haskell.org/~Saizan/cabal

Any comments obviously welcomed :)


[1] More precisely from a comment by edwardk on #haskell when i asked for such a structure.

2 comments:

droundy said...

You might enjoy taking a look at franchise, which solves basically the same problems you describe. It's not terribly elegant, but looking at its approach might be informative. It does do dependency analysis (calling ghc to get the dependencies) and running in parallel.

darcs get http://darcs.net/repos/franchise

David

cheezychunk said...

For Cabal Online Cheats, Cabal Online Dupes, Cabal Online Alz cheats, Cabal Online Bots, Cabal Online Guides, and Walkthroughs click here