Even if this blog has been silent since the first post, I've completed my gsoc project, producing a branch of Cabal that supports preprocessor chaining, incremental and parallel builds, and yhc.
Deciding how to best handle preprocessor chaining raised a lot of problems: gnu make handle this with suffix rules, but in our model we've abstracted out the concept of filenames, we only have opaque targets.
In the end we took the simplest approach: we represent a preprocessor as a matcher on suffixes in a way that we can translate a list of them down to the rules of our model, sort of like a compiler implements new user-level features by translating them into its core language.
Now the problem is how to integrate this framework in the mainline Cabal. It's quite easy to just use it for preprocessors and yhc (my branch does exactly this), but in the end we'd like to move most of the IO code in Cabal to our model so that users don't need to "cabal clean" from time to time to get consistent results.
We'd also like to provide a very high-level edsl that avoids the risks of wrong declarations of targets and dependencies, solving "what's wrong with make".
The developement could continue inside of Cabal, taking the burden of dealing with old code, limited available packages, and portability between compilers that doing so requires, but we preferred to spawn a separate project with the aim of replacing the building code in Cabal when we're sufficiently confident about it.
So now we've yet another haskell build tool, hbuild!
The first release should come very soon, since the first milestone corresponds to the work i've done this summer.
Monday, September 8, 2008
Friday, June 20, 2008
a dependency analysis framework for Cabal
My GSoC project consists in the realization of a dependency analysis framework for Cabal so that it can handle more complex packages like gtk2hs and in general provide features and fix bugs that require more knowledge about the sources than what Cabal has.
The framework, given a set of make-like rules and targets, must be able to build the targets performing the minimal set of actions, but giving the same result as if everything was rebuilt from scratch, and exploit parallelism if requested (like make -j).
To be able to reuse values from previous builds, thus avoiding unnecessary actions, we need to make sure that they are not invalidated from external modifications, so we track every variable parameter that can influence the outcome of an action as a dependency, if those haven't changed we can skip running the action and reuse the cached value.
This information is represented with a dependency graph, which is a DAG where there are edges to each node from its dependencies, unfortunately we can't create it statically and then perform the actions, since many dependencies can be extracted only after some intermediate values are computed, or, put in another way, the existence of an edge can depend on the value of a node, which could be represented as an edge between the two, giving an impredicative definition for the set of edges, and now you know where the title of this blog comes from.[1]
An example of this behaviour is easy to find: producing A.hi from A.hs requires reading the interface files of the modules imported by A, so you must first read A.hs o know which they are and add them as dependecies for A.hi. And it's not always a matter of a simple read, in the presence of preprocessors you may have to generate A.hs by running the action of another rule.
As of today i've implemented the core algorithm that interleaves building the graph from the rules and a top-down traversal that follows the various indipendent subtrees in parallel.
This way at each step we have multiple actions to run indipendently if there are cpu-cores to exploit and then we use the results to expand the graph before recursing.
At this point we can't yet share the generated graph between builds, only the values produced, but we can already perform some tests in a fake IO monad that logs the side-effects and everything seems to work fine. The trace is quite long so you can find it here.
A thing to note is that the algorithm uses an abstract notion of target so everything can be included as a dependency, from the existence of files to the result of "cabal configure" or the flags passed to it. For example we could use this to make "cabal install" skip the configure and build step if they aren't necessary. You only need to write enough detailed rules, and one of the aims is to make this easier by designing a set of combinators if not a fully EDSL, since the representation of the rules used by the graph is not exactly high level.
All of the work is done on this branch of the cabal repo:
darcs get --partial http://code.haskell.org/~Saizan/cabal
Any comments obviously welcomed :)
[1] More precisely from a comment by edwardk on #haskell when i asked for such a structure.
The framework, given a set of make-like rules and targets, must be able to build the targets performing the minimal set of actions, but giving the same result as if everything was rebuilt from scratch, and exploit parallelism if requested (like make -j).
To be able to reuse values from previous builds, thus avoiding unnecessary actions, we need to make sure that they are not invalidated from external modifications, so we track every variable parameter that can influence the outcome of an action as a dependency, if those haven't changed we can skip running the action and reuse the cached value.
This information is represented with a dependency graph, which is a DAG where there are edges to each node from its dependencies, unfortunately we can't create it statically and then perform the actions, since many dependencies can be extracted only after some intermediate values are computed, or, put in another way, the existence of an edge can depend on the value of a node, which could be represented as an edge between the two, giving an impredicative definition for the set of edges, and now you know where the title of this blog comes from.[1]
An example of this behaviour is easy to find: producing A.hi from A.hs requires reading the interface files of the modules imported by A, so you must first read A.hs o know which they are and add them as dependecies for A.hi. And it's not always a matter of a simple read, in the presence of preprocessors you may have to generate A.hs by running the action of another rule.
As of today i've implemented the core algorithm that interleaves building the graph from the rules and a top-down traversal that follows the various indipendent subtrees in parallel.
This way at each step we have multiple actions to run indipendently if there are cpu-cores to exploit and then we use the results to expand the graph before recursing.
At this point we can't yet share the generated graph between builds, only the values produced, but we can already perform some tests in a fake IO monad that logs the side-effects and everything seems to work fine. The trace is quite long so you can find it here.
A thing to note is that the algorithm uses an abstract notion of target so everything can be included as a dependency, from the existence of files to the result of "cabal configure" or the flags passed to it. For example we could use this to make "cabal install" skip the configure and build step if they aren't necessary. You only need to write enough detailed rules, and one of the aims is to make this easier by designing a set of combinators if not a fully EDSL, since the representation of the rules used by the graph is not exactly high level.
All of the work is done on this branch of the cabal repo:
darcs get --partial http://code.haskell.org/~Saizan/cabal
Any comments obviously welcomed :)
[1] More precisely from a comment by edwardk on #haskell when i asked for such a structure.
Subscribe to:
Posts (Atom)