Superoptimiser pseudo-weeknotes
Good meeting with my supervisor (Des Watson) on campus today, discussing my dissertation, which is to produce a superoptimiser for the Java Virtual Machine: both to evaluate the approach in the context of VMs (no-one's looked at this yet, as far as I can see) and to look at how practically useful it could be. It's very early days - technically this is the first week I've worked on it, though I've done a few days here and there to distract myself from revision.
I'm close to having a toy implementation of a working superoptimiser for Java bytecode (though there's a lot to do once that's ready), and I've been working through the literature this last week. Tom Crick did some work on using Answer Set Programming to generate formally proven optimal code, and I've revisited an old (and probably classic, by now) Tanenbaum paper on peephole optimisation of intermediate code. The latter is quite relevant because it proposed an architecture which is at least superficially similar to the JVM's.
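In case the core idea isn't obvious, here's the approach boiled down to a toy Java sketch. This isn't my actual implementation: it uses a made-up handful of stack operations rather than real JVM opcodes, and an arbitrary target function, but the loop is the essence of superoptimisation: enumerate every instruction sequence up to some length, run each candidate against a set of test inputs, and keep the ones whose behaviour matches the target.

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.List;

    // Toy illustration of the superoptimisation loop: enumerate every sequence
    // of opcodes up to a given length, run each on a handful of test inputs,
    // and print any sequence whose results match the target function.
    // A simplification: real JVM instructions, operands and equivalence
    // testing are far messier.
    public class ToySuperoptimiser {

        // A tiny, hypothetical subset of stack operations standing in for JVM opcodes.
        enum Op { DUP, POP, IADD, IMUL, INEG, ICONST_1, ICONST_2 }

        // Interpret a sequence of ops over an operand stack seeded with one int input.
        // Returns null if the sequence is invalid (e.g. it pops an empty stack).
        static Integer run(Op[] seq, int input) {
            Deque<Integer> stack = new ArrayDeque<>();
            stack.push(input);
            for (Op op : seq) {
                try {
                    switch (op) {
                        case DUP:      stack.push(stack.peek()); break;
                        case POP:      stack.pop(); break;
                        case IADD:     stack.push(stack.pop() + stack.pop()); break;
                        case IMUL:     stack.push(stack.pop() * stack.pop()); break;
                        case INEG:     stack.push(-stack.pop()); break;
                        case ICONST_1: stack.push(1); break;
                        case ICONST_2: stack.push(2); break;
                    }
                } catch (Exception e) {
                    return null; // stack underflow: not a valid candidate
                }
            }
            return stack.isEmpty() ? null : stack.peek();
        }

        // Does this candidate behave like the target (here: x -> 2x) on every test input?
        static boolean equivalent(Op[] seq, int[] tests) {
            for (int x : tests) {
                Integer result = run(seq, x);
                if (result == null || result != 2 * x) return false;
            }
            return true;
        }

        public static void main(String[] args) {
            int[] tests = {0, 1, -1, 7, 1000};
            Op[] ops = Op.values();
            // Enumerate all sequences of length 1..3 and print any that pass the tests.
            for (int len = 1; len <= 3; len++) {
                int[] idx = new int[len];
                boolean done = false;
                while (!done) {
                    Op[] seq = new Op[len];
                    for (int i = 0; i < len; i++) seq[i] = ops[idx[i]];
                    if (equivalent(seq, tests)) System.out.println(List.of(seq));
                    // advance the "odometer" over opcode indices
                    int pos = len - 1;
                    while (pos >= 0 && ++idx[pos] == ops.length) { idx[pos] = 0; pos--; }
                    done = pos < 0;
                }
            }
        }
    }

Run it and at length 2 it finds both the obvious [ICONST_2, IMUL] and the multiplication-free [DUP, IADD], which is exactly the flavour of result superoptimisation is after: equivalent code you might not have thought to write yourself.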
I've also dipped into the world of formal proofs, courtesy of an email chat with Bernhard Reus: Abstract Interpretation, static program analysis, formal verification of machine code programs and Model Checking. Here I'm less confident: a lot of the work seems quite theoretical, that which is practical is not authoritative (understandably so given the undecidability of some key underlying problems), and if it's approximate, then what's the advantage of formal proof over oodles of statistical testing?
Des and I chatted a bit about test data and the need to avoid inadvertently producing algorithms which are tied too closely to the test data; he warned me (again) about the complexities that branching code will bring to both equivalence testing and pruning of candidate sequences; and we talked about minimal looping routines that could usefully be tested: one to sum the elements of an array, one to find the min/max elements, and one to sort. I'm not sure if I'll get onto them. My feeling is that I'll manage to get to sequences of about 6-7 instructions in the course of this project, but without a bolt of enlightenment will struggle to go further. Currently a sequence 3 long involves considering 278,614 candidate sequences; one 4 long produces so many that I got bored of waiting for my laptop to count them ;)
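To give a sense of the growth: treating the search naively as "every sequence of up to N opcodes drawn from a set of size k" gives k + k^2 + ... + k^N candidates. My 278,614 figure for length 3 doesn't come straight from that formula, but the shape of the growth is the point. A back-of-envelope sketch (the 200 here is an assumed, rough opcode-set size, not an exact count):

    import java.math.BigInteger;

    // Back-of-envelope size of the naive search space: sum of k^n for n = 1..N.
    // k = 200 is an assumption standing in for the JVM instruction set; real
    // candidate counts differ because sequences get filtered and pruned.
    public class SearchSpace {
        public static void main(String[] args) {
            BigInteger k = BigInteger.valueOf(200);
            BigInteger total = BigInteger.ZERO;
            for (int n = 1; n <= 6; n++) {
                total = total.add(k.pow(n));
                System.out.printf("max length %d: %,d candidate sequences%n", n, total);
            }
        }
    }

Each extra instruction of length multiplies the space by a couple of hundred, which is why 6-7 feels like the practical ceiling without much smarter pruning.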
I've floated the idea of reverse-engineering algorithms from the optimal sequences generated; we reckoned that to do anything useful with this you'd need sequences of 12-16 instructions, which feels a way off. Still, Des seemed super-excited and I left the meeting buzzing.
As an aside, this time of year my weekly jaunt onto the green and wooded landscape of Sussex University is an absolute pleasure...