09 October 2011

Loop counting: Google's multi-language-bench

C++ is not a cool language. I think it once was, back in the early 1990s; certainly there were a lot of books and magazine articles written about it at that time. Back then I wrote code mostly in C. I think it took quite a while before I really got object orientation, but eventually I became comfortable with C++, probably around the time that Java became the new cool language.

I mention this because it made me smile when a little while ago a friend showed me a paper (pdf) from Google that described how they implemented the same algorithm in several different languages and found the one written in C++ the fastest. You know how brand loyalty is apparently tied up with our self-esteem? Hey, C++ may not be cool, but when you got important things you gotta do fast you need a tool man-enough for the job, I thought. Of course, benchmarks are notoriously unreliable, and I didn't really believe this. But it still made me smile.

Maybe I shouldn't let my self-esteem get entangled with a programming language. But I think it's not unnatural that after spending considerable time with a language one may become attached to it. I'd like to state that I am aware of this proverb: if all you have is a hammer, everything looks like a nail; I subscribe to the notion that learning an unfamiliar programming language may broaden your mind. A little. Which is a good thing.

Reading the Google paper I came across this
“We find that in regards to performance, C++ wins out by a large margin. However, it also required the most extensive tuning efforts, many of which were done at a level of sophistication that would not be available to the average programmer.”
They have a high opinion of themselves don’t they? Doubtless it’s justified. But it sounds like a challenge...

One of the promises of object oriented design is that you may be able to change the implementation of an object without changing its interface. So I wondered if I could improve on the performance of the Google C++ implementation without changing the test driver. After about 30 seconds of reading the code I decided that that sounded too much like hard work: much more fun to continue in the great programming tradition and rewrite the code. (The Google code has objects with getters that return pointers to private member variables; this restricts the freedom to change the implementation. Breaking encapsulation may be justified for run-time performance gains. The penalty is that a change to the implementation may mean a change to the user code, or a run-time penalty to maintain the old interface.)

So anyway, I wrote the version shown below. To make execution time comparisons meaningful I tried to keep the functionality identical to the Google C++ implementation from Robert Hundt's paper. Russel Cox also implemented the algorithm. So when I talk about Robert's code or Russel's code these are what I'm referring to.

antloops.h
#ifndef ANTLOOPS_H_INCLUDED
#define ANTLOOPS_H_INCLUDED
/* This code is an implementation of the analyze_loops algorithm from
Nesting of Reducible and Irriducible Loops by Paul Havlak published
in ACM ToPLaS, July 1997. (Referred to below as "Havlak's paper.")
See http://howtowriteaprogram.blogspot.com/

Copyright (c) 2011 Anthony C. Hay. Some Rights Reserved.
Published under the BSD 2-clause license.
See http://www.opensource.org/licenses/BSD-2-Clause
*/

#include <vector>
#include <stdexcept>


namespace anthay {

typedef int block_name;

/* a basic_block is a node or vertex in a control-flow graph
(a basic block is a sequence of zero or more statements in a computer
program that would be executed in-sequence with no jumps into or
out of the sequence, described briefly in Havlak's paper; it doesn't
matter, for our purposes it's just a node in a graph) */
class basic_block {
public:
void add_out_edge(block_name to)
{
out_edges_.push_back(to);
}

void add_in_edge(block_name from)
{
in_edges_.push_back(from);
}

typedef std::vector<block_name> edge_vec_t;
const edge_vec_t & out_edges() const
{
return out_edges_;
}
const edge_vec_t & in_edges() const
{
return in_edges_;
}

private:
edge_vec_t out_edges_;
edge_vec_t in_edges_;
};


// represent one control-flow graph
class mao_cfg {
public:
mao_cfg()
: start_node_(undefined)
{
}

void reserve(int size)
{
node_vec_.reserve(size);
}

// return a pointer to a basic_block with the given 'name';
// pointer is guaranteed valid only until next call to any non-const function
basic_block * create_node(block_name name)
{
const int size = static_cast<int>(node_vec_.size());
if (size == 0)
start_node_ = name; // assume first created block is graph start
if (name >= size)
node_vec_.resize(name + 1);
return &node_vec_[name];
}

void create_edge(block_name from, block_name to)
{
create_node(from)->add_out_edge(to);
create_node(to)->add_in_edge(from);
}

// return the name of the start node
block_name start_node() const
{
if (start_node_ == undefined)
throw std::runtime_error("undefined start node");
return start_node_;
}

// return the number of nodes in the graph
int size() const
{
return node_vec_.size();
}

// return a reference to the node with the given 'name';
// reference guaranteed valid only until next call to any non-const function
const basic_block & node(block_name name) const
{
return node_vec_[name];
}

private:
enum { undefined = -1 };
block_name start_node_;
typedef std::vector<basic_block> node_vec_t;
node_vec_t node_vec_;
};


class loop_structure_graph {
public:
loop_structure_graph() : count_(0) {}
void clear() { count_ = 0; }
void add_loop() { ++count_; }
int size() const { return count_; }

void calculate_nesting_level()
{
/* I'm not implementing this because it has no effect on
the performance of either the Robert Hundt or Russel Cox
C++ implementations of the Havlak algorithm. Robert's
LoopStructureGraph::CalculateNestingLevel() is never
called. Russel's LoopGraph::CalculateNesting() is called,
but as Loop::isRoot is not initialised (and is therefore
mostly "true"), it does (almost) nothing. */
}

private:
int count_;
};


// find loops in given 'cfg' using the analyze_loops algorithm from Havlak's
// paper; return the number of loops found
int find_havlak_loops(const mao_cfg & cfg, loop_structure_graph & lsg);

}
#endif


antloops.cpp
/*  This code is an implementation of the analyze_loops algorithm from
Nesting of Reducible and Irriducible Loops by Paul Havlak published
in ACM ToPLaS, July 1997. (Referred to below as "Havlak's paper.")
See http://howtowriteaprogram.blogspot.com/

Copyright (c) 2011 Anthony C. Hay. Some Rights Reserved.
Published under the BSD 2-clause license.
See http://www.opensource.org/licenses/BSD-2-Clause
*/

#include "antloops.h"
#include <algorithm>

namespace anthay {
namespace {

typedef int preorder;


// implements the DFS algorithm from Havlak's paper
class depth_first_search {
public:
depth_first_search(const mao_cfg & cfg)
: cfg_(&cfg), current_preorder_(0),
node_(cfg.size(), unvisited), number_(cfg.size(), unvisited), last_(cfg.size())
{
dfs(cfg.start_node());
}

// return the preorder number of the last visited node, or -1
// if no nodes were visited
preorder last_preorder() const
{
return current_preorder_ - 1;
}

// return true iff 'p' is set to the preorder number assigned to node
// with the given name 'a'; false means there was no path to 'a'
bool preorder_number(block_name a, preorder & p) const
{
p = number_[a];
return p != unvisited;
}

// return the block name with the given preorder number 'p';
// 'p' _must_ be in the range 0 .. last_preorder()
block_name block(preorder p) const
{
return node_[p];
}

// return true iff node 'w' is an ancestor of node 'v', where
// w and v are the preorder numbers of the nodes
bool is_ancestor(preorder w, preorder v) const
{
return w <= v && v <= last_[w];
}

private:
const mao_cfg * cfg_;
preorder current_preorder_; // used in dfs() to assign a preorder number to each node visited
std::vector<block_name> node_; // node_[a] is the block_name of node with preorder number a
std::vector<preorder> number_; // number_[a] is the depth-first (or preorder) number of node named a
std::vector<preorder> last_; // last_[number_[a]] is the last decendant of node a

enum { unvisited = -1 };

// recursive depth-first search; the DFS algorithm from Havlak
void dfs(block_name a)
{
node_[current_preorder_] = a;
number_[a] = current_preorder_++;
// foreach node b such that there exists an edge (a,b) do...
const basic_block::edge_vec_t & out_edges = cfg_->node(a).out_edges();
for (unsigned i = 0; i < out_edges.size(); ++i)
if (number_[out_edges[i]] == unvisited)
dfs(out_edges[i]);
last_[number_[a]] = current_preorder_ - 1;
}
};


// implements the UNION/FIND algorithm required by Havlak's algorithm
class union_find {
public:
// create 'size' elements named 0 .. size-1, with no connections between elements
explicit union_find(int size)
: element_(size)
{
for (int i = 0; i < size; ++i)
element_[i] = i;
}

// return the name of the root element of the element named 'p'
// (beneficial side-effect: the path from 'p' to the root is compressed)
int find_root(int p)
{
// I found this a good place to learn about the UNION/FIND algorithm:
// http://www.cs.princeton.edu/~rs/AlgsDS07/01UnionFind.pdf
while (p != element_[p])
p = element_[p] = element_[element_[p]];
return p;
}

// merge the set containing element 'a' into the set containing element 'b'
void unite(int a, int b)
{
const int ra = find_root(a);
element_[ra] = find_root(b);
}

private:
std::vector<preorder> element_;
};


// return true iff given 'v' contains given element 'elem'
inline bool is_member(const std::vector<preorder> & v, preorder elem)
{
return std::find(v.begin(), v.end(), elem) != v.end();
}

const int max_non_back_preds = 32 * 1024;

}// anonymous namespace



// find loops in given 'cfg' using the analyze_loops algorithm from Havlak's
// paper; record loops in given 'lsg' and return the number of loops found
int find_havlak_loops(const mao_cfg & cfg, loop_structure_graph & lsg)
{
if (cfg.size() == 0)
return 0;

// step a:
const depth_first_search dfs(cfg);

// step b:
enum node_type { non_header, self, reducible, irreducible };
// type[p] is node_type of node with preorder number p
std::vector<int> type(dfs.last_preorder() + 1, non_header);
std::vector<preorder> header(cfg.size(), cfg.start_node());
std::vector<std::vector<preorder> > back_preds(cfg.size()); // backedge predecessors
std::vector<std::vector<preorder> > non_back_preds(cfg.size()); // non-backedge predecessors

for (preorder w = 0; w <= dfs.last_preorder(); ++w) {
const basic_block::edge_vec_t & in_edges = cfg.node(dfs.block(w)).in_edges();
for (unsigned j = 0; j < in_edges.size(); ++j) {
preorder v;
if (!dfs.preorder_number(in_edges[j], v))
continue; // in_edges[j] was not visited
if (dfs.is_ancestor(w, v))
back_preds[w].push_back(v);
else if (!is_member(non_back_preds[w], v))
non_back_preds[w].push_back(v);
}
}
header[cfg.start_node()] = 0;

// step c:
union_find preorder_set(dfs.last_preorder() + 1);
std::vector<preorder> p;
for (preorder w = dfs.last_preorder(); w >= 0; --w) {
p.clear();

// step d:
const std::vector<preorder> & back_preds_w = back_preds[w];
typedef std::vector<preorder>::const_iterator preorder_iter;
for (preorder_iter v = back_preds_w.begin(); v != back_preds_w.end(); ++v) {
if (*v != w)
p.push_back(preorder_set.find_root(*v));
else
type[w] = self;
}
if (!p.empty()) {
type[w] = reducible;
for (unsigned ix = 0; ix != p.size(); ++ix) {
const preorder x = p[ix];

// step e:
const std::vector<preorder> & non_back_preds_set = non_back_preds[x];
if (non_back_preds_set.size() > max_non_back_preds) {
// algorithm has degenerated, aparently
lsg.clear();
return 0;
}
for (preorder_iter y = non_back_preds_set.begin(); y != non_back_preds_set.end(); ++y) {
const preorder yprime = preorder_set.find_root(*y);
if (!dfs.is_ancestor(w, yprime)) {
type[w] = irreducible;
if (!is_member(non_back_preds[w], yprime))
non_back_preds[w].push_back(yprime);
}
else if (yprime != w && !is_member(p, yprime))
p.push_back(yprime);
}
}
}

if (!p.empty() || type[w] == self) {
for (preorder_iter x = p.begin(); x != p.end(); ++x) {
header[*x] = w;
preorder_set.unite(*x, w);
}
lsg.add_loop();
}
}

return lsg.size();
}


}


antloops_test.cpp
/*  This code is an implementation of the test driver in the paper
Loop Recognition in C++/Java/Go/Scala by Robert Hundt.
See http://code.google.com/p/multi-language-bench/

Copyright (c) 2011 Anthony C. Hay. Some Rights Reserved.
Published under the BSD 2-clause license.
See http://www.opensource.org/licenses/BSD-2-Clause
See also http://howtowriteaprogram.blogspot.com/
*/

#include "antloops.h"
using namespace anthay;

#include <iostream>
#include <cassert>


// create a diamond from 'start' to two intermediate nodes back to one
// end node; return the name (or index) of the end node
int build_diamond(mao_cfg & cfg, int start)
{
cfg.create_edge(start, start + 1);
cfg.create_edge(start, start + 2);
cfg.create_edge(start + 1, start + 3);
cfg.create_edge(start + 2, start + 3);
return start + 3;
}


// when n > 0, create a daisy chain of nodes containing 'n' edges:
// start -> start+1 -> start+2 -> ... -> start+n
// return the name of the last node in the chain
int build_straight(mao_cfg & cfg, int start, int n)
{
for (int i = 0; i < n; ++i)
cfg.create_edge(start + i, start + i + 1);
return start + n;
}


// create two diamonds in series, add three further edges to introduce
// 3 loops; return the name of the last node in the created graph
int build_base_loop(mao_cfg & cfg, int from)
{
int header = build_straight(cfg, from, 1);
int diamond1 = build_diamond(cfg, header);
int d11 = build_straight(cfg, diamond1, 1);
int diamond2 = build_diamond(cfg, d11);
int footer = build_straight(cfg, diamond2, 1);
cfg.create_edge(diamond2, d11);
cfg.create_edge(diamond1, header);
cfg.create_edge(footer, from);
footer = build_straight(cfg, footer, 1);
return footer;
}


int main()
{
// build a control flow graph and count loops like the code
// in Robert Hundt's LoopTesterApp.cpp so that execution time
// comparisons with that code are meaningful
std::cerr << "counting loops...";

mao_cfg cfg;
cfg.create_node(0);
build_base_loop(cfg, 0);
cfg.create_node(1);
cfg.create_edge(0, 2);

for (int i = 0; i < 15000; ++i) {
loop_structure_graph lsg;
find_havlak_loops(cfg, lsg);
}

int n = 2;
for (int parlooptrees = 0; parlooptrees < 10; parlooptrees++) {
cfg.create_edge(2, ++n);

for (int i = 0; i < 100; i++) {
const int top = n;
n = build_straight(cfg, n, 1);
for (int j = 0; j < 25; j++)
n = build_base_loop(cfg, n);
const int bottom = build_straight(cfg, n, 1);
cfg.create_edge(n, top);
n = bottom;
}
cfg.create_edge(n, 1);
}

loop_structure_graph lsg;
const int num_loops = find_havlak_loops(cfg, lsg);
std::cerr << "\nnum_loops = " << num_loops << "\n";
assert(num_loops == 76001);

for (int i = 0; i < 50; i++) {
loop_structure_graph lsg;
std::cerr << ".";
find_havlak_loops(cfg, lsg);
}

lsg.calculate_nesting_level();
}


(I found I needed to reserve about 10 MB stack space to get Robert's code to produce a result. You may need to do the same for mine.)

Probably the most interesting difference between the implementations is that Robert's and Russel's both use explicit news and deletes and mine does not. I don't mean to say that there is anything wrong with raw pointers - I use them all the time - I simply chose a different approach to the problem using vectors and indices instead. The fact that the nodes are named with successive integers and may be referenced by a 'preorder' number, an integer, suggested a vector would be an obvious data structure to use.

Using collections, such as vectors, can make the code simpler, less likely to leak memory and easier to make exception-safe.

I think the code is in the spirit of the Google paper, which was that it should be written using "the default, idiomatic data structures in each language, as well as default type systems, memory allocation schemes, and default iteration constructs the basic facilities provided in the language."

Is it correct?


If it was not a requirement to be correct we could give an answer to any problem in an arbitrarily short time. So let's ask first if the code gives the right answer.

My code calculates there are 76,001 loops in the test graph. Robert's gives 76,002 and Russel's gives 76,000. Oh dear. Which is right?

Playing with Robert's code I found that it always exaggerates the number of loops by exactly one; for example it returns 1 for a graph containing no loops at all. So adjusting for this means mine and Robert's agree.

I corresponded briefly with Russel and he was kind enough to respond to my questions. It turns out there is a slight difference between Russel's graph and Robert's in the first few nodes: in Robert's there is an edge connecting nodes 1 -> 2, in Russel's there isn't. This difference means there is one less loop in Russel's graph, so his result is correct for his graph.

Looking at the code that builds the graph I would have said there were ((((3 x 25) + 1) x 100) + 1) x 10 = 76,010. But I'd be wrong, probably because overlapping loops are not counted more than once, so the 'and 10' becomes 'and 1'.

In short, I believe there are 76,001 loops in the Robert Hundt C++ graph and my implementation counts them correctly.

I like to remember I have an amazing symbol manipulation tool at my command that I can use to help me check my own work. (Not my own brain, lol.) I mean I may be able to write some code to test my code. Since I have the two Google implementations to test against the obvious thing to do is to build some graphs and compare the loop counts returned by each implementation. I wrapped each in its own namespace and linked all three into one test program.

The first idea I had was to generate graphs containing a random number of nodes with a random number of edges between random nodes, like this

int bigrand()
{
return rand() * rand();
}

// generate three representations of the same random graph
void generate_random_graph(
rhundt::MaoCFG & rhcfg,
russcox::CFG & russcfg,
anthay::mao_cfg & antcfg,
int num_nodes,
int num_edges)
{
// create all nodes; nodes are named 0..num_nodes-1
for (block_name b = 0; b < num_nodes; ++b) {
rhcfg.CreateNode(b);
russcfg.NewBlock();
antcfg.create_node(b);
}

// link pairs of nodes chosen at random
for (int i = 0; i < num_edges; ++i) {
const block_name a = bigrand() % num_nodes;
const block_name b = bigrand() % num_nodes;
// we don't care if we link a node to itself
// or duplicate an existing link

new rhundt::BasicBlockEdge(&rhcfg, a, b);
russcfg.Connect(russcfg.block[a], russcfg.block[b]);
antcfg.create_edge(a, b);
}
}

// create a random graph and count the loops using all three implementations
void test_graph(int num_nodes, int num_edges)
{
rhundt::MaoCFG rhcfg;
russcox::CFG russcfg;
anthay::mao_cfg antcfg;

generate_random_graph(rhcfg, russcfg, antcfg, num_nodes, num_edges);
std::cerr
<< "nodes " << std::setw(6) << num_nodes
<< " edges " << std::setw(6) << num_edges
<< " loops ";

anthay::loop_structure_graph antlsg;
const int ant_loops = anthay::find_havlak_loops(antcfg, antlsg);
std::cerr << std::setw(5) << ant_loops;

russcox::LoopGraph russlsg;
russcox::LoopFinder f;
f.FindLoops(&russcfg, &russlsg);
const int rcox_loops = russlsg.loop.size();
std::cerr << " " << std::setw(5) << rcox_loops;

rhundt::LoopStructureGraph rhlsg;
// (to make consistent with others shift result form 1..n to 0..n-1)
const int rhundt_loops = rhundt::FindHavlakLoops(&rhcfg, &rhlsg) - 1;
std::cerr << " " << std::setw(5) << rhundt_loops;

if (rhundt_loops != ant_loops || rcox_loops != ant_loops)
std::cerr << " <=== disagreement";

std::cerr << "\n";
}

int main()
{
const int max_graph_size = 10000;

srand(20111002);
for (;;)
test_graph((bigrand() % max_graph_size) + 1, bigrand() % max_graph_size);
}



Here is the output from a typical run

nodes   8834 edges   1940 loops     0     0     0
nodes 3112 edges 5351 loops 379 379 379
nodes 6031 edges 5518 loops 0 0 0
nodes 7929 edges 2756 loops 0 0 0
nodes 3841 edges 8340 loops 740 740 740
nodes 840 edges 608 loops 12 12 12
nodes 9548 edges 8367 loops 134 134 134
nodes 6 edges 9525 loops 6 6 6
nodes 1185 edges 7226 loops 581 581 581
nodes 8411 edges 8406 loops 0 0 0
nodes 3179 edges 2630 loops 0 0 0
nodes 9249 edges 6192 loops 0 0 0
nodes 4937 edges 9384 loops 657 657 657
nodes 6911 edges 1174 loops 0 0 0
nodes 6541 edges 5294 loops 0 0 0
nodes 4117 edges 6625 loops 324 324 324
nodes 6245 edges 449 loops 0 0 0
nodes 4289 edges 4160 loops 0 0 0
nodes 9031 edges 5296 loops 0 0 0
nodes 3083 edges 1959 loops 0 0 0
nodes 321 edges 7315 loops 270 270 270
nodes 6165 edges 4654 loops 5 5 5
nodes 4377 edges 877 loops 0 0 0
nodes 5009 edges 1100 loops 0 0 0
nodes 6081 edges 7288 loops 144 144 144
nodes 7559 edges 5344 loops 0 0 0
nodes 2729 edges 2896 loops 0 0 0
nodes 7425 edges 6250 loops 65 65 65
nodes 5134 edges 9532 loops 795 795 795
nodes 7571 edges 3600 loops 0 0 0
nodes 4475 edges 8744 loops 648 648 648
nodes 1182 edges 3650 loops 328 328 328
nodes 4268 edges 4450 loops 95 95 95
nodes 761 edges 284 loops 0 0 0
nodes 7021 edges 50 loops 0 0 0
nodes 6036 edges 7762 loops 370 370 370
nodes 4816 edges 6359 loops 327 327 327
nodes 2417 edges 9300 loops 1207 1207 1207
nodes 289 edges 5292 loops 259 259 259
. . . and so on . . .


During several hours running this test there were no disagreements between any of the implementations (after adjusting Robert's by -1).

The second idea I had was to generate graphs with all possible combinations of connections between nodes. This is only going to be practical for small graphs: for one node there are two possible graphs: the node has no outgoing edges and the node has an outgoing edge to itself. For two nodes there are 16 possible graphs ranging from the graph where neither node has any outgoing edges to the graph where each node has an outgoing edge to itself and to the other node. I think for n nodes there will be 2^(n^2) possible graphs (I'm assuming no duplicate edges). For 5 nodes there are 33,554,432 possible graphs and for 6 there are 68,719,476,736. So 5 nodes is probably about the limit of what I could reasonably check. Here's what I wrote to test all possible graphs up to 5 nodes

// generate all possible graphs having 'n' nodes and test
// all loop counting implementations return the same count
void test_all_possible_graphs(int n)
{
assert(1 <= n && n <= 5);
const int end = 1 << (n * n); // we will test this number of graphs
for (int bitmap = 0; bitmap < end; ++bitmap) {
/* think of bitmap as n groups of n bits; each node has
its own group, each bit in that group determines
whether there is an out edge from that node to the
node corresponding to that bit */

// create all nodes
anthay::mao_cfg antcfg;
rhundt::MaoCFG huncfg;
for (int a = 0; a < n; ++a) {
antcfg.create_node(a);
huncfg.CreateNode(a);
}

// add edges only where the corresponding bit in bitmap is set
int bit = 1;
for (int a = 0; a < n; ++a) {
for (int b = 0; b < n; ++b) {
if (bitmap & bit) {
antcfg.create_edge(a, b);
new rhundt::BasicBlockEdge(&huncfg, a, b);
}
bit <<= 1;
}
}

// count loops
anthay::loop_structure_graph antlsg;
const int ant_loops = anthay::find_havlak_loops(antcfg, antlsg);
rhundt::LoopStructureGraph hunlsg;
const int rhundt_loops = rhundt::FindHavlakLoops(&huncfg, &hunlsg) - 1;

// compare counts
if (rhundt_loops != ant_loops)
std::cout << "disagreement for 0x" << std::hex << bitmap << "\n";
if (bitmap % 1000000 == 0)
std::cerr << '.';
}
}

int main()
{
for (int n = 1; n <= 5; ++n)
test_all_possible_graphs(n);
}


It takes about 10 minutes to count the loops in all possible graphs having 5 nodes or fewer and confirm that in all cases my implementation returns the same loop count as Robert's. (I also tested against Russel's implementation, but that plotted a stairway to heaven on my memory usage monitor: Russel didn't free any of the memory he allocated. The memory usage for Hundt+Hay was a flat line.)

I'm sure there are other tests I could do. I could count loops in more 'hand-built' graphs like the original 76,001 looper, for example. But at this point I'm going to say that my implementation is an accurate analog of the original Robert Hundt code.

Is it efficient?


Actually, let's just ask if it's fast. Here are the timings for the three implementations running on my 2008 Mac with a 2.8 GHz Core 2 Duo processor:


$ g++
i686-apple-darwin11-llvm-g++-4.2: no input files
$
$
$ g++ -O3 LoopTesterApp.cpp mao-loops.cpp
$ time ./a.out
Welcome to LoopTesterApp, C++ edition
Constructing App...
Constructing Simple CFG...
15000 dummy loops
Constructing CFG...
Performing Loop Recognition
1 Iteration
Another 50 iterations...
..................................................
Found 76002 loops (including artificial root node)(3800100)
loop-0, nest: 0, depth: 0

real 0m31.326s
user 0m29.501s
sys 0m1.503s
$
$
$ g++ -O3 russcox.cpp
$ time ./a.out
# of loops: 76000 (including 1 artificial root node)

real 0m5.497s
user 0m5.355s
sys 0m0.086s
$
$
$ g++ -O3 antloops_test.cpp antloops.cpp
$ time ./a.out
counting loops...
num_loops = 76001
..................................................
real 0m6.511s
user 0m6.009s
sys 0m0.279s
$


In summary:

rhundt 31.3 seconds = 100% of rhundt
anthay 6.5 seconds = 20.8% of rhundt
russcox 5.5 seconds = 17.6% of rhundt

(The above is an update to the original post after I upgraded to GCC 4.2)

So Russel's code wins, but mine is a creditable runner-up.

It's interesting watching the random graph test program run: occasionally a graph will be generated where it looks like Robert's code will count the loops quicker than either Russel's or mine. I'm wary of benchmarks. People say that, like statistics, you can find a benchmark to support any point of view. Hang on!... I wonder if... Voilà!...

I added code to the random graph test to time how long each implementation took to count loops and ran it for 100 graphs. The code and the results are shown here.

// create a random graph and count the loops using all three implementations
void test_graph(
int num_nodes,
int num_edges,
clock_t & ant_ms,
clock_t & rus_ms,
clock_t & rhu_ms)
{
rhundt::MaoCFG rhcfg;
russcox::CFG russcfg;
anthay::mao_cfg antcfg;

generate_random_graph(rhcfg, russcfg, antcfg, num_nodes, num_edges);
std::cout
<< "nodes " << std::setw(6) << num_nodes
<< " edges " << std::setw(6) << num_edges
<< " loops ";

static_assert(CLOCKS_PER_SEC >= 1000, "code assumes at least millisecond clock() resolution");

clock_t start = clock() / (CLOCKS_PER_SEC / 1000);
anthay::loop_structure_graph antlsg;
const int ant_loops = anthay::find_havlak_loops(antcfg, antlsg);
clock_t stop = clock() / (CLOCKS_PER_SEC / 1000);
ant_ms = stop - start;
std::cout << std::setw(5) << ant_loops << " " << std::setw(5) << ant_ms << "ms";

start = clock() / (CLOCKS_PER_SEC / 1000);
russcox::LoopGraph russlsg;
russcox::LoopFinder f;
f.FindLoops(&russcfg, &russlsg);
const int rcox_loops = russlsg.loop.size();
stop = clock() / (CLOCKS_PER_SEC / 1000);
rus_ms = stop - start;
std::cout << " : " << std::setw(5) << rcox_loops << " " << std::setw(5) << rus_ms << "ms";

start = clock() / (CLOCKS_PER_SEC / 1000);
rhundt::LoopStructureGraph rhlsg;
// (to make consistent with others shift result form 1..n to 0..n-1)
const int rhundt_loops = rhundt::FindHavlakLoops(&rhcfg, &rhlsg) - 1;
stop = clock() / (CLOCKS_PER_SEC / 1000);
rhu_ms = stop - start;
std::cout << " : " << std::setw(5) << rhundt_loops << " " << std::setw(5) <v rhu_ms << "ms";

if (rhundt_loops != ant_loops || rcox_loops != ant_loops)
std::cout << " <=== disagreement";

std::cout << "\n";
}

int main()
{
clock_t ant_ms_total = 0;
clock_t rus_ms_total = 0;
clock_t rhu_ms_total = 0;

const int max_graph_size = 10000;

srand(20111002);
for (int i = 0; i < 100; ++i) {
clock_t ant_ms, rus_ms, rhu_ms;
test_graph((bigrand() % max_graph_size) + 1, bigrand() % max_graph_size, ant_ms, rus_ms, rhu_ms);
ant_ms_total += ant_ms;
rus_ms_total += rus_ms;
rhu_ms_total += rhu_ms;
}

std::cout
<< "ant ms " << ant_ms_total
<< " russcox ms " << rus_ms_total
<< " rhundt ms " << rhu_ms_total
<< "\n";
std::cout
<< "ant " << (100 * ant_ms_total) / rhu_ms_total
<< "% russcox " << (100 * rus_ms_total) / rhu_ms_total
<< "% rhundt " << (100 * rhu_ms_total) / rhu_ms_total
<< "%\n";
}

nodes 8834 edges 1940 loops 0 1ms : 0 2ms : 0 21ms
nodes 3112 edges 5351 loops 379 22ms : 379 107ms : 379 54ms
nodes 6031 edges 5518 loops 0 0ms : 0 0ms : 0 8ms
nodes 7929 edges 2756 loops 0 0ms : 0 1ms : 0 10ms
nodes 3841 edges 8340 loops 740 59ms : 740 292ms : 740 158ms
nodes 840 edges 608 loops 12 0ms : 12 1ms : 12 1ms
nodes 9548 edges 8367 loops 134 2ms : 134 39ms : 134 22ms
nodes 6 edges 9525 loops 6 1ms : 6 29ms : 6 7ms
nodes 1185 edges 7226 loops 581 110ms : 581 125ms : 581 288ms
nodes 8411 edges 8406 loops 0 0ms : 0 0ms : 0 11ms
nodes 3179 edges 2630 loops 0 0ms : 0 0ms : 0 5ms

. . . and so on . . .

nodes 126 edges 610 loops 52 0ms : 52 1ms : 52 3ms
nodes 758 edges 6658 loops 476 79ms : 476 81ms : 476 243ms
nodes 6553 edges 674 loops 0 0ms : 0 0ms : 0 10ms
nodes 7009 edges 4569 loops 0 1ms : 0 0ms : 0 9ms
nodes 6656 edges 157 loops 0 0ms : 0 0ms : 0 10ms
nodes 7617 edges 6113 loops 0 1ms : 0 0ms : 0 10ms
nodes 2392 edges 8981 loops 804 213ms : 804 378ms : 804 415ms
nodes 6829 edges 4110 loops 0 0ms : 0 0ms : 0 9ms
nodes 1899 edges 9336 loops 922 292ms : 922 346ms : 922 556ms
ant ms 2842 russcox ms 6680 rhundt ms 7368
ant 38% russcox 90% rhundt 100%



In this test Russel's implementation counted the loops in 90% of the time it took Robert's implementation to count the loops in the same graphs. But my implementation took only 38% of the time. My code wins. I think benchmarks can be a pretty accurate measure of performance, now I come to think about it.

I have my tongue in my cheek, in case you didn't know. I mean, these results are genuine, but I'm still wary of benchmarks. Just to confirm my code's superiority I changed max_graph_size to 50000 and reran the test. Here are the last few lines of output

                 . . . and so on . . .
nodes 29381 edges 11279 loops 0 1ms : 0 3ms : 0 42ms
nodes 8736 edges 46720 loops 3114 17774ms : 3114 28904ms : 3114 9482ms
nodes 24733 edges 26201 loops 26 4ms : 26 11ms : 26 51ms
nodes 13453 edges 35531 loops 3800 9737ms : 3800 22393ms : 3800 5261ms
nodes 40050 edges 42545 loops 1445 473ms : 1445 22515ms : 1445 704ms
nodes 9523 edges 35465 loops 4498 18358ms : 4498 22415ms : 4498 9013ms
nodes 8233 edges 1458 loops 0 0ms : 0 1ms : 0 15ms
nodes 39426 edges 13725 loops 0 2ms : 0 5ms : 0 56ms
nodes 23209 edges 46433 loops 3839 5555ms : 3839 41132ms : 3839 3542ms
nodes 20855 edges 31847 loops 1462 362ms : 1462 9499ms : 1462 565ms
nodes 19441 edges 19568 loops 0 1ms : 0 2ms : 0 31ms
nodes 22524 edges 30654 loops 1701 765ms : 1701 14722ms : 1701 907ms
nodes 16581 edges 43258 loops 4331 16599ms : 4331 50270ms : 4331 7703ms
nodes 8926 edges 40470 loops 4035 24388ms : 4035 34164ms : 4035 11233ms
nodes 41346 edges 27901 loops 0 2ms : 0 6ms : 0 58ms
nodes 11934 edges 16560 loops 944 151ms : 944 2433ms : 944 305ms
nodes 23446 edges 16700 loops 0 1ms : 0 3ms : 0 34ms
nodes 23147 edges 16770 loops 0 2ms : 0 3ms : 0 34ms
nodes 1601 edges 43744 loops 1487 4655ms : 1487 4644ms : 1487 5378ms
nodes 8003 edges 15245 loops 1144 179ms : 1144 1550ms : 1144 335ms
ant ms 312512 russcox ms 755303 rhundt ms 177003
ant 176% russcox 426% rhundt 100%


In this test my code took twice as long as Robert's and Russel's code took twice as long as mine! Robert's code wins. I'm going off benchmarks again.

Three different benchmarks. Three different winners. It would be interesting to analyse the performance characteristics of these three implementations of the same algorithm to understand what's happening. But that's for another time.

Is it readable?


From the user's perspective code readability is neither here nor there. But in my experience from the programmer's perspective, you are more likely to achieve some of the things the user does care about, like being correct, if you can reason about the code. And to reason about the code you need to understand it. And code readability aids understanding. I believe the way the code looks is of lesser importance than other goals, but it isn't of no importance.

The core analyze_loops algorithm in the Havlak paper (reproduced in the Google paper) is written in a mathematical pseudocode. It is 30 lines long. I think there is a fairly clear resemblance to that pseudocode in my implementation, which is 53 lines long. One line of pseudocode to two of real code is a pretty good ratio, I think. (Note: I don't count blank lines, comments or lines containing only a curly brace.)

In general I think the resemblance between someone's description of an algorithm and the concrete realisation of that algorithm in a real programming language that actually works on a real computer is a good thing. Sometimes the algorithm description is clear but the computer language doesn't allow the realisation to reflect it.

It may be that computer scientists who publish algorithms often describe them in a way suited to implementation in C++ because they all use C or C++ to develop their algorithms. I don't know.

In what way is an algorithm "readable?" Some algorithms are simple and you can just see what they do without much difficulty. And some you can't keep enough in your head to just "see." In that case you resort to pencil and paper and boxes joined to other boxes with arrows. You work through simple concrete examples following each step in the algorithm to see how it transforms the data. That's what I do anyway.

Candid conclusion


The code is short, but I didn't find it easy to write. I bought the Havlak paper for $15 from the ACM, but I didn't find it easy to move from the abstract description of the complex algorithm in the paper to the concrete representation of a working program; without the two Google examples by Robert Hundt and Russel Cox I may not have succeeded. I do not intend anything I've said here to imply any criticism of those two guys' work; I don't know them personally but I have no doubt they have a far deeper understanding of the Havlak algorithm than I do. But I suspect they were not very interested in the C++ implementation of this algorithm, perhaps because C++ is not a cool language. (Update: I regret saying this last bit; how do I know how interested they were? It was a snide comment. Sorry.)

One of my difficulties was a certain amount of confusion about when I should be using the basic block name, an int, and when the preorder number, also an int. Using the two typedefs for these two concepts helped me keep them separate.

The most interesting thing I learned was the UNION/FIND algorithm, which I'm sad to say, I had never before seen (or if I had I had forgotten). That path compression is really neat.

In the above narrative I may have made it sound like I implemented the algorithm, checked it against the Google results and found it was correct. This is not how it happened. I fiddled around for some time getting wrong answers and trying to figure out what mistakes I had made. So what? I believe a poet might work on a poem for years before it's published. Right matters. Right first time is overrated. Since I'm not 100% sure this code is right I'm going to stop pontificating right here.



index of blog posts

5 comments:

  1. Congratulations for this really great job !

    Our congratulations are particulary sincere for we have (almost) been through the same experiment :
    reimplementing the Hundt's C++ version.
    Our approach was different, we are developping a new langage (a new hammer ;-), so we wanted to produce exact "cut and past" of Hundt's version : "you push back an element ? ok, we do it too, next move ?", and so on...
    but with our own langage/STL...
    We even copied the getters/setters functions.
    We were not really interested in the algorithm efficiency itself but by the comparison between C++ and Cawen.
    The results are here : http://melvenn.com/en/home/
    We don't know if C++ is cool or not ;-), but we will be very happy to implement your own version as soon as we find some time to do it !

    TS & GC

    ReplyDelete
    Replies
    1. Hi TS & GC. Thank you for your kind compliment. I wish you well with your endeavour (I looked but couldn't actually find any code written in your new language). I think computer languages are fascinating and useful things. I'd love to invent one myself. I recently read that when Jerry Weinberg was asked what his greatest contribution to the field of computer science was he said "That's easy. I answered that some years ago, and my answer hasn't changed. My biggest contribution to the software world is that I never invented yet another programming language." (from http://jonjagger.blogspot.co.uk/2010/09/interview-with-jerry-weinberg.html)

      Delete
  2. He he he, the quote is excellent.
    It reminds me of this one : "The nice thing about standards is that there are so many of them to choose
    from."
    But, please believe that we did not do too much harm to the software world, since
    1) Cawen is not exactly a new langage...It should better be considered as a way to template (struct and functions) good old C (who has never been cool but does not mind, for he knows he is the King ;-)...It's some kind of "back to the future VI" language...You can download the C99 sources the Cawen preprocessor has generated here :http://www.melvenn.com/data/havlak_cawen_benchmark.tgz

    2) may be no one will never hear about it expect you, R.Hundt, my friend Thomas and me.

    regards,
    Gwenael Chailleu

    ReplyDelete
  3. Great Job, Anthony, just saw this now...

    Regards
    Robert Hundt

    ReplyDelete
    Replies
    1. Hi Robert. I'm honoured you took time to read it. Thank you for creating the original paper.

      Delete