Anthony C. Hay

09 October 2011

Loop counting: Google's multi-language-bench

C++ is not a cool language. I think it once was, back in the early 1990s; certainly there were a lot of books and magazine articles written about it at that time. Back then I wrote code mostly in C. I think it took quite a while before I really got object orientation, but eventually I became comfortable with C++, probably around the time that Java became the new cool language.

I mention this because it made me smile when a little while ago a friend showed me a paper (pdf) from Google that described how they implemented the same algorithm in several different languages and found the one written in C++ the fastest. You know how brand loyalty is apparently tied up with our self-esteem? Hey, C++ may not be cool, but when you got important things you gotta do fast you need a tool man-enough for the job, I thought. Of course, benchmarks are notoriously unreliable, and I didn't really believe this. But it still made me smile.

Maybe I shouldn't let my self-esteem get entangled with a programming language, but I think it's not unnatural to become attached to one after spending considerable time with it. For the record, I am aware of the proverb: if all you have is a hammer, everything looks like a nail. And I subscribe to the notion that learning an unfamiliar programming language may broaden your mind. A little. Which is a good thing.

Reading the Google paper I came across this:
“We find that in regards to performance, C++ wins out by a large margin. However, it also required the most extensive tuning efforts, many of which were done at a level of sophistication that would not be available to the average programmer.”
They have a high opinion of themselves, don’t they? Doubtless it’s justified. But it sounds like a challenge...

One of the promises of object oriented design is that you may be able to change the implementation of an object without changing its interface. So I wondered if I could improve on the performance of the Google C++ implementation without changing the test driver. After about 30 seconds of reading the code I decided that that sounded too much like hard work: much more fun to continue in the great programming tradition and rewrite the code. (The Google code has objects with getters that return pointers to private member variables; this restricts the freedom to change the implementation. Breaking encapsulation may be justified for run-time performance gains. The penalty is that a change to the implementation may mean a change to the user code, or a run-time penalty to maintain the old interface.)

So anyway, I wrote the version shown below. To make execution time comparisons meaningful I tried to keep the functionality identical to the Google C++ implementation from Robert Hundt's paper. Russel Cox also implemented the algorithm in C++. So when I talk about Robert's code or Russel's code these are the implementations I'm referring to.

antloops.h
#ifndef ANTLOOPS_H_INCLUDED
#define ANTLOOPS_H_INCLUDED
/* This code is an implementation of the analyze_loops algorithm from
   Nesting of Reducible and Irreducible Loops by Paul Havlak published
   in ACM TOPLAS, July 1997. (Referred to below as "Havlak's paper.")
   See http://howtowriteaprogram.blogspot.com/

   Copyright (c) 2011 Anthony C. Hay. Some Rights Reserved.
   Published under the BSD 2-clause license.
   See http://www.opensource.org/licenses/BSD-2-Clause
*/

#include <vector>
#include <stdexcept>


namespace anthay {

typedef int block_name;

/* a basic_block is a node or vertex in a control-flow graph
   (a basic block is a sequence of zero or more statements in a computer
   program that would be executed in-sequence with no jumps into or
   out of the sequence, described briefly in Havlak's paper; but that
   doesn't matter here: for our purposes it's just a node in a graph) */
class basic_block {
public:
    void add_out_edge(block_name to)
    {
        out_edges_.push_back(to);
    }

    void add_in_edge(block_name from)
    {
        in_edges_.push_back(from);
    }

    typedef std::vector<block_name> edge_vec_t;
    const edge_vec_t & out_edges() const
    {
        return out_edges_;
    }
    const edge_vec_t & in_edges() const
    {
        return in_edges_;
    }

private:
    edge_vec_t out_edges_;
    edge_vec_t in_edges_;
};


// represent one control-flow graph
class mao_cfg {
public:
    mao_cfg()
        : start_node_(undefined)
    {
    }

    void reserve(int size)
    {
        node_vec_.reserve(size);
    }

    // return a pointer to a basic_block with the given 'name';
    // pointer is guaranteed valid only until next call to any non-const function
    basic_block * create_node(block_name name)
    {
        const int size = static_cast<int>(node_vec_.size());
        if (size == 0)
            start_node_ = name; // assume first created block is graph start
        if (name >= size)
            node_vec_.resize(name + 1);
        return &node_vec_[name];
    }

    void create_edge(block_name from, block_name to)
    {
        create_node(from)->add_out_edge(to);
        create_node(to)->add_in_edge(from);
    }

    // return the name of the start node
    block_name start_node() const
    {
        if (start_node_ == undefined)
            throw std::runtime_error("undefined start node");
        return start_node_;
    }

    // return the number of nodes in the graph
    int size() const
    {
        return static_cast<int>(node_vec_.size());
    }

    // return a reference to the node with the given 'name';
    // reference guaranteed valid only until next call to any non-const function
    const basic_block & node(block_name name) const
    {
        return node_vec_[name];
    }

private:
    enum { undefined = -1 };
    block_name start_node_;
    typedef std::vector<basic_block> node_vec_t;
    node_vec_t node_vec_;
};


class loop_structure_graph {
public:
    loop_structure_graph() : count_(0) {}
    void clear() { count_ = 0; }
    void add_loop() { ++count_; }
    int size() const { return count_; }

    void calculate_nesting_level()
    {
        /* I'm not implementing this because it has no effect on
           the performance of either the Robert Hundt or Russel Cox
           C++ implementations of the Havlak algorithm. Robert's
           LoopStructureGraph::CalculateNestingLevel() is never
           called. Russel's LoopGraph::CalculateNesting() is called,
           but as Loop::isRoot is not initialised (and is therefore
           mostly "true"), it does (almost) nothing. */
    }

private:
    int count_;
};


// find loops in given 'cfg' using the analyze_loops algorithm from Havlak's
// paper; return the number of loops found
int find_havlak_loops(const mao_cfg & cfg, loop_structure_graph & lsg);

}
#endif


antloops.cpp
/*  This code is an implementation of the analyze_loops algorithm from
    Nesting of Reducible and Irreducible Loops by Paul Havlak published
    in ACM TOPLAS, July 1997. (Referred to below as "Havlak's paper.")
    See http://howtowriteaprogram.blogspot.com/

    Copyright (c) 2011 Anthony C. Hay. Some Rights Reserved.
    Published under the BSD 2-clause license.
    See http://www.opensource.org/licenses/BSD-2-Clause
*/

#include "antloops.h"
#include <algorithm>

namespace anthay {
namespace {

typedef int preorder;


// implements the DFS algorithm from Havlak's paper
class depth_first_search {
public:
    depth_first_search(const mao_cfg & cfg)
        : cfg_(&cfg), current_preorder_(0),
          node_(cfg.size(), unvisited), number_(cfg.size(), unvisited),
          last_(cfg.size())
    {
        dfs(cfg.start_node());
    }

    // return the preorder number of the last visited node, or -1
    // if no nodes were visited
    preorder last_preorder() const
    {
        return current_preorder_ - 1;
    }

    // return true iff 'p' is set to the preorder number assigned to node
    // with the given name 'a'; false means there was no path to 'a'
    bool preorder_number(block_name a, preorder & p) const
    {
        p = number_[a];
        return p != unvisited;
    }

    // return the block name with the given preorder number 'p';
    // 'p' _must_ be in the range 0 .. last_preorder()
    block_name block(preorder p) const
    {
        return node_[p];
    }

    // return true iff node 'w' is an ancestor of node 'v', where
    // w and v are the preorder numbers of the nodes
    bool is_ancestor(preorder w, preorder v) const
    {
        return w <= v && v <= last_[w];
    }

private:
    const mao_cfg * cfg_;
    preorder current_preorder_; // used in dfs() to assign a preorder number to each node visited
    std::vector<block_name> node_; // node_[a] is the block_name of node with preorder number a
    std::vector<preorder> number_; // number_[a] is the depth-first (or preorder) number of node named a
    std::vector<preorder> last_; // last_[number_[a]] is the last descendant of node a

    enum { unvisited = -1 };

    // recursive depth-first search; the DFS algorithm from Havlak
    void dfs(block_name a)
    {
        node_[current_preorder_] = a;
        number_[a] = current_preorder_++;
        // foreach node b such that there exists an edge (a,b) do...
        const basic_block::edge_vec_t & out_edges = cfg_->node(a).out_edges();
        for (unsigned i = 0; i < out_edges.size(); ++i)
            if (number_[out_edges[i]] == unvisited)
                dfs(out_edges[i]);
        last_[number_[a]] = current_preorder_ - 1;
    }
};


// implements the UNION/FIND algorithm required by Havlak's algorithm
class union_find {
public:
    // create 'size' elements named 0 .. size-1, with no connections between elements
    explicit union_find(int size)
        : element_(size)
    {
        for (int i = 0; i < size; ++i)
            element_[i] = i;
    }

    // return the name of the root element of the element named 'p'
    // (beneficial side-effect: the path from 'p' to the root is compressed)
    int find_root(int p)
    {
        // I found this a good place to learn about the UNION/FIND algorithm:
        // http://www.cs.princeton.edu/~rs/AlgsDS07/01UnionFind.pdf
        while (p != element_[p])
            p = element_[p] = element_[element_[p]];
        return p;
    }

    // merge the set containing element 'a' into the set containing element 'b'
    void unite(int a, int b)
    {
        const int ra = find_root(a);
        element_[ra] = find_root(b);
    }

private:
    std::vector<preorder> element_;
};


// return true iff given 'v' contains given element 'elem'
inline bool is_member(const std::vector<preorder> & v, preorder elem)
{
    return std::find(v.begin(), v.end(), elem) != v.end();
}

const int max_non_back_preds = 32 * 1024;

} // anonymous namespace



// find loops in given 'cfg' using the analyze_loops algorithm from Havlak's
// paper; record loops in given 'lsg' and return the number of loops found
int find_havlak_loops(const mao_cfg & cfg, loop_structure_graph & lsg)
{
    if (cfg.size() == 0)
        return 0;

    // step a:
    const depth_first_search dfs(cfg);

    // step b:
    enum node_type { non_header, self, reducible, irreducible };
    // type[p] is node_type of node with preorder number p
    std::vector<int> type(dfs.last_preorder() + 1, non_header);
    std::vector<preorder> header(cfg.size(), cfg.start_node());
    std::vector<std::vector<preorder> > back_preds(cfg.size()); // backedge predecessors
    std::vector<std::vector<preorder> > non_back_preds(cfg.size()); // non-backedge predecessors

    for (preorder w = 0; w <= dfs.last_preorder(); ++w) {
        const basic_block::edge_vec_t & in_edges = cfg.node(dfs.block(w)).in_edges();
        for (unsigned j = 0; j < in_edges.size(); ++j) {
            preorder v;
            if (!dfs.preorder_number(in_edges[j], v))
                continue; // in_edges[j] was not visited
            if (dfs.is_ancestor(w, v))
                back_preds[w].push_back(v);
            else if (!is_member(non_back_preds[w], v))
                non_back_preds[w].push_back(v);
        }
    }
    header[cfg.start_node()] = 0;

    // step c:
    union_find preorder_set(dfs.last_preorder() + 1);
    std::vector<preorder> p;
    for (preorder w = dfs.last_preorder(); w >= 0; --w) {
        p.clear();

        // step d:
        const std::vector<preorder> & back_preds_w = back_preds[w];
        typedef std::vector<preorder>::const_iterator preorder_iter;
        for (preorder_iter v = back_preds_w.begin(); v != back_preds_w.end(); ++v) {
            if (*v != w)
                p.push_back(preorder_set.find_root(*v));
            else
                type[w] = self;
        }
        if (!p.empty()) {
            type[w] = reducible;
            for (unsigned ix = 0; ix != p.size(); ++ix) {
                const preorder x = p[ix];

                // step e:
                const std::vector<preorder> & non_back_preds_set = non_back_preds[x];
                if (non_back_preds_set.size() > static_cast<unsigned>(max_non_back_preds)) {
                    // algorithm has degenerated, apparently
                    lsg.clear();
                    return 0;
                }
                for (preorder_iter y = non_back_preds_set.begin(); y != non_back_preds_set.end(); ++y) {
                    const preorder yprime = preorder_set.find_root(*y);
                    if (!dfs.is_ancestor(w, yprime)) {
                        type[w] = irreducible;
                        if (!is_member(non_back_preds[w], yprime))
                            non_back_preds[w].push_back(yprime);
                    }
                    else if (yprime != w && !is_member(p, yprime))
                        p.push_back(yprime);
                }
            }
        }

        if (!p.empty() || type[w] == self) {
            for (preorder_iter x = p.begin(); x != p.end(); ++x) {
                header[*x] = w;
                preorder_set.unite(*x, w);
            }
            lsg.add_loop();
        }
    }

    return lsg.size();
}


}


antloops_test.cpp
/*  This code is an implementation of the test driver in the paper
Loop Recognition in C++/Java/Go/Scala by Robert Hundt.
See http://code.google.com/p/multi-language-bench/

Copyright (c) 2011 Anthony C. Hay. Some Rights Reserved.
Published under the BSD 2-clause license.
See http://www.opensource.org/licenses/BSD-2-Clause
See also http://howtowriteaprogram.blogspot.com/
*/

#include "antloops.h"
using namespace anthay;

#include <iostream>
#include <cassert>


// create a diamond from 'start' to two intermediate nodes back to one
// end node; return the name (or index) of the end node
int build_diamond(mao_cfg & cfg, int start)
{
    cfg.create_edge(start, start + 1);
    cfg.create_edge(start, start + 2);
    cfg.create_edge(start + 1, start + 3);
    cfg.create_edge(start + 2, start + 3);
    return start + 3;
}


// when n > 0, create a daisy chain of nodes containing 'n' edges:
// start -> start+1 -> start+2 -> ... -> start+n
// return the name of the last node in the chain
int build_straight(mao_cfg & cfg, int start, int n)
{
    for (int i = 0; i < n; ++i)
        cfg.create_edge(start + i, start + i + 1);
    return start + n;
}


// create two diamonds in series, add three further edges to introduce
// 3 loops; return the name of the last node in the created graph
int build_base_loop(mao_cfg & cfg, int from)
{
    int header = build_straight(cfg, from, 1);
    int diamond1 = build_diamond(cfg, header);
    int d11 = build_straight(cfg, diamond1, 1);
    int diamond2 = build_diamond(cfg, d11);
    int footer = build_straight(cfg, diamond2, 1);
    cfg.create_edge(diamond2, d11);
    cfg.create_edge(diamond1, header);
    cfg.create_edge(footer, from);
    footer = build_straight(cfg, footer, 1);
    return footer;
}


int main()
{
    // build a control flow graph and count loops like the code
    // in Robert Hundt's LoopTesterApp.cpp so that execution time
    // comparisons with that code are meaningful
    std::cerr << "counting loops...";

    mao_cfg cfg;
    cfg.create_node(0);
    build_base_loop(cfg, 0);
    cfg.create_node(1);
    cfg.create_edge(0, 2);

    for (int i = 0; i < 15000; ++i) {
        loop_structure_graph lsg;
        find_havlak_loops(cfg, lsg);
    }

    int n = 2;
    for (int parlooptrees = 0; parlooptrees < 10; parlooptrees++) {
        cfg.create_edge(2, ++n);

        for (int i = 0; i < 100; i++) {
            const int top = n;
            n = build_straight(cfg, n, 1);
            for (int j = 0; j < 25; j++)
                n = build_base_loop(cfg, n);
            const int bottom = build_straight(cfg, n, 1);
            cfg.create_edge(n, top);
            n = bottom;
        }
        cfg.create_edge(n, 1);
    }

    loop_structure_graph lsg;
    const int num_loops = find_havlak_loops(cfg, lsg);
    std::cerr << "\nnum_loops = " << num_loops << "\n";
    assert(num_loops == 76001);

    for (int i = 0; i < 50; i++) {
        loop_structure_graph lsg;
        std::cerr << ".";
        find_havlak_loops(cfg, lsg);
    }

    lsg.calculate_nesting_level();
}


(I found I needed to reserve about 10 MB stack space to get Robert's code to produce a result. You may need to do the same for mine.)

Probably the most interesting difference between the implementations is that Robert's and Russel's both use explicit news and deletes and mine does not. I don't mean to say that there is anything wrong with raw pointers - I use them all the time - I simply chose a different approach to the problem, using vectors and indices instead. Since the nodes are named with successive integers, and may also be referenced by an integer 'preorder' number, a vector was the obvious data structure to use.

Using collections, such as vectors, can make the code simpler, less likely to leak memory and easier to make exception-safe.

I think the code is in the spirit of the Google paper, which was that it should be written using "the default, idiomatic data structures in each language, as well as default type systems, memory allocation schemes, and default iteration constructs" - the basic facilities provided in the language.

Is it correct?


If it were not a requirement to be correct we could give an answer to any problem in an arbitrarily short time. So let's first ask if the code gives the right answer.

My code calculates there are 76,001 loops in the test graph. Robert's gives 76,002 and Russel's gives 76,000. Oh dear. Which is right?

Playing with Robert's code I found that it always exaggerates the number of loops by exactly one; for example it returns 1 for a graph containing no loops at all. So adjusting for this means mine and Robert's agree.

I corresponded briefly with Russel and he was kind enough to respond to my questions. It turns out there is a slight difference between Russel's graph and Robert's in the first few nodes: in Robert's there is an edge connecting nodes 1 -> 2, in Russel's there isn't. This difference means there is one less loop in Russel's graph, so his result is correct for his graph.

Looking at the code that builds the graph I would have said there were ((((3 x 25) + 1) x 100) + 1) x 10 = 76,010. But I'd be wrong, probably because overlapping loops are not counted more than once: the ten tree-closing loops count as one, so the final 'plus one' applies once overall, giving ((((3 x 25) + 1) x 100) x 10) + 1 = 76,001.

In short, I believe there are 76,001 loops in the Robert Hundt C++ graph and my implementation counts them correctly.

I like to remember I have an amazing symbol manipulation tool at my command that I can use to help me check my own work. (Not my own brain, lol.) I mean I may be able to write some code to test my code. Since I have the two Google implementations to test against the obvious thing to do is to build some graphs and compare the loop counts returned by each implementation. I wrapped each in its own namespace and linked all three into one test program.

The first idea I had was to generate graphs containing a random number of nodes with a random number of edges between random nodes, like this:

int bigrand()
{
    // note: rand() * rand() can overflow an int on platforms
    // where RAND_MAX is large
    return rand() * rand();
}

// generate three representations of the same random graph
void generate_random_graph(
    rhundt::MaoCFG & rhcfg,
    russcox::CFG & russcfg,
    anthay::mao_cfg & antcfg,
    int num_nodes,
    int num_edges)
{
    // create all nodes; nodes are named 0..num_nodes-1
    for (block_name b = 0; b < num_nodes; ++b) {
        rhcfg.CreateNode(b);
        russcfg.NewBlock();
        antcfg.create_node(b);
    }

    // link pairs of nodes chosen at random
    for (int i = 0; i < num_edges; ++i) {
        const block_name a = bigrand() % num_nodes;
        const block_name b = bigrand() % num_nodes;
        // we don't care if we link a node to itself
        // or duplicate an existing link

        new rhundt::BasicBlockEdge(&rhcfg, a, b);
        russcfg.Connect(russcfg.block[a], russcfg.block[b]);
        antcfg.create_edge(a, b);
    }
}

// create a random graph and count the loops using all three implementations
void test_graph(int num_nodes, int num_edges)
{
    rhundt::MaoCFG rhcfg;
    russcox::CFG russcfg;
    anthay::mao_cfg antcfg;

    generate_random_graph(rhcfg, russcfg, antcfg, num_nodes, num_edges);
    std::cerr
        << "nodes " << std::setw(6) << num_nodes
        << " edges " << std::setw(6) << num_edges
        << " loops ";

    anthay::loop_structure_graph antlsg;
    const int ant_loops = anthay::find_havlak_loops(antcfg, antlsg);
    std::cerr << std::setw(5) << ant_loops;

    russcox::LoopGraph russlsg;
    russcox::LoopFinder f;
    f.FindLoops(&russcfg, &russlsg);
    const int rcox_loops = russlsg.loop.size();
    std::cerr << " " << std::setw(5) << rcox_loops;

    rhundt::LoopStructureGraph rhlsg;
    // (to make consistent with others shift result from 1..n to 0..n-1)
    const int rhundt_loops = rhundt::FindHavlakLoops(&rhcfg, &rhlsg) - 1;
    std::cerr << " " << std::setw(5) << rhundt_loops;

    if (rhundt_loops != ant_loops || rcox_loops != ant_loops)
        std::cerr << " <=== disagreement";

    std::cerr << "\n";
}

int main()
{
    const int max_graph_size = 10000;

    srand(20111002);
    for (;;)
        test_graph((bigrand() % max_graph_size) + 1, bigrand() % max_graph_size);
}



Here is the output from a typical run:

nodes   8834 edges   1940 loops     0     0     0
nodes   3112 edges   5351 loops   379   379   379
nodes   6031 edges   5518 loops     0     0     0
nodes   7929 edges   2756 loops     0     0     0
nodes   3841 edges   8340 loops   740   740   740
nodes    840 edges    608 loops    12    12    12
nodes   9548 edges   8367 loops   134   134   134
nodes      6 edges   9525 loops     6     6     6
nodes   1185 edges   7226 loops   581   581   581
nodes   8411 edges   8406 loops     0     0     0
nodes   3179 edges   2630 loops     0     0     0
nodes   9249 edges   6192 loops     0     0     0
nodes   4937 edges   9384 loops   657   657   657
nodes   6911 edges   1174 loops     0     0     0
nodes   6541 edges   5294 loops     0     0     0
nodes   4117 edges   6625 loops   324   324   324
nodes   6245 edges    449 loops     0     0     0
nodes   4289 edges   4160 loops     0     0     0
nodes   9031 edges   5296 loops     0     0     0
nodes   3083 edges   1959 loops     0     0     0
nodes    321 edges   7315 loops   270   270   270
nodes   6165 edges   4654 loops     5     5     5
nodes   4377 edges    877 loops     0     0     0
nodes   5009 edges   1100 loops     0     0     0
nodes   6081 edges   7288 loops   144   144   144
nodes   7559 edges   5344 loops     0     0     0
nodes   2729 edges   2896 loops     0     0     0
nodes   7425 edges   6250 loops    65    65    65
nodes   5134 edges   9532 loops   795   795   795
nodes   7571 edges   3600 loops     0     0     0
nodes   4475 edges   8744 loops   648   648   648
nodes   1182 edges   3650 loops   328   328   328
nodes   4268 edges   4450 loops    95    95    95
nodes    761 edges    284 loops     0     0     0
nodes   7021 edges     50 loops     0     0     0
nodes   6036 edges   7762 loops   370   370   370
nodes   4816 edges   6359 loops   327   327   327
nodes   2417 edges   9300 loops  1207  1207  1207
nodes    289 edges   5292 loops   259   259   259
. . . and so on . . .


During several hours running this test there were no disagreements between any of the implementations (after adjusting Robert's by -1).

The second idea I had was to generate graphs with all possible combinations of connections between nodes. This is only going to be practical for small graphs: for one node there are two possible graphs (either the node has no outgoing edges or it has an outgoing edge to itself); for two nodes there are 16 possible graphs, ranging from the graph where neither node has any outgoing edges to the graph where each node has an outgoing edge to itself and to the other node. I think for n nodes there will be 2^(n^2) possible graphs (I'm assuming no duplicate edges). For 5 nodes there are 33,554,432 possible graphs and for 6 there are 68,719,476,736, so 5 nodes is probably about the limit of what I could reasonably check. Here's what I wrote to test all possible graphs up to 5 nodes:

// generate all possible graphs having 'n' nodes and test
// all loop counting implementations return the same count
void test_all_possible_graphs(int n)
{
    assert(1 <= n && n <= 5);
    const int end = 1 << (n * n); // we will test this number of graphs
    for (int bitmap = 0; bitmap < end; ++bitmap) {
        /* think of bitmap as n groups of n bits; each node has
           its own group, each bit in that group determines
           whether there is an out edge from that node to the
           node corresponding to that bit */

        // create all nodes
        anthay::mao_cfg antcfg;
        rhundt::MaoCFG huncfg;
        for (int a = 0; a < n; ++a) {
            antcfg.create_node(a);
            huncfg.CreateNode(a);
        }

        // add edges only where the corresponding bit in bitmap is set
        int bit = 1;
        for (int a = 0; a < n; ++a) {
            for (int b = 0; b < n; ++b) {
                if (bitmap & bit) {
                    antcfg.create_edge(a, b);
                    new rhundt::BasicBlockEdge(&huncfg, a, b);
                }
                bit <<= 1;
            }
        }

        // count loops
        anthay::loop_structure_graph antlsg;
        const int ant_loops = anthay::find_havlak_loops(antcfg, antlsg);
        rhundt::LoopStructureGraph hunlsg;
        const int rhundt_loops = rhundt::FindHavlakLoops(&huncfg, &hunlsg) - 1;

        // compare counts
        if (rhundt_loops != ant_loops)
            std::cout << "disagreement for 0x" << std::hex << bitmap << "\n";
        if (bitmap % 1000000 == 0)
            std::cerr << '.';
    }
}

int main()
{
    for (int n = 1; n <= 5; ++n)
        test_all_possible_graphs(n);
}


It takes about 10 minutes to count the loops in all possible graphs having 5 nodes or fewer and confirm that in all cases my implementation returns the same loop count as Robert's. (I also tested against Russel's implementation, but that plotted a stairway to heaven on my memory usage monitor: Russel didn't free any of the memory he allocated. The memory usage for Hundt+Hay was a flat line.)

I'm sure there are other tests I could do. I could count loops in more 'hand-built' graphs like the original 76,001 looper, for example. But at this point I'm going to say that my implementation is an accurate analog of the original Robert Hundt code.

Is it efficient?


Actually, let's just ask if it's fast. Here are the timings for the three implementations running on my 2008 Mac with a 2.8 GHz Core 2 Duo processor:


$ g++
i686-apple-darwin11-llvm-g++-4.2: no input files
$
$
$ g++ -O3 LoopTesterApp.cpp mao-loops.cpp
$ time ./a.out
Welcome to LoopTesterApp, C++ edition
Constructing App...
Constructing Simple CFG...
15000 dummy loops
Constructing CFG...
Performing Loop Recognition
1 Iteration
Another 50 iterations...
..................................................
Found 76002 loops (including artificial root node)(3800100)
loop-0, nest: 0, depth: 0

real 0m31.326s
user 0m29.501s
sys 0m1.503s
$
$
$ g++ -O3 russcox.cpp
$ time ./a.out
# of loops: 76000 (including 1 artificial root node)

real 0m5.497s
user 0m5.355s
sys 0m0.086s
$
$
$ g++ -O3 antloops_test.cpp antloops.cpp
$ time ./a.out
counting loops...
num_loops = 76001
..................................................
real 0m6.511s
user 0m6.009s
sys 0m0.279s
$


In summary:

rhundt 31.3 seconds = 100% of rhundt
anthay 6.5 seconds = 20.8% of rhundt
russcox 5.5 seconds = 17.6% of rhundt

(The above is an update to the original post after I upgraded to GCC 4.2)

So Russel's code wins, but mine is a creditable runner-up.

It's interesting watching the random graph test program run: occasionally a graph will be generated where it looks like Robert's code will count the loops quicker than either Russel's or mine. I'm wary of benchmarks. People say that, like statistics, you can find a benchmark to support any point of view. Hang on!... I wonder if... Voilà!...

I added code to the random graph test to time how long each implementation took to count loops and ran it for 100 graphs. The code and the results are shown below.

// create a random graph and count the loops using all three implementations
void test_graph(
    int num_nodes,
    int num_edges,
    clock_t & ant_ms,
    clock_t & rus_ms,
    clock_t & rhu_ms)
{
    rhundt::MaoCFG rhcfg;
    russcox::CFG russcfg;
    anthay::mao_cfg antcfg;

    generate_random_graph(rhcfg, russcfg, antcfg, num_nodes, num_edges);
    std::cout
        << "nodes " << std::setw(6) << num_nodes
        << " edges " << std::setw(6) << num_edges
        << " loops ";

    static_assert(CLOCKS_PER_SEC >= 1000, "code assumes at least millisecond clock() resolution");

    clock_t start = clock() / (CLOCKS_PER_SEC / 1000);
    anthay::loop_structure_graph antlsg;
    const int ant_loops = anthay::find_havlak_loops(antcfg, antlsg);
    clock_t stop = clock() / (CLOCKS_PER_SEC / 1000);
    ant_ms = stop - start;
    std::cout << std::setw(5) << ant_loops << " " << std::setw(5) << ant_ms << "ms";

    start = clock() / (CLOCKS_PER_SEC / 1000);
    russcox::LoopGraph russlsg;
    russcox::LoopFinder f;
    f.FindLoops(&russcfg, &russlsg);
    const int rcox_loops = russlsg.loop.size();
    stop = clock() / (CLOCKS_PER_SEC / 1000);
    rus_ms = stop - start;
    std::cout << " : " << std::setw(5) << rcox_loops << " " << std::setw(5) << rus_ms << "ms";

    start = clock() / (CLOCKS_PER_SEC / 1000);
    rhundt::LoopStructureGraph rhlsg;
    // (to make consistent with others shift result from 1..n to 0..n-1)
    const int rhundt_loops = rhundt::FindHavlakLoops(&rhcfg, &rhlsg) - 1;
    stop = clock() / (CLOCKS_PER_SEC / 1000);
    rhu_ms = stop - start;
    std::cout << " : " << std::setw(5) << rhundt_loops << " " << std::setw(5) << rhu_ms << "ms";

    if (rhundt_loops != ant_loops || rcox_loops != ant_loops)
        std::cout << " <=== disagreement";

    std::cout << "\n";
}

int main()
{
    clock_t ant_ms_total = 0;
    clock_t rus_ms_total = 0;
    clock_t rhu_ms_total = 0;

    const int max_graph_size = 10000;

    srand(20111002);
    for (int i = 0; i < 100; ++i) {
        clock_t ant_ms, rus_ms, rhu_ms;
        test_graph((bigrand() % max_graph_size) + 1, bigrand() % max_graph_size, ant_ms, rus_ms, rhu_ms);
        ant_ms_total += ant_ms;
        rus_ms_total += rus_ms;
        rhu_ms_total += rhu_ms;
    }

    std::cout
        << "ant ms " << ant_ms_total
        << " russcox ms " << rus_ms_total
        << " rhundt ms " << rhu_ms_total
        << "\n";
    std::cout
        << "ant " << (100 * ant_ms_total) / rhu_ms_total
        << "% russcox " << (100 * rus_ms_total) / rhu_ms_total
        << "% rhundt " << (100 * rhu_ms_total) / rhu_ms_total
        << "%\n";
}

nodes   8834 edges   1940 loops     0     1ms :     0     2ms :     0    21ms
nodes   3112 edges   5351 loops   379    22ms :   379   107ms :   379    54ms
nodes   6031 edges   5518 loops     0     0ms :     0     0ms :     0     8ms
nodes   7929 edges   2756 loops     0     0ms :     0     1ms :     0    10ms
nodes   3841 edges   8340 loops   740    59ms :   740   292ms :   740   158ms
nodes    840 edges    608 loops    12     0ms :    12     1ms :    12     1ms
nodes   9548 edges   8367 loops   134     2ms :   134    39ms :   134    22ms
nodes      6 edges   9525 loops     6     1ms :     6    29ms :     6     7ms
nodes   1185 edges   7226 loops   581   110ms :   581   125ms :   581   288ms
nodes   8411 edges   8406 loops     0     0ms :     0     0ms :     0    11ms
nodes   3179 edges   2630 loops     0     0ms :     0     0ms :     0     5ms

. . . and so on . . .

nodes    126 edges    610 loops    52     0ms :    52     1ms :    52     3ms
nodes    758 edges   6658 loops   476    79ms :   476    81ms :   476   243ms
nodes   6553 edges    674 loops     0     0ms :     0     0ms :     0    10ms
nodes   7009 edges   4569 loops     0     1ms :     0     0ms :     0     9ms
nodes   6656 edges    157 loops     0     0ms :     0     0ms :     0    10ms
nodes   7617 edges   6113 loops     0     1ms :     0     0ms :     0    10ms
nodes   2392 edges   8981 loops   804   213ms :   804   378ms :   804   415ms
nodes   6829 edges   4110 loops     0     0ms :     0     0ms :     0     9ms
nodes   1899 edges   9336 loops   922   292ms :   922   346ms :   922   556ms
ant ms 2842 russcox ms 6680 rhundt ms 7368
ant 38% russcox 90% rhundt 100%



In this test Russel's implementation counted the loops in 90% of the time it took Robert's implementation to count the loops in the same graphs. But my implementation took only 38% of the time. My code wins. I think benchmarks can be a pretty accurate measure of performance, now I come to think about it.

I have my tongue in my cheek, in case you didn't know. I mean, these results are genuine, but I'm still wary of benchmarks. Just to confirm my code's superiority I changed max_graph_size to 50000 and reran the test. Here are the last few lines of output:

                 . . . and so on . . .
nodes  29381 edges  11279 loops     0     1ms :     0     3ms :     0    42ms
nodes   8736 edges  46720 loops  3114 17774ms :  3114 28904ms :  3114  9482ms
nodes  24733 edges  26201 loops    26     4ms :    26    11ms :    26    51ms
nodes  13453 edges  35531 loops  3800  9737ms :  3800 22393ms :  3800  5261ms
nodes  40050 edges  42545 loops  1445   473ms :  1445 22515ms :  1445   704ms
nodes   9523 edges  35465 loops  4498 18358ms :  4498 22415ms :  4498  9013ms
nodes   8233 edges   1458 loops     0     0ms :     0     1ms :     0    15ms
nodes  39426 edges  13725 loops     0     2ms :     0     5ms :     0    56ms
nodes  23209 edges  46433 loops  3839  5555ms :  3839 41132ms :  3839  3542ms
nodes  20855 edges  31847 loops  1462   362ms :  1462  9499ms :  1462   565ms
nodes  19441 edges  19568 loops     0     1ms :     0     2ms :     0    31ms
nodes  22524 edges  30654 loops  1701   765ms :  1701 14722ms :  1701   907ms
nodes  16581 edges  43258 loops  4331 16599ms :  4331 50270ms :  4331  7703ms
nodes   8926 edges  40470 loops  4035 24388ms :  4035 34164ms :  4035 11233ms
nodes  41346 edges  27901 loops     0     2ms :     0     6ms :     0    58ms
nodes  11934 edges  16560 loops   944   151ms :   944  2433ms :   944   305ms
nodes  23446 edges  16700 loops     0     1ms :     0     3ms :     0    34ms
nodes  23147 edges  16770 loops     0     2ms :     0     3ms :     0    34ms
nodes   1601 edges  43744 loops  1487  4655ms :  1487  4644ms :  1487  5378ms
nodes   8003 edges  15245 loops  1144   179ms :  1144  1550ms :  1144   335ms
ant ms 312512 russcox ms 755303 rhundt ms 177003
ant 176% russcox 426% rhundt 100%


In this test my code took twice as long as Robert's and Russel's code took twice as long as mine! Robert's code wins. I'm going off benchmarks again.

Three different benchmarks. Three different winners. It would be interesting to analyse the performance characteristics of these three implementations of the same algorithm to understand what's happening. But that's for another time.

Is it readable?


From the user's perspective code readability is neither here nor there. But in my experience, from the programmer's perspective you are more likely to achieve some of the things the user does care about, such as correctness, if you can reason about the code. To reason about the code you need to understand it, and readability aids understanding. I believe the way the code looks is less important than other goals, but it isn't unimportant.

The core analyze_loops algorithm in the Havlak paper (reproduced in the Google paper) is written in a mathematical pseudocode. It is 30 lines long. I think there is a fairly clear resemblance to that pseudocode in my implementation, which is 53 lines long. One line of pseudocode to two of real code is a pretty good ratio, I think. (Note: I don't count blank lines, comments or lines containing only a curly brace.)

In general I think the resemblance between someone's description of an algorithm and the concrete realisation of that algorithm in a real programming language that actually works on a real computer is a good thing. Sometimes the algorithm description is clear but the computer language doesn't allow the realisation to reflect it.

It may be that computer scientists who publish algorithms often describe them in a way suited to implementation in C++ because they all use C or C++ to develop their algorithms. I don't know.

In what way is an algorithm "readable"? Some algorithms are simple and you can just see what they do without much difficulty. And some you can't keep enough of in your head to just "see". In that case you resort to pencil and paper and boxes joined to other boxes with arrows. You work through simple concrete examples, following each step in the algorithm to see how it transforms the data. That's what I do, anyway.

Candid conclusion


The code is short, but I didn't find it easy to write. I bought the Havlak paper for $15 from the ACM, but I didn't find it easy to move from the abstract description of the complex algorithm in the paper to the concrete representation of a working program; without the two Google examples by Robert Hundt and Russel Cox I might not have succeeded. I do not intend anything I've said here to imply any criticism of those two guys' work; I don't know them personally but I have no doubt they have a far deeper understanding of the Havlak algorithm than I do. But I suspect they were not very interested in the C++ implementation of this algorithm, perhaps because C++ is not a cool language. (Update: I regret saying this last bit; how do I know how interested they were? It was a snide comment. Sorry.)

One of my difficulties was a certain amount of confusion about when I should be using the basic block name, an int, and when the preorder number, also an int. Using the two typedefs for these two concepts helped me keep them separate.
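Something like the following sketch illustrates the idea (the names here are illustrative, not necessarily the ones in my code): giving the two int-valued concepts distinct type names means a function signature documents which kind of number it expects.

```cpp
#include <vector>

// Illustrative names only: two different int-valued concepts get
// distinct type names, so signatures say which number they expect.
typedef int block_name;      // identifies a basic block in the flow graph
typedef int preorder_number; // a block's position in the DFS preorder

// Map a preorder number back to the block it names.
block_name block_of(const std::vector<block_name>& preorder_to_block,
                    preorder_number n)
{
    return preorder_to_block[n];
}
```

Of course a typedef gives no compile-time protection (both names are still int); it only helps the human reader, which was the point.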

The most interesting thing I learned was the UNION/FIND algorithm, which, I'm sad to say, I had never seen before (or if I had, I had forgotten). That path compression is really neat.
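Here is a minimal sketch of the bare technique (not the benchmark code itself): each element points at a parent, find follows the chain to the set's representative, and path compression flattens the chain as it goes, so later finds are nearly constant time.

```cpp
#include <vector>

// A minimal UNION/FIND (disjoint-set) sketch with path compression.
class union_find {
    std::vector<int> parent;
public:
    explicit union_find(int n) : parent(n) {
        for (int i = 0; i < n; ++i)
            parent[i] = i; // each element starts in its own set
    }
    // Return the representative of x's set, flattening the path as we go.
    int find(int x) {
        if (parent[x] != x)
            parent[x] = find(parent[x]); // path compression
        return parent[x];
    }
    // Merge the sets containing a and b.
    void unite(int a, int b) {
        parent[find(a)] = find(b);
    }
};
```

After `unite(0, 1)` and `unite(1, 2)`, `find(0)` and `find(2)` return the same representative; an element never united with them, say 3, keeps its own.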

In the above narrative I may have made it sound like I implemented the algorithm, checked it against the Google results and found it was correct. That is not how it happened. I fiddled around for some time getting wrong answers and trying to figure out what mistakes I had made. So what? I believe a poet might work on a poem for years before it's published. Getting it right matters; getting it right first time is overrated. Since I'm not 100% sure this code is right, I'm going to stop pontificating right here.




5 comments:

  1. Congratulations on this really great job!

    Our congratulations are particularly sincere, for we have (almost) been through the same experiment:
    reimplementing Hundt's C++ version.
    Our approach was different: we are developing a new language (a new hammer ;-), so we wanted to produce an exact "cut and paste" of Hundt's version: "you push back an element? OK, we do it too, next move?", and so on...
    but with our own language/STL...
    We even copied the getter/setter functions.
    We were not really interested in the algorithm's efficiency itself but in the comparison between C++ and Cawen.
    The results are here: http://melvenn.com/en/home/
    We don't know if C++ is cool or not ;-), but we will be very happy to implement your own version as soon as we find some time to do it!

    TS & GC

    1. Hi TS & GC. Thank you for your kind compliment. I wish you well with your endeavour (I looked but couldn't actually find any code written in your new language). I think computer languages are fascinating and useful things. I'd love to invent one myself. I recently read that when Jerry Weinberg was asked what his greatest contribution to the field of computer science was he said "That's easy. I answered that some years ago, and my answer hasn't changed. My biggest contribution to the software world is that I never invented yet another programming language." (from http://jonjagger.blogspot.co.uk/2010/09/interview-with-jerry-weinberg.html)

  2. He he he, the quote is excellent.
    It reminds me of this one: "The nice thing about standards is that there are so many of them to choose from."
    But please believe that we did not do too much harm to the software world, since
    1) Cawen is not exactly a new language... it is better thought of as a way to template (structs and functions) good old C (which has never been cool but does not mind, for it knows it is the King ;-)... It's some kind of "Back to the Future VI" language... You can download the C99 sources the Cawen preprocessor has generated here: http://www.melvenn.com/data/havlak_cawen_benchmark.tgz

    2) maybe no one will ever hear about it except you, R. Hundt, my friend Thomas and me.

    regards,
    Gwenael Chailleu

  3. Great Job, Anthony, just saw this now...

    Regards
    Robert Hundt

    1. Hi Robert. I'm honoured you took time to read it. Thank you for creating the original paper.
