How to Write a Program: Write a C function taking two strings...

A guy called Tommy M. McGuire attended an interview with Google and was asked to write some code on the whiteboard:

My big question: "Write a C function taking two strings and returning a string containing one copy of the characters that occur in both." Hint: Spend some time looking at one of the strings and use an array of flags.

Given that hint, I guess we can assume the strings contain only 8-bit single-byte characters and the sort of thing the Google interviewers might have been looking for was something like this:

// return one copy of all chars that occur in both 'a' and 'b'
// e.g. common_chars("common", "man") -> "mn"
std::string common_chars(const char * a, const char * b)
{
    std::string result;
    char occurs_in_a[UCHAR_MAX + 1] = {0};
    // set occurs_in_a[c] to 1 for all values of c that occur in string a
    while (*a)
        occurs_in_a[static_cast<unsigned char>(*a++)] = 1;
    // check every character in b to see if it also occurs in a
    for (; *b; ++b) {
        if (occurs_in_a[static_cast<unsigned char>(*b)]) {
            result += *b; // add char to the result
            occurs_in_a[static_cast<unsigned char>(*b)] = 0; // never add it again
        }
    }
    return result;
}

Yes, I wrote it in C++. (I'm the kind of guy Linus Torvalds loves to hate and Peter Seibel doesn't understand.) So how would we get the result back to the caller in C? I can think of about a dozen ways, all of which are also available to me in C++, and all of which are more tedious and error-prone both for the implementation and for the callers of the function. Of course, if it made sense to use one of those methods in the context in which this function was to operate, then I would use it.

Going back to the whiteboard: if I was standing in front of a hyper-intelligent Googler I would probably wet my pants and I suspect I would have written the function differently: I wouldn’t have commented it; I probably would have written 256 instead of UCHAR_MAX + 1; more seriously, I may have forgotten to cast away the (possible) sign of the char when used as an array index. All of these things would have occurred to me only after I'd left the building. (And, since the requirement was for a C function, I guess I would have had to fiddle around with buffers and buffer sizes and null terminators and shit.)

In the real world I would unit test this function with something like this:

// return true iff 'a' and 'b' contain exactlty the same characters
// ignoring the character ordering; e.g. same("ab", "ba") -> true
bool same(const std::string & a, const std::string & b)
{
    std::vector<char> va(a.c_str(), a.c_str() + a.size());
    std::sort(va.begin(), va.end());
    std::vector<char> vb(b.c_str(), b.c_str() + b.size());
    std::sort(vb.begin(), vb.end());
    return va == vb;
}


int main()
{
    char all_chars[256];
    for (int i = 0; i < 256; ++i)
        all_chars[i] = static_cast<unsigned char>(i + 1);
    
    struct test { const char * a; const char * b; const char * expected; };
    const test table[] = {
        {"",                "",                 ""},
        {"a",               "",                 ""},
        {"",                "a",                ""},
        {"a",               "b",                ""},
        {"a",               "a",                "a"},
        {"a",               "aaa",              "a"},
        {"aaa",             "a",                "a"},
        {"a",               "ab",               "a"},
        {"b",               "ab",               "b"},
        {"ab",              "a",                "a"},
        {"ab",              "b",                "b"},
        {"ab",              "ab",               "ab"},
        {"ab",              "ba",               "ab"},
        {"ab",              "cat",              "a"},
        {"ab",              "abba",             "ab"},
        {"abba",            "abba",             "ab"},
        {"aabbcc",          "cab",              "abc"},
        {"cab",             "abcabc",           "abc"},
        {"common",          "man",              "mn"},
        {"man",             "common",           "mn"},
        {"fox",             "in socks",         "o"},
        {"play",            "misty",            "y"},
        {"the rain",        "in spain",         " ain"},
        {"don't you know",  "you'll stain the", "you' tn"},
        {"Transmission",    "C++",              ""},
        {all_chars,         "ABC",              "ABC"},
        {"ABC",             all_chars,          "ABC"},
        {all_chars,         all_chars,          all_chars},
        
        {0, 0, 0} // end of table
    };
    int tests(0), failures(0);
    for (const test * t = table; t->a; ++t) {
        ++tests;
        const std::string result(common_chars(t->a, t->b));
        if (!same(result, t->expected)) {
            ++failures;
            std::cout
                << "Failed for f('" << t->a
                << "', '" << t->b
                << "') expected '" << t->expected
                << "' got '" << result
                << "'\n";
        }
    }
    std::cout
        << "total tests " << tests
        << ", total failures " << failures
        << "\n";

    return failures ? EXIT_FAILURE : EXIT_SUCCESS;
}

I've been writing code for a long time - long enough to know that I make mistakes. I have found unit tests to be an excellent way to pick up some of those mistakes as well as being useful for other reasons I'll write about another time.

Of course, passing a bunch of unit tests does not mean the code doesn't have horrible bugs in it. For example, I tested this code under a compiler where chars are signed. If you remove the static_cast<unsigned char>() casts the unit tests still all pass. What this means is that when *a is '\x80' the expression occurs_in_a[*a++] = 1 is similar to occurs_in_a[-128] = 1 and this sets a byte to 1 on the stack 128 bytes before the occurs_in_a array. But that's alright (NOT!) because if (occurs_in_a[*b]) will look for it in the same place, and find it there, which is why the tests still pass. But the stack is now corrupted, the consequences of which may be nil or catastrophic, immediate or postponed.

index of blog posts

4 comments:

UnknownNovember 30, 2010 at 11:18 PM
You clearly know way more C++ than I do. Is there a reason you used a char array (occurs_in_a) rather than a STL set?
Anthony C. HayDecember 1, 2010 at 11:01 AM
Hello Tommy. It's nice, if weird, to 'meet' you.

No reason, I guess I'm just old-school. (-: If we were supporting Unicode maybe a set would be the way to go.

It's a hell of a lot easier to pick a question that takes your fancy and spend as long as you feel like thinking about it than to perform in an interview.
AnonymousApril 7, 2011 at 3:44 PM
It's C++. Question was about C function!
Anthony C. HayApril 7, 2011 at 6:25 PM
You are quite right. I happened upon Tommy's post and I thought how would I do that in the language I use most (C++) and how would I do it in C (a language I haven't used for years); and I thought std::string takes care of a lot of housekeeping that's error-prone in C and why do people hate C++ and what would it be like to be interviewed at Google? And it ended up here and I get to talk to people all over the world, which is nice.

About Me

How to Write a Program

14 February 2010

Write a C function taking two strings...

4 comments: