25 March 2009

bytefluo

I was reading Gulliver’s Travels to one of my children tonight and we came across the term Big-endian in a passage about the reason for the war between Lilliputians and Blefuscudians (the latter crack their eggs at the big end, the former at the little end). Although I read Gulliver’s Travels many years ago I had obviously forgotten this because I assumed that big- and little-endian were just computer jargon. But it seems Swift coined the term.

It is computed that eleven thousand persons have at several times suffered death, rather than submit to break their eggs at the smaller end. Many hundred large volumes have been published upon this controversy: but the books of the Big-endians have been long forbidden, and the whole party rendered incapable by law of holding employments. During the course of these troubles, the emperors of Blefusca did frequently expostulate by their ambassadors, accusing us of making a schism in religion, by offending against a fundamental doctrine of our great prophet Lustrog, in the fifty-fourth chapter of the Blundecral (which is their Alcoran). This, however, is thought to be a mere strain upon the text; for the words are these: 'that all true believers break their eggs at the convenient end.'

Gulliver’s Travels, Chapter 4

Anyway, following my previous post about UUID and Byte Order I wanted to advertise a way to break your eggs at the convenient end, namely a C++ class called bytefluo. Here’s how you might use it to read big- and little-endian UUIDs




#include "bytefluo.h"

namespace RFC_4122 {
// types borrowed from RFC 4122

typedef unsigned long unsigned32;
typedef unsigned short unsigned16;
typedef unsigned char unsigned8;
typedef unsigned char byte;

typedef struct {
unsigned32 time_low;
unsigned16 time_mid;
unsigned16 time_hi_and_version;
unsigned8 clock_seq_hi_and_reserved;
unsigned8 clock_seq_low;
byte node[6];
} uuid_t;
}//namespace RFC_4122

namespace {
// read UUID structure from given bytefluo stream
RFC_4122::uuid_t read_common(bytefluo & buf)
{
// note that this code will read the data in the correct byte order
// regardless of the natural byte order of this system; also reading
// the structure one field at a time makes it immune to possible
// compiler-specific structure packing for machine word alignment
RFC_4122::uuid_t result;
buf >> result.time_low
>> result.time_mid
>> result.time_hi_and_version
>> result.clock_seq_hi_and_reserved
>> result.clock_seq_low;
buf.read(result.node, sizeof(result.node));
return result;
}
}//anonymous namespace

// read big-endian encoded UUID structure from given 16-byte buffer
RFC_4122::uuid_t uuid_from_16_bytes_big_endian(const unsigned char * bytes)
{
bytefluo buf(bytes, bytes + 16, bytefluo::big_endian);
return read_common(buf);
}

// read little-endian encoded UUID structure from given 16-byte buffer
RFC_4122::uuid_t uuid_from_16_bytes_little_endian(const unsigned char * bytes)
{
bytefluo buf(bytes, bytes + 16, bytefluo::little_endian);
return read_common(buf);
}

int main()
{
const unsigned char bytes[16] = {
0x00, 0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77,
0x88, 0x99, 0xAA, 0xBB, 0xCC, 0xDD, 0xEE, 0xFF
};
using RFC_4122::uuid_t;

// first read raw data assuming big-endian byte ordering
{
const uuid_t uuid(uuid_from_16_bytes_big_endian(bytes));
assert(uuid.time_low == 0x00112233);
assert(uuid.time_mid == 0x4455);
assert(uuid.time_hi_and_version == 0x6677);
assert(uuid.clock_seq_hi_and_reserved == 0x88);
assert(uuid.clock_seq_low == 0x99);
assert(::memcmp(uuid.node, "\xAA\xBB\xCC\xDD\xEE\xFF", 6) == 0);
}

// now read the same raw data assuming little-endian byte ordering
{
const uuid_t uuid(uuid_from_16_bytes_little_endian(bytes));
assert(uuid.time_low == 0x33221100);
assert(uuid.time_mid == 0x5544);
assert(uuid.time_hi_and_version == 0x7766);
assert(uuid.clock_seq_hi_and_reserved == 0x88);
assert(uuid.clock_seq_low == 0x99);
assert(::memcmp(uuid.node, "\xAA\xBB\xCC\xDD\xEE\xFF", 6) == 0);
}
}

Notice how bytefluo abstracts away the details of reading big- and little-endian scalars? Get your copy here!


 

index of blog posts

13 March 2009

UUID and Byte Order

Say you have a UUID stored in memory as an array of 16 contiguous bytes like this

const unsigned char bytes[16] = {
    0x00, 0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77,
    0x88, 0x99, 0xAA, 0xBB, 0xCC, 0xDD, 0xEE, 0xFF
};

And say you need a function that can take a pointer to the first byte of such an array and return the equivalent curly-bracketed string representation of that UUID

std::string uuid_string(const unsigned char *);

You know the function should return a string something like this
    {XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX}

But do you know precisely what the string should be for the array given above?

UUIDs are documented in RFC 4122. Here is an extract from that RFC:
In the absence of explicit application or presentation protocol
specification to the contrary, a UUID is encoded as a 128-bit object,
as follows:

The fields are encoded as 16 octets, with the sizes and order of the
fields defined above, and with each field encoded with the Most
Significant Byte first (known as network byte order).  Note that the
field names, particularly for multiplexed fields, follow historical
practice.

0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          time_low                             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|       time_mid                |         time_hi_and_version   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|clk_seq_hi_res |  clk_seq_low  |         node (0-1)            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         node (2-5)                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

So, assuming our 16-byte array is encoded in the format recommended in RFC 4122, each field will be stored most-significant byte first and our function should return the string
    {00112233-4455-6677-8899-AABBCCDDEEFF}

However...

“In the absence of explicit application or presentation protocol specification to the contrary...”, so in other words you cannot write a universally applicable 16-byte array to UUID string function because the encoding of the UUID has not been mandated by the RFC.

RFC 4122 may be perfectly clear, but I believe the freedom it permits could cause confusion. Here is an example where the default encoding described in RFC 4122 was not adopted:
Although RFC 4122 recommends network byte order for all fields, the PC industry (including the ACPI, UEFI, and Microsoft specifications) has consistently used little-endian byte encoding for the first three fields: time_low, time_mid, time_hi_and_version. The same encoding, also known as wire format, should also be used for the SMBIOS representation of the UUID. System Management BIOS (SMBIOS) Specification v2.6

But the UUID field was introduced in SMBIOS specification version 2.1 and the above quoted clarification was absent from that specification; one was just told “UUID; 16 BYTEs” (and something about what all FFs and all 00s meant) and left to get on with it. And you would have been wrong to read those 16 bytes assuming they were big-endian. Had you done so you would at some point have found that your UUID for a given computer did not match the UUID someone else obtained for that same computer via some other means, say WMI. And you might have felt both annoyed and embarrassed about it; embarrassed that you didn’t make more effort to check that your interpretation was correct in the first place and annoyed that the documentation was not more explicit - so annoyed that you’d feel the need to blog about how unfair it all was.

To summarise: if the encoding is big-endian, as described in RFC 4122, the UUID represented by the bytes
const unsigned char bytes[16] = {
    0x00, 0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77,
    0x88, 0x99, 0xAA, 0xBB, 0xCC, 0xDD, 0xEE, 0xFF
};

will be
    {00112233-4455-6677-8899-AABBCCDDEEFF}

But if the encoding is little-endian, as described in SMBIOS V2.6, the same 16 bytes will represent the UUID
    {33221100-5544-7766-8899-AABBCCDDEEFF}

Perhaps your system uses some other UUID encoding...


UPDATE June 2012

When I read the note in the SMBIOS v2.6 specification I interpreted it not as saying the format was changing from big- to little-endian, but as saying that the UUID had always been stored little-endian, but this had hitherto not been explicitly stated. But now it has been explicitly stated that the value is stored little-endian what should you do if you've got code out in the field that's written assuming it was stored big-endian? I think these are your choices:

a) You could leave your code as it is. But that would mean your code may get a different value to that obtained through another means, such as WMI. This is what I described above. The upside is that your code is consistent, which may or may not matter in your application.

b) You could change your code so that it always reads the UUID in little-endian byte order. This is may make it more likely to be consistent with the value obtained through other means. But it does mean your code is not consistent, which may or may not matter in your application.

c) You could sometimes read it little-endian and sometimes big-endian, depending on the SMBIOS version number. For example, here is what is done in dmidecode:


    from dmidecode.c:
    /*
     * As off version 2.6 of the SMBIOS specification, the first 3
     * fields of the UUID are supposed to be encoded on little-endian.
     * The specification says that this is the defacto standard,
     * however I've seen systems following RFC 4122 instead and use
     * network byte order, so I am reluctant to apply the byte-swapping
     * for older versions.
     */
    if (ver >= 0x0206)
        printf("%02X%02X%02X%02X-%02X%02X-%02X%02X-%02X%02X-%02X%02X%02X%02X%02X%02X",
            p[3], p[2], p[1], p[0], p[5], p[4], p[7], p[6],
            p[8], p[9], p[10], p[11], p[12], p[13], p[14], p[15]);
    else
        printf("%02X%02X%02X%02X-%02X%02X-%02X%02X-%02X%02X-%02X%02X%02X%02X%02X%02X",
            p[0], p[1], p[2], p[3], p[4], p[5], p[6], p[7],
            p[8], p[9], p[10], p[11], p[12], p[13], p[14], p[15]);


It doesn't seem right to me that the value of the SMBIOS UUID should depend on the programmer's judgement in this way.

Even after you get past this byte-order issue, there are other problems with the SMBIOS UUID you should be aware of.

index of blog posts

08 March 2009

Test for a Power of 2

A while ago I wanted to test whether a given number was a power of 2. The code was C++ and the number was a long.


bool is_power_of_2(long);


assert(is_power_of_2(8));

assert(!is_power_of_2(9));


I had a feeling that there was a simple way to test for this property – after all, it just means that it has exactly one bit set doesn’t it? But I couldn’t think of an elegant way to test for this. I ended up using a function that returned the number of bits set, like this:


int total_bits_set(long);


// return true iff ‘n’ is a power of 2

bool is_power_of_2(long n)

{

    return total_bits_set(n) == 1;

}


So that worked, and I moved on. But I knew it was not the simplest way, not least because my total_bits_set() code involved shifting, masking and indexing into a 256 byte table four times. (By the way, ‘iff’ means ‘if and only if’; also I realise now that this implementation incorrectly returns true for is_power_of_2(-2147483648).)


Some time later I was surfing around and I stumbled upon Sean Eron Anderson’s site Bit Twiddling Hacks. And there was the answer, which I implemented like so:


// return true iff ‘n’ is a power of 2

bool is_power_of_2(long n)

{

    return n > 0 && (n & (n - 1)) == 0;

}


That was the sort of clever trick I felt sure existed and meant I could throw away the total_bits_set() code. (This algorithm is described in Wikipedia.)


The code needs to work on gcc on various Unix platforms and on MSVC on Windows. On all these platforms a long is implemented as a 32-bit two’s complement value, which means the above trick will work for all of them. But I got to wonder about how portable this code really is. My copy of the ISO C++ standard says 

[...] the signed and unsigned integer types are collectively called integral types. [...] The representations of integral types shall define values by use of a pure binary numeration system. [Example: this International Standard permits 2’s complement, 1’s complement and signed magnitude representations for integral types.]

3.9.1 Fundamental types, Paragraph 7, ISO/IEC 14882:1998


And in a footnote it elaborates on binary numeration systems

A positional representation for integers that uses the binary digits 0 and 1, in which the values represented by successive bits are additive, begin with 1, and are multiplied by successive integral power of 2, except perhaps for the bit with the highest position. (Adapted from the American National Dictionary for Information Processing Systems.)


I believe the bit patterns for all three permitted integral type representations are the same for all represented values greater than zero. Therefore the above trick will work correctly on all compliant C++ compilers.


index of blog posts