15 January 2010

Roman Numbers

I was dipping into the literate source of Donald Knuth's TeX.web when I came upon this



The TeX.web source that generates this snippet looks like Listing A.

Listing A:
@ Roman numerals are produced by the |print_roman_int| routine.  Readers
who like puzzles might enjoy trying to figure out how this tricky code
works; therefore no explanation will be given. Notice that 1990 yields
\.{mcmxc}, not \.{mxm}.

@p procedure print_roman_int(@!n:integer);
label exit;
var j,@!k: pool_pointer; {mysterious indices into |str_pool|}
@!u,@!v: nonnegative_integer; {mysterious numbers}
begin j:=str_start["m2d5c2l5x2v5i"]; v:=1000;
loop@+ begin while n>=v do
begin print_char(so(str_pool[j])); n:=n-v;
end;
if n<=0 then return; {nonpositive input produces no output}
k:=j+2; u:=v div (so(str_pool[k-1])-"0");
if str_pool[k-1]=si("2") then
begin k:=k+2; u:=u div (so(str_pool[k-1])-"0");
end;
if n+u>=v then
begin print_char(so(str_pool[k])); n:=n+u;
end
else begin j:=j+2; v:=v div (so(str_pool[j-1])-"0");
end;
end;
exit:end;



I converted Knuth's original Pascal more-or-less directly into C++, except that this version returns the Roman numerals as a string rather than printing them, which makes it referentially transparent and easier to test.

// converted from Donald Knuth's Pascal print_roman_int procedure in TeX.web
std::string roman_int(int n)
{
std::string result;
const char * j = "m2d5c2l5x2v5i";
int v = 1000;
for (;;) {
while (n >= v) {
result += *j;
n = n - v;
}
if (n <= 0)
break; // nonpositive input produces no output
const char * k = j + 2;
int u = v / (k[-1] - '0');
if (k[-1] == '2') {
k = k + 2;
u = u / (k[-1] - '0');
}
if (n + u >= v) {
result += *k;
n = n + u;
}
else {
j = j + 2;
v = v / (j[-1] - '0');
}
}
return result;
}


The wikipedia article on Roman numerals shows this table:

×1×2×3×4×5×6×7×8×9
OnesIIIIIIIVVVIVIIVIIIIX
TensXXXXXXXLLLXLXXLXXXXC
HundredsCCCCCCCDDDCDCCDCCCCM
ThousandsMMMMMMIVVVIVIIVIIIIX
Ten thousandsXXXXXXXLLLXLXXLXXXXC
Hundred thousandsCCCCCCCDDDCDCCDCCCCM


From this table a pattern is fairly easy to see and I wrote my own version shown in Listing B (like Knuth I didn't include over-barred numerals, though they would be just as easy to add).

Listing B:
// return lower-case Roman numeral equivelant of given 'num'
std::string roman_int_ii(int num)
{
struct pair { const char * const symbol; int value; };
const pair table[] = {
{"m", 1000}, {"cm", 900}, {"d", 500}, {"cd", 400}, {"c", 100}, {"xc", 90},
{"l", 50}, {"xl", 40}, {"x", 10}, {"ix", 9}, {"v", 5}, {"iv", 4}, {"i", 1}
};
std::string result;
for (const pair * p = table; num > 0; ++p) {
while (num >= p->value) {
result += p->symbol;
num -= p->value;
}
}
return result;
}


And then I checked my version against Knuth's and found the two functions produce identical results:

int main()
{
for (int i = 0; i < 5000; ++i) {
const std::string a(roman_int(i));
const std::string b(roman_int_ii(i));
if (a != b) {
std::cout
<< "Doesn't work for " << i
<< " got '" << b
<< "' instead of '" << a << "'\n";
}
}
}


I enjoy reading bits and pieces of Knuth's literate code, such as TeX and Adventure. Knuth's commentary can be interesting and the code looks quite beautiful in its typeset form. But I'd prefer to be working with code like that in Listing B rather than that in Listing A, principally because I can readily understand B whereas A requires a lot more effort. I guess, as he hints in the comment, in this case he just wanted to have fun with some tricky code; he’s produced a kind of domain-specific language interpreter in a dozen or so lines of code. (I love code.) But it’s not just the choice of algorithm I’m uncomfortable with, it’s the WEB source code itself.

I have not written any code using Knuth’s literate programming tools but from what I’ve seen of what others have written I think that WEB sources are less readable than plain sources.

Here is a fragment of the CWEB source for Adventure:

@ Here is our overall strategy for administering the game. It is understood
that the program might |goto quit| from within any of the subsections
named here, even though the section names don't mention this explicitly.
For example, while checking for interference we might find out that
time has run out, or that a dwarf has killed you and no more
reincarnations are possible.

The execution consists of two nested loops: There are ``minor
cycles'' inside of ``major cycles.'' Actions define minor cycles in which
you stay in the same place and we tell you the result of your action.
Motions define major cycles in which you move and we tell you what you
can see at the new place.

@<Simulate...@>=
while (1) {
@<Check for interference with the proposed move to |newloc|@>;
loc=newloc; /* hey, we actually moved you */
@<Possibly move dwarves and the pirate@>;
commence: @<Report the current state@>;
while (1) {
@<Get user input; |goto try_move| if motion is requested@>;
@<Perform an action in the current place@>;
}
try_move: @<Handle special motion words@>;
oldoldloc=oldloc;
oldloc=loc;
go_for_it: @<Determine the next location, |newloc|@>;
}

@ Our main task in the simulation loop is to parse your input.
Depending on the kind of command you give, the following section
of the program will exit in one of four ways:

\smallskip
\item{$\bullet$} |goto try_move| with |mot| set to a desired motion.

\item{$\bullet$} |goto transitive| with |verb| set to a desired action
and |obj| set to the object of that motion.

\item{$\bullet$} |goto intransitive| with |verb| set to a desired action
and |obj=NOTHING|; no object has been specified.

\item{$\bullet$} |goto speakit| with |hash_table[k].meaning| the
index of a message for a vocabulary word of |message_type|.

\smallskip\noindent
Sometimes we have to ask you to complete an ambiguous command before we
know both a verb and its object. In most cases the words can be in
either order; for example, \.{take} \.{rod} is equivalent to \.{rod} \.{take}.
A~motion word overrides a previously given action or object.

Lots of special cases make the program a bit messy. For example,
if the verb is \.{say}, we don't want to look up the object in our
vocabulary; we simply want to ``say'' it.

@<Get user input...@>=
verb=oldverb=ABSTAIN;
oldobj=obj;
obj=NOTHING;
cycle: @<Check if a hint applies, and give it if requested@>;
@<Make special adjustments before looking at new input@>;
listen();
pre_parse: turns++;
@<Handle special cases of input@>;
@<Check the clocks and the lamp@>;
@<Handle additional special cases of input@>;
parse:@<Give advice about going |WEST|@>;
@<Look at |word1| and exit to the right place if it completes a command@>;
shift: strcpy(word1,word2);@+*word2='\0';@+goto parse;


The way the code is kind of lost in all the words and CWEB command codes and verbose section names, the way the structure is lost when some code is jammed up to the left margin, the way all the variables seem to end up global (how do you keep track of scope?), I just don’t like. And those goto labels all over the place! I really don’t know how you keep the structure in your head. I think I must be very visual: I need to see the structure on the screen.

Perhaps I’m mixing up all literate programming with the specific examples I’ve looked at; and anyway I might find it just as hard to read Mr. Knuth’s code if it was written in plain C. It’s clear that we are all different and we don’t all share the same coding style. And thank goodness for that! As I say, I haven’t worked with any literate code, just read some. So my opinions are just first impressions. Perhaps if the reason for the code is primarily to be typeset for exposition, the ugliness of the source is a price worth paying. But for ordinary working-class code I'd rather read and write plain idiomatic native code - but code written with care.


index of blog posts

No comments:

Post a Comment