Notes on Programming in C
Kernighan and Plauger’s The Elements of Programming Style was an important and rightly inﬂuential book. But sometimes I feel its concise rules were taken as a cookbook approach to good style instead of the succinct expression of a philosophy they were meant to be. If the book claims that variable names should be chosen meaningfully, doesn’t it then follow that variables whose names are small essays on their use are even better? Isn’t MaximumValueUntilOverflow a better name than maxval? I don’t think so.
What follows is a set of short essays that collectively encourage a philosophy of clarity in program- ming rather than giving hard rules. I don’t expect you to agree with all of them, because they are opinion and opinions change with the times. But they’ve been accumulating in my head, if not on paper until now, for a long time, and are based on a lot of experience, so I hope they help you understand how to plan the details of a program. (I’ve yet to see a good essay on how to plan the whole thing, but then that’s partly what this course is about.) If you ﬁnd them idiosyncratic, ﬁne; if you disagree with them, ﬁne; but if they make you think about why you disagree, that’s better. Under no circumstances should you program the way I say to because I say to; program the way you think expresses best what you’re trying to accomplish in the program. And do so consistently and ruthlessly.
Your comments are welcome.
Issues of typography
A program is a sort of publication. It’s meant to be read by the programmer, another programmer (perhaps yourself a few days, weeks or years later), and lastly a machine. The machine doesn’t care how pretty the program is — if the program compiles, the machine’s happy — but people do, and they should. Sometimes they care too much: pretty printers mechanically produce pretty output that accentuates irrelevant detail in the program, which is as sensible as putting all the prepositions in English text in bold font. Although many people think programs should look like the
Typographic conventions consistently held are important to clear presentation, of course — indenta- tion is probably the best known and most useful example — but when the ink obscures the intent, typogra- phy has taken over. So even if you stick with plain old
Ah, variable names. Length is not a virtue in a name; clarity of expression is. A global variable rarely used may deserve a long name, maxphysaddr say. An array index used on every line of a loop needn’t be named any more elaborately than i. Saying index or elementnumber is more to type (or calls upon your text editor) and obscures the details of the computation. When the variable names are huge, it’s harder to see what’s going on. This is partly a typographic issue; consider
- 2 -
for(i=0 to 100) array[i]=0
for(elementnumber=0 to 100) array[elementnumber]=0;
The problem gets worse fast with real examples. Indices are just notation, so treat them as such.
Pointers also require sensible notation. np is just as mnemonic as nodepointer if you con- sistently use a naming convention from which np means ‘‘node pointer’’ is easily derived. More on this in the next essay.
As in all other aspects of readable programming, consistency is important in naming. If you call one variable maxphysaddr, don’t call its cousin lowestaddress.
Finally, I prefer
I eschew embedded capital letters in names; to my
The use of pointers.
C is unusual in that it allows pointers to point to anything. Pointers are sharp tools, and like any such tool, used well they can be delightfully productive, but used badly they can do great damage (I sunk a wood chisel into my thumb a few days before writing this). Pointers have a bad reputation in academia, because they are considered too dangerous, dirty somehow. But I think they are powerful notation, which means they can help us express ourselves clearly.
Consider: When you have a pointer to an object, it is a name for exactly that object and no other. That sounds trivial, but look at the following two expressions:
The ﬁrst points to a node, the second evaluates to (say) the same node. But the second form is an expres- sion; it is not so simple. To interpret it, we must know what node is, what i is, and that i and node are related by the (probably unspeciﬁed) rules of the surrounding program. Nothing about the expression in isolation can show that i is a valid index of node, let alone the index of the element we want. If i and j and k are all indices into the node array, it’s very easy to slip up, and the compiler cannot help. It’s partic- ularly easy to make mistakes when passing things to subroutines: a pointer is a single thing; an array and an index must be believed to belong together in the receiving subroutine.
An expression that evaluates to an object is inherently more subtle and
If we want the next element’s type, it’s
- 3 -
i advances but the rest of the expression must stay constant; with pointers, there’s only one thing to advance.
Typographic considerations enter here, too. Stepping through structures using pointers can be much easier to read than with expressions: less ink is needed and less effort is expended by the compiler and computer. A related issue is that the type of the pointer affects how it can be used correctly, which allows some helpful
is sufﬁciently evocative; if an array is being indexed the array will have some
Again, the extra characters become more irritating as the examples become larger.
As a rule, if you ﬁnd code containing many similar, complex expressions that evaluate to elements of a data structure, judicious use of pointers can clear things up. Consider what
would look like using a compound expression for p. Sometimes it’s worth a temporary variable (here p) or a macro to distill the calculation.
Procedure names should reﬂect what they do; function names should reﬂect what they return. Func- tions are used in expressions, often in things like if’s, so they need to read appropriately.
is unhelpful because we can’t deduce whether checksize returns true on error or
makes the point clear and makes a future mistake in using the routine less likely.
A delicate matter, requiring taste and judgement. I tend to err on the side of eliminating comments, for several reasons. First, if the code is clear, and uses good type names and variable names, it should explain itself. Second, comments aren’t checked by the compiler, so there is no guarantee they’re right, especially after the code is modiﬁed. A misleading comment can be very confusing. Third, the issue of typography: comments clutter code.
But I do comment sometimes. Almost exclusively, I use them as an introduction to what follows. Examples: explaining the use of global variables and types (the one thing I always comment in large pro- grams); as an introduction to an unusual or critical procedure; or to mark off sections of a large computa- tion.
There is a famously bad comment style:
/* Add one to i */
and there are worse ways to do it:
- 4 -
Add one to i
Don’t laugh now, wait until you see it in real life.
Avoid cute typography in comments, avoid big blocks of comments except perhaps before vital sec- tions like the declaration of the central data structure (comments on data are usually much more helpful than on algorithms); basically, avoid comments. If your code needs a comment to be understood, it would be better to rewrite it so it’s easier to understand. Which brings us to
Most programs are too complicated — that is, more complex than they need to be to solve their prob- lems efﬁciently. Why? Mostly it’s because of bad design, but I will skip that issue here because it’s a big one. But programs are often complicated at the microscopic level, and that is something I can address here.
Rule 1. You can’t tell where a program is going to spend its time. Bottlenecks occur in surprising places, so don’t try to second guess and put in a speed hack until you’ve proven that’s where the bottleneck is.
Rule 2. Measure. Don’t tune for speed until you’ve measured, and even then don’t unless one part of the code overwhelms the rest.
Rule 3. Fancy algorithms are slow when n is small, and n is usually small. Fancy algorithms have big constants. Until you know that n is frequently going to be big, don’t get fancy. (Even if n does get big, use Rule 2 ﬁrst.) For example, binary trees are always faster than splay trees for workaday problems.
Rule 4. Fancy algorithms are buggier than simple ones, and they’re much harder to implement. Use simple algorithms as well as simple data structures.
The following data structures are a complete list for almost all practical programs:
array linked list hash table binary tree
Of course, you must also be prepared to collect these into compound data structures. For instance, a sym- bol table might be implemented as a hash table containing linked lists of arrays of characters.
Rule 5. Data dominates. If you’ve chosen the right data structures and organized things well, the algorithms will almost always be
Rule 6. There is no Rule 6.
Programming with data.
Algorithms, or details of algorithms, can often be encoded compactly, efﬁciently and expressively as data rather than, say, as lots of if statements. The reason is that the complexity of the job at hand, if it is due to a combination of independent details, can be encoded. A classic example of this is parsing tables, which encode the grammar of a programming language in a form interpretable by a ﬁxed, fairly simple piece of code. Finite state machines are particularly amenable to this form of attack, but almost any pro- gram that involves the ‘parsing’ of some abstract sort of input into a sequence of some independent ‘actions’ can be constructed proﬁtably as a
Perhaps the most intriguing aspect of this kind of design is that the tables can sometimes be gen- erated by another program — a parser generator, in the classical case. As a more earthy example, if an
- 5 -
operating system is driven by a set of tables that connect I/O requests to the appropriate device drivers, the system may be ‘conﬁgured’ by a program that reads a description of the particular devices connected to the machine in question and prints the corresponding tables.
One of the reasons
Another result of the tyranny of Pascal is that beginners don’t use function pointers. (You can’t have
Some of the complexity is passed to the routine pointed to. The routine must obey some standard protocol — it’s one of a set of routines invoked identically — but beyond that, what it does is its business alone. The complexity is distributed.
There is this idea of a protocol, in that all functions used similarly must behave similarly. This makes for easy documentation, testing, growth and even making the program run distributed over a net- work — the protocol can be encoded as remote procedure calls.
I argue that clear use of function pointers is the heart of
Simple rule: include ﬁles should never include include ﬁles. If instead they state (in comments or implicitly) what ﬁles they need to have included ﬁrst, the problem of deciding which ﬁles to include is pushed to the user (programmer) but in a way that’s easy to handle and that, by construction, avoids multi- ple inclusions. Multiple inclusions are a bane of systems programming. It’s not rare to have ﬁles included ﬁve or more times to compile a single C source ﬁle. The Unix /usr/include/sys stuff is terrible this way.
There’s a little dance involving #ifdef’s that can prevent a ﬁle being read twice, but it’s usually done wrong in practice — the #ifdef’s are in the ﬁle itself, not the ﬁle that includes it. The result is often thousands of needless lines of code passing through the lexical analyzer, which is (in good compilers) the most expensive phase.
Just follow the simple rule.