C performance notes

A few things I’ve learned while auditing open-source projects.

Be careful when using `calloc`

These days, the common wisdom is not to worry much about useless writes. For small, simple cases, this is great advice. People sometimes make risky decisions to omit a few store instructions, even though modern compilers are often smart enough to identify dead assignments on their own. Moreover, trying to shave away needless initializations is rarely a valuable time investment.

calloc, however, can be a notable exception. As you likely know, calloc zeroes the memory that it allocates. It’s probably linked into your binary dynamically, so the compiler can’t identify its writes as useless in IR- or assembly-level optimizations.

Clang 3.7 with -O2 can reduce a trivial usage like this:

p = calloc(1, sizeof(long));
return p[0];

to a plain `return 0`. However, a typical usage of the form:

p = calloc(n, sizeof(struct huge_struct));
memcpy(p, p2, n * sizeof(struct huge_struct));

will call the stdlib’s calloc function, and therefore includes the useless zeroing. Writes of this type can be tens of kilobytes, which is a big performance hit on a hot path.

The compiler doesn’t have an easy out, either. While there are platform-specific functions like OpenBSD’s reallocarray, there is no POSIX- or ANSI-specified calloc relative that doesn’t zero its memory. The compiler could add an overflow check and then call malloc, but it’s then responsible for the correctness and the size of this silently injected code.

So, avoid calloc when immediately overwriting the allocated memory. Instead, use malloc if the number of elements is 1 and your platform’s reallocarray equivalent otherwise. If there isn’t one, you can always include the file from OpenSSH’s portability code.
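As a rough sketch of that advice, the helper below duplicates an array without the redundant zeroing. The name `dup_array` is hypothetical, and the explicit overflow check stands in for a platform `reallocarray`, which may not be available everywhere.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copy n elements of `size` bytes each from src.
 * Returns NULL if n * size would overflow or if the allocation fails. */
static void *dup_array(const void *src, size_t n, size_t size)
{
    /* The overflow check that calloc would otherwise have done for us. */
    if (size != 0 && n > SIZE_MAX / size)
        return NULL;

    size_t bytes = n * size;

    /* malloc(0) may legally return NULL, so request at least one byte;
     * that way a NULL result always means failure. */
    void *p = malloc(bytes ? bytes : 1);
    if (p == NULL)
        return NULL;

    /* Every byte is overwritten here, so zeroing first would be wasted work. */
    if (bytes != 0)
        memcpy(p, src, bytes);
    return p;
}
```

With something like this in place, the calloc-then-memcpy pattern becomes a single call such as `p = dup_array(p2, n, sizeof(struct huge_struct));`, and the caller still gets the multiplication overflow protection that made calloc attractive in the first place.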

Extending this across libc

The above calloc example exposes a more general weakness in modern C. While compiler optimizations at the basic block and function level are very strong, it’s amazing how little Clang[1] leverages the libc API for optimization. While there are a few complications, it seems like a fruitful approach.

To this effect, I’m writing a strnlen(3) optimization for Clang. Hopefully I’ll be able to write similar optimizations for other string.h functions. And hopefully I’ll have time to share details of the experience and related advice here.
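To make the idea concrete, here is a portable reference for what strnlen computes, together with the kind of call a libc-aware optimizer could in principle fold to a constant. This only illustrates the semantics; it is not the Clang patch itself, and `my_strnlen` and `bounded_greeting_len` are hypothetical names.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical reference implementation: the length of s, capped at maxlen.
 * memchr stops at the first NUL, so nothing past s[maxlen - 1] is read. */
static size_t my_strnlen(const char *s, size_t maxlen)
{
    const char *end = memchr(s, '\0', maxlen);
    return end ? (size_t)(end - s) : maxlen;
}

/* Both arguments are compile-time constants, so the result (3) is
 * knowable from the function's documented behavior alone. */
static size_t bounded_greeting_len(void)
{
    return my_strnlen("hello", 3);
}
```

An optimizer that knows the contract of strnlen could replace a call like the one in `bounded_greeting_len` with the constant 3, even when the real implementation lives in a dynamically linked libc that it cannot see into.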

[1] I haven’t had a chance to investigate how GCC does this. What I’ve heard about its architecture and code quality makes me hesitant…
