Cosmopolitan makes C a build-once run-anywhere language, similar to Java, except it doesn't require interpreters or virtual machines be installed beforehand. Cosmo provides the same portability benefits as high-level languages like Go and Rust, but it doesn't invent a new language and you won't need to configure a CI system to build separate binaries for each operating system. What Cosmopolitan focuses on is fixing C by decoupling it from platforms, so it can be pleasant to use for writing small unix programs that are easily distributed to a much broader audience.
Assuming you have GCC on Linux, then all you need are the five additional files which are linked below:
# create simple c program on command line printf %s ' main() { printf("hello world\n"); } ' >hello.c # run gcc compiler in freestanding mode gcc -g -Os -static -fno-pie -no-pie -mno-red-zone -nostdlib -nostdinc \ -fno-omit-frame-pointer -pg -mnop-mcount \ -o hello.com.dbg hello.c -Wl,--gc-sections -fuse-ld=bfd \ -Wl,-T,ape.lds -include cosmopolitan.h crt.o ape.o cosmopolitan.a objcopy -SO binary hello.com.dbg hello.com # ~40kb static binary (can be ~16kb w/ MODE=tiny) ./hello.com
The above command fixes GCC so it outputs portable binaries that will
run on every Linux distro in addition to Mac OS X, Windows NT,
FreeBSD, OpenBSD, and NetBSD too. For details on how this works,
please read the αcτµαlly
pδrταblε εxεcµταblε blog post. This novel binary format is also
optional, since hello.com.dbg
is executable too, only on
your local system since it's an ELF binary.
Your program will also boot on bare metal too. In other words, you've written a normal textbook C program, and thanks to Cosmopolitan's low-level linker magic, you've effectively created your own operating system which happens to run on all the existing ones as well. Now that's something no one's done before.
Please join the Cosmopolitan Cosmonauts Google Group!
Cosmopolitan has been optimized by hand for excellent performance on modern desktops and servers. Compared with glibc, you should expect Cosmopolitan to be almost as fast, but with an order of a magnitude tinier code size. Compared with Musl or Newlib, you can expect that Cosmopolitan will generally go much faster, while having roughly the same code size, if not tinier.
In the case of the most important libc function, memcpy(), Cosmopolitan outperformed every other open source library tested. The chart below shows how quickly memory is transferred depending on the size of the copy. Since it's log scale, each grid square represents a 2x difference in performance. What makes Cosmopolitan so fast here is it uses uses several different memory copying strategies. For small sizes it uses an indirect branch with overlapping moves; for medium sizes it uses simd vectors, and for large copies it uses nontemporal hints which prevent cache thrash. Other libraries usually fall short because they use a one-size-fits-all strategy. For example, Newlib goes 10x slower for the optimal block size (half L1 cache) because it always does nontemporal moves.
Performing the best on benchmarks isn't enough. Cosmopolitan also uses a second technique that the above benchmark doesn't measure, which we call "trickle-down performance". For an example of how that works, consider the following common fact about C which is often overlooked. External function calls such as the following:
memcpy(foo, bar, n);
Are roughly equivalent to the following assembly, which leads compilers to assume that most cpu state is clobbered:
asm volatile("call memcpy" : "=a"(rax), "=D"(rdi), "=S"(rsi), "=d"(rdx) : "1"(foo), "2"(bar), "3"(n) : "rcx", "r8", "r9", "r10", "r11", "memory", "cc", "xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "xmm5", "xmm6");
In other words the compiler assumes that, in calling the function, fifteen separate registers and all memory will be overwritten. See the System V ABI for further details. This can be problematic for frequently-called functions such as memcpy, since it inhibits many optimizations and it tosses a wrench in the compiler register allocation algorithm, thus causing stack spillage which further degrades performance while bloating the output binary size.
So what Cosmopolitan does for memcpy() and many other frequently-called core library leaf functions, is defining a simple macro wrapper, which tells the compiler the correct subset of the abi that's actually needed, e.g.
#define memcpy(DEST, SRC, N) ({ \ void *Dest = (DEST); \ void *Src = (SRC); \ size_t Size = (N); \ asm("call memcpy" \ : "=m"(*(char(*)[Size])(Dest)) \ : "D"(Dest), "S"(Src), "d"(n), \ "m"(*(char(*)[Size])(Src)) \ : "rcx", "xmm3", "xmm4", "cc"); \ Dest; \ })
What this means, is that Cosmopolitan memcpy() is not simply fast, it also makes unrelated code in the functions that call it faster too as a side-effect. When this technique was first implemented for memcpy() alone, many of the functions in the Cosmopolitan codebase had their generated code size reduced by a third.
For an example of one such function, consider strlcpy
,
which is the BSD way of saying strcpy
:
/** * Copies string, the BSD way. * * @param d is buffer which needn't be initialized * @param s is a NUL-terminated string * @param n is byte capacity of d * @return strlen(s) * @note d and s can't overlap * @note we prefer memccpy() */ size_t strlcpy(char *d, const char *s, size_t n) { size_t slen, actual; slen = strlen(s); if (n) { actual = MIN(n - 1, slen); memcpy(d, s, actual); d[actual] = '\0'; } return slen; }
If we compile our strlcpy
function, then here's the
assembly code that the compiler outputs:
/ compiled with traditional libc strlcpy: push %rbp mov %rsp,%rbp push %r14 mov %rsi,%r14 push %r13 mov %rdi,%r13 mov %rsi,%rdi push %r12 push %rbx mov %rdx,%rbx call strlen mov %rax,%r12 test %rbx,%rbx jne 1f pop %rbx mov %r12,%rax pop %r12 pop %r13 pop %r14 pop %rbp ret 1: cmp %rbx,%rax mov %r14,%rsi mov %r13,%rdi cmovbe %rax,%rbx mov %rbx,%rdx call memcpy movb $0,0(%r13,%rbx) mov %r12,%rax pop %rbx pop %r12 pop %r13 pop %r14 pop %rbp ret .endfn strlcpy,globl |
/ compiled with cosmopolitan libc strlcpy: mov %rdx,%r8 mov %rdi,%r9 mov %rsi,%rdi call strlen test %r8,%r8 je 1f cmp %r8,%rax lea -1(%r8),%rdx mov %r9,%rdi cmova %rax,%rdx call MemCpy movb $0,(%r9,%rdx) 1: ret .endfn strlcpy,globl |
That's a huge improvement in generated code size. The above two
compiles used the same gcc flags and no changes to the code needed to
be made. All that changed was we used cosmopolitan.h (instead of the
platform c library string.h) which contains ABI specialization macros
for memcpy
and strlen
. It's a great example
of how merely choosing a better C library can systemically eliminate
bloat throughout your entire codebase.