TL;DR Writing power-of-2 malloc memory allocator in C
Any sufficiently advanced technology is indistinguishable from magic or so they say. Today we are going to lift some shadows from the very basic thing that is dynamic memory management in an application process. This knowledge is essential for anyone who wants to write his/her own Programming Language so gaining this knowledge is not an optional thing since it opens a whole new lever of understanding how dynamic memory is managed by application and is split between different applications. To do that we need to have some basic understanding of how the actual physical thing is used by operating system. For those who don’t know anything about Virtual Memory or what Memory Pages are and which algorithms modern operating systems are using to manage those I suggest reading Operating Systems Concepts and for those who want to know all the guts of how Linux does this under the hood there is Understanding Linux Kernel.
For the lazy rest to put it simply physical memory is split into fixed size blocks which are called Memory Pages and depending on the computer architecture and operating system can be 2kb 4kb 16kb 32kb or even 4mb in size. Anytime we request memory from operating system using some syscall it’s best to have the size divisible by the actual operating operating system’s Memory Page size. If used simply this tends to waste a lot of memory so hence we need a way to be more efficient with our requests and also we need to think of a way to recycle the memory requested before which isn’t already needed by the application. All this opens a new area for exploration which is called Memory Allocation Algorithms and is quite a hot topic because not only we need to manage the recycling complexity but also concurrency and parallelism of the actual algorithms to be more efficient with modern multi-processor computers. Some notable examples of algorithm that do that and are used in real world are Doug Lea’s malloc, TLSF and jemalloc, also there are much more out there. In order to explore some basic ideas about memory allocators we’ll write the simplest algorithm commonly called power of 2. The essential idea behind it is to split blocks of memory into sizes divisible by 2 such as 32, 64, 128 and etc. By doing so we could reuse already used blocks by breaking them into more smaller onces. For example 128 block can be broken into 2 blocks of 64 and each of them can also further be broken to 32 sized blocks.
In general this tends to waste a lot of memory and introduces memory fragmentation. In worst case scenario half of all requested memory is wasted but usually in real world cases the actual number is near 25%. The memory wasting cost has computing performance gain as the algorithm itself is not hard to implement and works faster than more complex but conservative ones. Hence we are going to write a simple memory allocator also commonly called malloc since this is the library function name used by standard C library for allocating dynamic memory of variable size for the application process. We are going to implement our version of that function and related functions. Our target platform will be Linux and gcc compiler since we aren’t going to bother with portability as this will take much more time and effort. Final version of source code can be browsed in mymalloc repo but I strongly encourage you to write the whole thing from scratch yourself just to better understand the actual processes under the hood.
First let’s start with function definitions that we need to reimplement. Create a file called mymalloc.h. We’ll also define some functions for debug/stats purposes as those will be helpful during development process. Note: all this functions are defined by glibc in linux so gcc will complain about them being redefined during compilation but you can ignore those messages.
1 2 3 4 5 6 7 |
|
Now let’s create a file called mymalloc.c and define the actual functions. We’ll start with some initial values which are provided by operating system. The defined values are self explaining however I need to clarify the choice with 32 bytes for minimum block size. Since we’ll manage memory blocks as doubly linked list nodes, each memory block needs to have space for at least two pointers and one size value which can’t be more than the pointer size so that’s 3 pointers for each node. Maximum size for 3 pointers for 64-bit operating system is 8 bytes so it’s overall 24 bytes however block size needs to be power of 2 value and nearest size is 25 hence 32.
1 2 3 4 5 6 |
|
We need a place to save free memory blocks and a way to manage them so let’s first describe what memory block is and define initial doubly linked list for it that will contain start and end nodes pointing at each other. Those nodes won’t be used and are there just so that our code won’t contain conditional if statements when handling nodes.
1 2 3 4 5 6 7 8 9 10 11 12 |
|
Note that when memory block will be used by an application it needs to have only size value reserved because we don’t need doubly linked pointers as it’s not in a free block list. Thus we need a method to reserve memory block’s size field when we return it to the application. In order to ease interaction with memory block structure and it’s pointers let’s define some useful macros that will help us elevate the task of specifying casts manually. Pointer arithmetic can be sometimes daunting and puzzling in C but if used correctly it’s very powerful feature and macro system helps us do exactly that. We also need an easy way to link two memory blocks with each other and even sometimes replace those links so we’ll define some macros for that kind of operations too.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 |
|
Let’s start with malloc function, first of all let’s correct initial size and add to it sizeof(size_t) since each allocated by an application block will preserve it.
1 2 3 4 5 6 7 8 |
|
Next as you can see we need to find an optimal size for requested size s. We’ll use a famous trick with bitwise shifting value to right in order to get optimal size for our memory block starting with MIN_BLOCK_SIZE. So the sizes in while loop will go from 32, 64, 128, 256 and further.
1 2 3 4 5 6 7 8 9 |
|
Next we need to handle large memory blocks. We’ll be allocating those using mmap syscall which gives us an arbitary memory block for requested size usually rounded to page size. In worst case we are loosing PAGE_SIZE-1 bytes, however for large memory blocks which we are going to handle separately we can ignore it because block might be resized using mremap syscall in the future reclaiming page which wasn’t used entirely before. Note: we’ve also added definition of MMAP_SIZE to our initial value definitions and set it equal to 1 MB.
1 2 3 4 5 6 7 8 9 10 11 12 |
|
As you’ve already noticed we’ve used strange lock and unlock macros there. That’s because we’ll also be counting overall mmap blocks in mmap_size variable purely for statistical purposes. However since malloc function can be called from different threads simultaneously we need a way to synchronize modifications to this variable. And not only this variable can be modified simultaneously. Unfortunately as we are striving for most efficiency we can’t use immutable data structures for freelist and we need a way to concurrently modify it so that in the end all modifications would be consistent. Let’s use atomic spinlock for all our concurrency needs. For a description of what atomic spinlock is try reading Marek’s Reinventing Spinlocks as the code below was shamelessly copied from there. Also instead of simple freelist we could use lock-free doubly linked list data structure but since we are focusing our efforts on memory allocation I won’t describe it in here.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 |
|
Let’s continue our effort with malloc function. Now we need to handle blocks with size less than MMAP_SIZE
1 2 3 4 5 6 7 8 |
|
In order to find suitable block from memory list we’ll just look through it.
1 2 3 4 5 6 7 8 9 10 11 12 13 |
|
Now what if block is sufficient for our needs. We need to split it or return it whole if it’s needed entirely. The half left will be added back to freelist if remainder is more than or equals MIN_BLOCK_SIZE since there is no way we could add a block with less size.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
|
But wait, what if there are no free memory blocks? Let’s allocate some memory from heap using sbrk syscall. Now the difference with sbrk versus mmap is that sbrk allocates continuous virtual memory or simply speaking grows heap outward (note: stack grows inward and heap outward so when those two meet each other ka-boom!). For those of you who don’t know what heap is try reading this. Unfortunately sbrk is slower than mmap performance-wise, but we need a specific property of memory allocated using sbrk so we’ll be using it exclusively for that. Each newly allocated block has higher memory address than previous one and both of them are adjacent. We’ll use this property to always have a sorted freelist in order to connect adjacent memory blocks more easily and quicker. This is our way to fight memory fragmentation. Also in order to minimize number of sbrk calls we’ll allocate memory by larger chunks specified in ALLOC_SIZE instead of just number of pages we need.
1 2 3 4 5 6 7 8 9 10 11 |
|
Next after allocating new block using sbrk we just split it and add back to freelist. We are also counting the overall heap size allocated by sbrk calls just for stats. Also we save heap_start and heap_end in order to distinguish memory blocks allocated using sbrk from mmap blocks.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
|
Now the only thing left is to add the free block to the freelist. This might sound trivial but we need to handle a little bit more complexity than there might seem. First of all we need to insert blocks in sorted order. Next after inserting free block we need to merge it with adjacent blocks near on the left and right from it, if those blocks are continuously allocated in virtual memory space.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 |
|
That’s it with malloc now let’s write free. As you can see below we check if memory block is allocated using mmap and munmap it. Otherwise we add it back to freelist.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 |
|
Finally we are checking the end of heap and give back free memory to operating system using sbrk with negative size if the size is at least GIVE_BACK_SIZE which we’ve also added to initial values.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 |
|
Next after free let’s implement realloc. As you can see it mostly repeats malloc, however we are using mremap for mmap blocks since those are already allocated and we need to resize them.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 |
|
Next we check if block already contains the necessary size or the size can be obtained by connecting adjacent blocks.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
|
Merging with adjacent block is quite complex operation thus we make it configurable so that if necessary we can always disable it. We handle both left and right adjacent cases. Left one is more complex and right one is simpler. Also we need to split block before connecting it with our initial one if the size has remainder, hence the complexity.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 |
|
Finally we handle the case when there is no memory to merge and no to resize. We just allocate a new memory and copy data over into it. Afterwards we deallocate the previous memory block.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
|
Main work is done, remaining are just supplementary functions which are self explanatory.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 |
|
Now to test/benchmark this whole thing I wrote a little application which is called memsim and is in mymalloc repo. It reads memory allocation simulation file without requesting any dynamic memory from filesystem and calls malloc, realloc, free and user provided custom stats function while counting timings and overall runtime without using any heap memory, internally allocating memory on stack using alloca function, so that it’s workings don’t affecting the testing process. In order to generate a complex memory simulation (which is just a simple text file) I wrote genrandms application that just does that. Now a little benchmarks to compare our malloc with glibc malloc on some intense scenario. mymemsim and sysmemsim are memsim executables linked against our malloc implementation and glibc malloc implementation.
1 2 3 4 5 |
|
Now let’s try a multithreading benchmark
1 2 3 4 5 |
|
Unfortunately our malloc implementation is not as multithread friendly as it can be. We literally have one single global spinlock and we always lock it when we allocate memory. We can improve on that by splitting freelist into different partitions. This will sacrifice memory fragmentation fighting however will improve scalability over multiple processors as we can have different spinlocks for each partition and won’t have to wait between them. The implementation is more complex and can be studied in mymalloc repo mysmalloc.c file despite overall idea being simple. Let’s benchmark it again.
1 2 3 4 5 |
|
As you can see we’ve improved performance by partially sacrificing fragmentation, improving scalability, and partially increasing memory usage. Further improvements and experiments are possible as possibilities are countless however our time has run out so that’s it for today folks! Have a nice memory hacking time!