One common way is to reserve few bytes from the beginning of all allocations for "malloc headers". When user requests 20 bytes you actually allocate 24 bytes, so you can put the size in there. You can then return a pointer after the header. By substracting the header size from the pointer received by free() you can then find the headers again.
You might also want to have another 4 bytes to have a pointer, so you can construct the linked-list directly from the allocated regions. It's also possible to have some flag to indicate if a region is still allocated, or another pointer to make the list doubly-linked.. or whatever you want. I would personally forget about the bitmaps though.
One trick is to put the size both at the beginning and at the end of the region, so that you can scan backwards in memory to find the size of the previous region, and calculate position of it's header, thus being able to check if it's possible to join the two regions together.
There are other ways, like keeping the headers in some separate space, or splitting regions of certain size (like a page) and keeping the headers at the end of the large region. If you want to be truly efficient, you probably need a different policy for small and large blocks.
One decent malloc is so called "dlmalloc"
http://gee.cs.oswego.edu/dl/html/malloc.html which is available as source code, and is public domain. Might or might not be usable for kernel code though..
added: I strongly suggest that you test your malloc outside your kernel. It's very easy to not get it right the first time. I'd even suggest that you test it with some normal application.
If you are using some unix-like system (like Linux) you can compile it as a shared library and use LD_PRELOAD to get applications to use it without any recompile. If (when) they crash, you can try to reproduce the problem with a little test and debug the library without getting any nasty triple faults.