Unified (userland) Memory Management I [long]

Don · **Joined:** Wed Apr 27, 2016 12:52 pm **Posts:** 11

I'm working in an embedded/appliance market on custom hardware.
So, no constraints as to API's, porting legacy code, etc. The
environment is real-time, multithreaded and object-based. The pieces
of memory being managed are large-ish; the amount of memory available
is significant.

I'm unifying the user interface to manage memory -- no need for
separate API's for partitions/buffer pools vs. dynamic allocation.
These calls apply to a logical address space.

To that end, there are really just 4 interfaces:

create_heap
destroy_heap
allocate
release

For the time being, ignore the heap management functions -- assume it
exists.

The allocate request requires an indication of the HEAP against which the
request is made. Of course, it must exist, else FAIL. And, a SIZE for the
desired piece of memory. On success, a handle/pointer to the memory
fragment allocated is returned along with the actual size of the fragment.

The caller is expected to remember this handle. (well, if he ever expects
to release the fragment back to the heap!)

Nothing magical, here. And, nothing predictable, either! The caller has
no way of knowing how a request will be satisfied, if it will be satisfied,
how long the operation will take, etc.

To give the caller control over these aspects of the operation, the
allocation request is augmented with a variety of options that govern
the choices made inside the allocator.

[changing focus, for a moment, to expose some key details...]

The release operation takes the handle to a previously allocated fragment
AS RETURNED BY the allocate operation. Plus, the handle to the heap
from which the request was granted. Specifying an invalid handle or
the incorrect heap results in FAIL.

Additionally, the release operation allows the caller to specify how the
fragment should be reintegrated into the "Free List" (maintained for the
specified heap):

it can be placed at the HEAD of the free list, TAIL, inserted before the
first free fragment having an ADDRESS exceeding that of the newly released
fragment or before the first free fragment having a SIZE greater than the
size of the fragment being released
once inserted into the free list, it can be left AS_ALLOCATED as a free-standing
fragment (retaining the size it had "in the wild"); or, if can be merged with
"adjacent" fragments to COALLESCE and (possibly) reduce the number of
fragments in the free list as well as enlarging those.

Two other orthogonal features govern the operation:

the degree of heap sharing between potentially competing threads (i.e.,
allow the request to proceed UNFETTERED as if there are no other competitors
vying for these resources vs. ensuring ATOMIC operation in the presence of
potential competitors)
the extent of blocking tolerated by the request (i.e., return IMMEDIATEly
if FAIL, block in the hope of completing the desired service but no longer
than some specified TIMEOUT interval, WAIT indefinitely)

[A deadline handler is implemented separately from this so even if a thread
opts to WAIT indefinitely, the system will eventually kill it and reclaim its
resources. The TIMEOUT option gives the thread an opportunity to
regain control before that happens and, presumably, effect another
strategy.]

It should be obvious that a user can arrange to KNOW what the free list
looks like by carefully releasing fragments and constraining the actions
that the memory manager applies to them. E.g., if it has a set of
many fixed size fragments, it can force the memory manager to leave
these fragments in their current, fixed sizes so future requests
(coincidentally, for exactly these same sizes!) will be easily and
immediately handled -- by ANY fragment that is available (i.e., no
need to search through the free list if they're all the right size!)

As such, this gives you a buffer pool/partition capability using the
same framework (and code) that would otherwise have been used by
the (dynamic) memory manager.

[(sigh) This is getting long; but, if I elide details, someone will surely ask for them!]

The allocate request similarly takes options that govern how the
fragment in the free list is selected -- and processed -- to satisfy the
request.

fragment selection strategy: FIRST_FIT, LAST_FIT, BEST_FIT, WORST_FIT,
EXACT_FIT, NEVER_FIT (the latter being a way to stress test applications)
allocation trimming: TRIM, UNTRIMMED (i.e., if the fragment is larger than
requested, should it be trimmed to maximize the amount of memory remaining
in the free list?)
free remainder: TAKE_HEAD, TAKE_TAIL (which part of the TRIMmed fragment
should be returned to the caller vs. retained in the free list)
size homogeneity: AS_REQUESTED, POWERS_OF_TWO, LISTED_VALUES,
MULTIPLES (how should the specified SIZE be interpreted in locating a suitable
free fragment)

And, as with the release operation, the sharing and blocking options.

With these complementary operations, a diligent caller can know how they will be
using a particular heap and structure their requests to minimize the work done
by the memory management subsystem and obtain predictable/deterministic behavior
from it.

The number of options can usually be hidden inside a macro -- as a caller is unlikely
to be changing them from one invocation to the next. This helps hide complexity.

The options provided by the caller can be exploited by the memory management
routines to optimize how it provides its services. E.g., if the caller is always
releasing fragments and reintroducing into the free list by SIZE, then a subsequent
allocate request for FIRST_FIT effectively gives BEST_FIT -- as any fragments
encountered in the free list AFTER the first one large enough to satisfy the
request will be larger than this first one!

So...

Anything I have missed or failed to consider? (I want to be sure the groundwork
is in plce before posing more detailed questions)

With a bit of tinkering with options, you can see how you can create a buffer
pool from a heap (by repeatedly allocating AS_REQUESTED buffers TRIMmed
to that size; then releasing AS_ALLOCATED to the TAIL of the free list --
the HEAD would yield equivalent functionality but execute in longer time
as the next request could be "prematurely" handled by a previously released
buffer -- even though a large portion of the free space is still maintained
as a single large fragment, not yet trimmed by allocations!)

Brendan · **Posted:** Thu Apr 28, 2016 2:10 am

Hi,

As a general rule of thumb:

Generic code is rarely ideal for any specific purpose.
What you're doing is trying to create an over-complicated "swiss army knife". It will be useless for all the cases that can't be foreseen, and for all of the cases that were foreseen it will be a bloated mess due to support for cases that aren't needed by the application.

A better idea would be to provide functions to create and destroy memory pools, and then let the application provide its own heap manager/s that are custom designed specifically for the application (but also provide a simple "no frills" default heap manager for cases where the application simply doesn't care and is happy with any old "malloc() like" thing).

Cheers,

Brendan

Don · **Joined:** Wed Apr 27, 2016 12:52 pm **Posts:** 11

Brendan wrote:

As a general rule of thumb:

Generic code is rarely ideal for any specific purpose.

Which confirms that any old malloc() like thing is not ideal for this specific purpose!

Quote:

What you're doing is trying to create an over-complicated "swiss army knife". It will be useless for all the cases that can't be foreseen, and for all of the cases that were foreseen it will be a bloated mess due to support for cases that aren't needed by the application.

Have you considered what is involved to handle each of the "options" I described? Most are trivial to implement. Those that are more complicated (blocking in the allocator, opting for a lock on the heap itself -- or not) are required even if the simpler options are omitted/hardwired. They can't be efficiently implemented outside the memory manager.

Consider how you would allow a thread to wait for a particular allocation request to be handled -- in the absence of sufficient resources in the free list -- spin on the failing malloc()? And, hope no other thread sneaks in to grab any memory that becomes available while the other thread is busy spinning?

Or, how you would implement a timeout on that spinning -- so the thread can start recovery actions before the deadline handler is invoked on the thread (a sign that you've wasted resources with nothing to show for the effort)?

So, to implement specific different subsets of the various "simple" options, you end up having to replicate the more complex code; write a version that handles fixed size "buffers", another that handles page-aligned allocations; another that handles fine-grained requests; etc. And, hope that each of these implementations is free of errors and comprehensively tested/validated (the functionally identical code stanzas would have to be exercised in each different implementation instead of exercised once in a "shared" implementation).

I can, instead, wrap the "swiss army knife" implementation with "simple interfaces" for those cases that always need a specific set of options. A particular combination of options that are never used can simply not have a suitable wrapper provided.

Quote:

A better idea would be to provide functions to create and destroy memory pools, and then let the application provide its own heap manager/s that are custom designed specifically for the application (but also provide a simple "no frills" default heap manager for cases where the application simply doesn't care and is happy with any old "malloc() like" thing).

The entire RTOS is "designed specifically for the application". I don't include "things that won't be needed" (unused features and capabilities) as those things cost resources. Those resources add to the selling price of the product. And, as I'm not hosting other applications (I design appliances, not workstation environments), there's no point in including support for things that might be used (but aren't).

Brendan · **Posted:** Thu Apr 28, 2016 2:34 pm

Hi,

Don wrote:

Quote:

What you're doing is trying to create an over-complicated "swiss army knife". It will be useless for all the cases that can't be foreseen, and for all of the cases that were foreseen it will be a bloated mess due to support for cases that aren't needed by the application.

Have you considered what is involved to handle each of the "options" I described?

Yes. For example, the "FIRST_FIT, LAST_FIT, BEST_FIT, WORST_FIT, EXACT_FIT, NEVER_FIT" menagerie implies that (internally) the memory manager is going to have to use an extremely stupid and slow "list of entries" approach to handle all the unnecessary options; and this will involve searching the list and can't/won't be "O(1)" for any/many of the important cases, and will therefore suck (and will probably suck so badly that it's completely unusable for "real-time" where you're meant to guarantee a worst case max. time for all operations and can't accept "searching an unknowable number of entries").

Don wrote:

Those that are more complicated (blocking in the allocator, opting for a lock on the heap itself -- or not) are required even if the simpler options are omitted/hardwired. They can't be efficiently implemented outside the memory manager.

The more "complicated" options are not required at all for most applications. If they can't be implemented efficiently inside an "application provided memory manager" (e.g. because applications can't use locks/mutexes/semaphores, or because your OS doesn't provide an efficient communication mechanism) then the OS will suck for a large number of things (that don't necessarily have anything to do with memory management).

Don wrote:

Consider how you would allow a thread to wait for a particular allocation request to be handled -- in the absence of sufficient resources in the free list -- spin on the failing malloc()? And, hope no other thread sneaks in to grab any memory that becomes available while the other thread is busy spinning?

If, for some completely unimaginable reason I actually wanted to wait for memory (on an embedded system where my application is likely the only application, on hardware that was supposed to have a "significant" amount of memory available, on an OS that is supposed to be real-time and therefore supposed to guarantee maximum worst case times), I would:

implement a memory manager that has a list of waiters (including how much memory they're waiting for and their thread ID)
when freeing memory; examine the list of waiters and if one or more of the waiters can be satisfied allocate the memory for them, and send a "here's the memory you were waiting for" message to the waiter
Have waiters blocked waiting to receive a message (including "wait for message with timeout" and a way to remove a thread from the list of waiters, if and only if the specific application actually does need the extra complexity this involves)

Don wrote:

I can, instead, wrap the "swiss army knife" implementation with "simple interfaces" for those cases that always need a specific set of options. A particular combination of options that are never used can simply not have a suitable wrapper provided.

For a silly/simple example; lets say that (as part of my application) I want to log all packets sent and received over TCP/IP. I have a structure containing the details I want (dest IP address, etc) which includes a "next" field (because I'm planning to use a singly linked list of these structures) and this structure is exactly 123 bytes. For cache/TLB/data locality (performance) I want to ensure that the memory used for these structures is within the same range, so I allocate a memory pool specifically for these structures. I know (from CPU manufacturer info) that it's more efficient to align the structures on an 64-byte boundary (as this happens to be cache line size) so I'm looking at managing blocks of 128 bytes. When there are more than 5000 of these entries I want to append the oldest 2500 entries to a file on disk and free the memory the blocks used. I allocate a pool that's only large enough to allow 10000 entries to be allocated, and if that pool runs out of memory then I know that something must have gone wrong with disk IO so I terminate the application.

To get "O(1)" for everything that matters for this application (without any stupid "if(FIRST_FIT) { ....} else if(LAST_FIT) { ....}" branching/bloat); I create a pool (that is exactly "10000 * 128" bytes because that's all I need) and turn it into a singly linked list of free blocks (using the structure's "next" field that I already have). My application's "lockless" allocator (assuming 32-bit 80x86) looks like this:

Code:

allocBlock:
    mov eax,[nextBlock]
.retry:
    test eax,eax
    je .done
    mov edx,[eax]                          ;edx = next
    lock cmpxchg [nextBlock],edx
    jne .retry

    lock add [allocatedBlocksMinus5000],1  ;Should we send old entries to disk?
    jne .done                              ; no, "freeBlocks - 5000 != 0"
    lock bts [allocatedBlocksMinus5000],31  ;Set the "sending to disk in progress" flag (in highest bit)
    jc .done                            ;Don't notify if the "sending to disk in progress" was already set
    call startDiskPurge                    ;Notify the thread responsible for sending the log to disk
.done:
    ret

I'd only free 2500 blocks at a time (and this is only ever done by the thread responsible for appending the data to the file on disk). This code would unlink the oldest 2500 entries, do the disk IO stuff, then merge the list of 2500 "now free" blocks with the allocator's list of free blocks in a single lockless "lock cmpxchg [nextBlock], ..." loop, then do a "lock sub allocatedBlocksMinus5000],2500 | (1 << 31)" to atomically adjust the number of allocated blocks and clear the "sending to disk in progress" flag at the same time.

Now see if you can tell me exactly how your "swiss army knife" would be used to implement this "application specific lockless allocator" better than the application's own custom designed allocator that I've described. Note that this is only one example (with relatively simple requirements), and an application may have/want many different allocators each specifically designed to be "as ideal as possible" for extremely different requirements/usage.

Don wrote:

Quote:

A better idea would be to provide functions to create and destroy memory pools, and then let the application provide its own heap manager/s that are custom designed specifically for the application (but also provide a simple "no frills" default heap manager for cases where the application simply doesn't care and is happy with any old "malloc() like" thing).

The entire RTOS is "designed specifically for the application". I don't include "things that won't be needed" (unused features and capabilities) as those things cost resources. Those resources add to the selling price of the product. And, as I'm not hosting other applications (I design appliances, not workstation environments), there's no point in including support for things that might be used (but aren't).

"OS designed specifically for the application" is an oxymoron. Either it's designed for one specific application and therefore it's part of one specific application and not an OS at all; or it's not designed specifically for one specific application.

Cheers,

Brendan

gerryg400 · **Posted:** Thu Apr 28, 2016 3:02 pm

I see two issues, although you may decide that they are unimportant for your project.

Firstly,

Quote:

it can be placed at the HEAD of the free list, TAIL, inserted before the
first free fragment having an ADDRESS exceeding that of the newly released
fragment or before the first free fragment having a SIZE greater than the
size of the fragment being released

This seems to imply that you have already chosen a linked list to track free memory. In some small embedded systems this is okay but it doesn't scale for memory size and it doesn't scale for multicore. Furthermore you will lose the freedom to improve your internal algorithm because your interface describes your internals. You cannot change the internals without changing the interface.

Secondly, it looks like it would be fairly easy for a thread to use the options you provide to deliberately fragment memory. Not sure whether that's an issue for you or not.

Don · **Joined:** Wed Apr 27, 2016 12:52 pm **Posts:** 11

(sigh) I really dislike this sort of user interface.

Apologies if I botch the attributions... This is simply too tedious to interact like this!

Quote:

Have you considered what is involved to handle each of the "options" I described?

Yes. For example, the "FIRST_FIT, LAST_FIT, BEST_FIT, WORST_FIT, EXACT_FIT, NEVER_FIT" menagerie implies that (internally) the memory manager is going to have to use an extremely stupid and slow "list of entries" approach to handle all the unnecessary options; and this will involve searching the list and can't/won't be "O(1)" for any/many of the important cases, and will therefore suck (and will probably suck so badly that it's completely unusable for "real-time" where you're meant to guarantee a worst case max. time for all operations and can't accept "searching an unknowable number of entries").

Actually, it is MORE efficient (in time and space) exactly because the (efficient) developer KNOWS how the heap will be used and will take measures to ensure it remains in a state conducive to his future requests.

Consider how a buffer pool works; essentially, all you have is a list (regardless of whether implemented as such or as a bitmap, etc.) that allows you to locate AN available buffer -- in almost constant time (depends on how you represent "free").

How is that any different from:

Code:

(pointer, size) := allocate(HEAP, BUFFER_SIZE, EXACT_FIT, FIRST_FIT, TRIM, TAKE_ANY, AS_REQUESTED);

?

The creation of the pool happens "for free" -- each new buffer requested chops off a BUFFER_SIZE'd piece of the heap and delivers it to the caller. When the caller is done, he can:

Code:

result := release(HEAP, pointer, AT_HEAD, AS_ALLOCATED)

to return the buffer to the heap WHILE KEEPING IT AS A DISCRETE FRAGMENT within the heap. The next
such request would (coincidentally) find that the first fragment happens to be EXACTLY what the caller is looking for!

If requests come in faster than releases, then each request incurs some incremental FIXED cost to slice the remaining large fragment in the free list (that represents all of the memory that has NEVER been allocated) to form a new BUFFER_SIZE fragment for use by the user. Eventually, it, too, will be released -- and retain its size in the heap.

You can expedite the cost of dicing up the heap by using:

Code:

result := release(HEAP, pointer, AT_TAIL, AS_ALLOCATED)

to ensure the next request encounters the "big fragment of unused memory" before it encouters any previously released BUFFER_SIZE pieces therefore causing them to be sliced and diced early on.

If you want to remove ALL of the incremental costs of chopping up the heap into these BUFFERS, then just iterate through the process before you elect to actually use any of them.

Quote:

Don wrote:

Those that are more complicated (blocking in the allocator, opting for a lock on the heap itself -- or not) are required even if the simpler options are omitted/hardwired. They can't be efficiently implemented outside the memory manager.

The more "complicated" options are not required at all for most applications. If they can't be implemented efficiently inside an "application provided memory manager" (e.g. because applications can't use locks/mutexes/semaphores, or because your OS doesn't provide an efficient communication mechanism) then the OS will suck for a large number of things (that don't necessarily have anything to do with memory management).

What do you see the role of an operating system to be? Let's have the application developer write his own memory management routines, his own scheduler, his own file system, etc.?

In MY applications, the complicated options ARE required. Perhaps you missed:

Quote:

I'm working in an embedded/appliance market on custom hardware. So, no constraints as to API's, porting legacy code, etc. The environment is real-time, multithreaded and object-based. The pieces of memory being managed are large-ish; the amount of memory available is significant.

Would you advocate each application sorting out how to handle its own deadline violations? Perhaps dedicate a hardware timer to each thread so IT could program the timer to notify it when its deadline has passed? Or, would you embody that as an OS service?

Would you advocate each thread and task/process communicate with its peers to come up with a mutually agreeable scheduling decision, NOW? Or, would you embody that as an OS service?

Would you insist that all memory consumers use a memory manager that imposes the overhead of a mutex on every access -- even if the memory manager is just being used by a single thread -- AT THIS TIME? Or, let each thread craft its own memory services??

Quote:

Don wrote:

Consider how you would allow a thread to wait for a particular allocation request to be handled -- in the absence of sufficient resources in the free list -- spin on the failing malloc()? And, hope no other thread sneaks in to grab any memory that becomes available while the other thread is busy spinning?

If, for some completely unimaginable reason I actually wanted to wait for memory (on an embedded system where my application is likely the only application, on hardware that was supposed to have a "significant" amount of memory available, on an OS that is supposed to be real-time and therefore supposed to guarantee maximum worst case times), I would:

implement a memory manager that has a list of waiters (including how much memory they're waiting for and their thread ID)
when freeing memory; examine the list of waiters and if one or more of the waiters can be satisfied allocate the memory for them, and send a "here's the memory you were waiting for" message to the waiter
Have waiters blocked waiting to receive a message (including "wait for message with timeout" and a way to remove a thread from the list of waiters, if and only if the specific application actually does need the extra complexity this involves)

Your comments suggest you are thinking too small. I'm not designing a microwave oven. Or, a process control system that monitors a few dozen sensors and controls as many actuators. Possibly on a multimillion dollar piece of equipment.

Think scores of processors distributed over large areas (three-dimensional football fields). Tackling things like VoIP/video/audio delivery, location awareness, speech synthesis/recognition, security/access control, physical plant, etc. Think hundreds of applications and as many concurrent users.

And, unlike a desktop where the user can reboot, kill a job, hit reset, etc., I have to run, reliably, 24/7/365. If Joe User in cubicle #743 wants to "play pong", I have to ensure that:

I still have the resources (time, memory, I/O's, available power) to satisfy the GUARANTEED services that The System must provide to its users
Joe User can't munge anything that he's not allowed to munge
I can harvest Joe User's resources if other, "more pressing" demands materialize.

Quote:

Don wrote:

I can, instead, wrap the "swiss army knife" implementation with "simple interfaces" for those cases that always need a specific set of options. A particular combination of options that are never used can simply not have a suitable wrapper provided.

For a silly/simple example; lets say that (as part of my application) I want to log all packets sent and received over TCP/IP. I have a structure containing the details I want (dest IP address, etc) which includes a "next" field (because I'm planning to use a singly linked list of these structures) and this structure is exactly 123 bytes. For cache/TLB/data locality (performance) I want to ensure that the memory used for these structures is within the same range, so I allocate a memory pool specifically for these structures. I know (from CPU manufacturer info) that it's more efficient to align the structures on an 64-byte boundary (as this happens to be cache line size) so I'm looking at managing blocks of 128 bytes. When there are more than 5000 of these entries I want to append the oldest 2500 entries to a file on disk and free the memory the blocks used. I allocate a pool that's only large enough to allow 10000 entries to be allocated, and if that pool runs out of memory then I know that something must have gone wrong with disk IO so I terminate the application.

To get "O(1)" for everything that matters for this application (without any stupid "if(FIRST_FIT) { ....} else if(LAST_FIT) { ....}" branching/bloat); I create a pool (that is exactly "10000 * 128" bytes because that's all I need) and turn it into a singly linked list of free blocks (using the structure's "next" field that I already have). My application's "lockless" allocator (assuming 32-bit 80x86) looks like this:

Code:

count := 0
while (count < 10000) {
   (pointer, size) := allocate(HEAP, 128, EXACT_FIT, FIRST_FIT, TRIM, TAKE_TAIL, AS_REQUESTED);
   if (pointer != nil)
      result := release(HEAP, pointer, BY_LOCATION, AS_ALLOCATED)
}

creates your 10000 128 byte buffers and ensures they are "ordered" BY_LOCATION.

The NEXT allocate request will yield the buffer having the lowest memory address -- and will provide it in constant time. The allocation after that will allocate the buffer having the next higher address -- in the exact same, constant time.

Quote:

I'd only free 2500 blocks at a time (and this is only ever done by the thread responsible for appending the data to the file on disk). This code would unlink the oldest 2500 entries, do the disk IO stuff, then merge the list of 2500 "now free" blocks with the allocator's list of free blocks in a single lockless "lock cmpxchg [nextBlock], ..." loop, then do a "lock sub allocatedBlocksMinus5000],2500 | (1 << 31)" to atomically adjust the number of allocated blocks and clear the "sending to disk in progress" flag at the same time.

Now see if you can tell me exactly how your "swiss army knife" would be used to implement this "application specific lockless allocator" better than the application's own custom designed allocator that I've described. Note that this is only one example (with relatively simple requirements), and an application may have/want many different allocators each specifically designed to be "as ideal as possible" for extremely different requirements/usage.

I counter (REAL example): you have live video coming off a camera on a node, "somewhere". You have to examine the video, in real time (i.e., it's an endless stream, even if you opt to buffer it somewhere, you still have to be able to process it at the same effective rate that it is being generated). You have to monitor for signs of motion in a certain portion of the image and signal an event when that is detected or persists for some period of time.

If you sense motion, you have to determine if a person is present (i.e., not just trees blowing in the breeze) and, if so, recognize his/her face. Based on who you do/don't recognize, you may have to unlock a door, connect a bidirectional audio stream between the guest and the receptionist, notify the boss that his wife has arrived, etc.

At the same time, you have to record all of this to persistent store.

And, present it for live review by an "attendant" located "in the guard shack".

Of course, Joe User shouldn't be able to eavesdrop on any of this stuff happening -- even if the resources to accomplish these tasks happen to be coming from "his" node!

And, all these "applications" are running on different device nodes scattered around the building. How many memory managers are you going to write to handle this "simple" set of applications? How much time are you going to devote to quantifying, documenting and validating each of their operations?

MIPS are dirt cheap. So is memory. Developer time is where the money gets wasted -- and user disappointment. "Gee, should I run each node at 300MHz? Or 600? 64MB of DRAM? or 256MB? One core or two? 100BaseTX or GBe?"

If I spend a few instruction fetches in a critical region... feh.

Quote:

"OS designed specifically for the application" is an oxymoron. Either it's designed for one specific application and therefore it's part of one specific application and not an OS at all; or it's not designed specifically for one specific application.

[/quote]

Seems like an arrogant assumption. Do you think embedded devices are just "spaghetti code" simply because YOU can't load an application of your own choosing onto it? Is a math library no longer a math library because a single application is using it? Amusing when you consider there are probably two orders of magnitude MORE embedded systems in the world than all the desktop machines combined!

Is Android only an OS *if* a user opts to load some non-default application onto it? I.e., if he uses the stock phone, as is, how does that differ from using an OS in a "single" application?

Don · **Joined:** Wed Apr 27, 2016 12:52 pm **Posts:** 11

gerryg400 wrote:

I see two issues, although you may decide that they are unimportant for your project.

Firstly,

Quote:

it can be placed at the HEAD of the free list, TAIL, inserted before the
first free fragment having an ADDRESS exceeding that of the newly released
fragment or before the first free fragment having a SIZE greater than the
size of the fragment being released

This seems to imply that you have already chosen a linked list to track free memory. In some small embedded systems this is okay but it doesn't scale for memory size and it doesn't scale for multicore. Furthermore you will lose the freedom to improve your internal algorithm because your interface describes your internals. You cannot change the internals without changing the interface.

I elected to express the operation in terms of a "list" -- its easier for people to associate "order" and "ranking" (by some set of criteria) with a sequential entity. Also to imagine how you can look at such a structure "backwards" to get complementary results. (e.g., WORST_FIT on a list sorted BY_SIZE suggests you start looking from the "end" instead of the "beginning")

My question is intended to elicit comments regarding other "options" that I may not have considered. These are a collection of how I've addressed individual needs of the applications, rolled into a single, unified manager (solve and validate the problem ONCE).

Quote:

Secondly, it looks like it would be fairly easy for a thread to use the options you provide to deliberately fragment memory. Not sure whether that's an issue for you or not.

Yes. It's also easy for a task to try to wire down all of the memory available in the machine! Or, to hammer away at a particular resource to the exclusion of other consumers. A "black box" COTS allocator is no safer. And, if you're writing an allocator for each specific need (as Brendan suggested elsewhere), then you'll have diligently documented how each of those behaves, right?

Or, just relied on their proper operation to be verified by end-to-end testing?

I assume developers are competent. But, also have mechanisms in place to ensure these sorts of abuses penalize the individual developer, and not the other applications running in the system. E.g., you need appropriate credentials (capabilities) to use any resource (time, memory, individual services, system power, etc.) and each of those have fine grained "per client" controls: you can use this service for this purpose, ONLY; you can use this much memory for this period of time, ONLY; etc.

If a developer naively picks a set of options for allocations/releases that shoots him in the foot, then his application will suffer the consequences (as its HIS heap that he's mucking with, not "mine")

OTOH, I have different mechanisms for "user scripting" that assume the user is NOT competent.

gerryg400 · **Posted:** Thu Apr 28, 2016 4:43 pm

Quote:

I elected to express the operation in terms of a "list" -- its easier for people to associate "order" and "ranking" (by some set of criteria) with a sequential entity. Also to imagine how you can look at such a structure "backwards" to get complementary results. (e.g., WORST_FIT on a list sorted BY_SIZE suggests you start looking from the "end" instead of the "beginning")

But what if there is not a list. What if you devise an algorithm where every free item is stored in a such a way that the correct size is immediately available and there is no opportunity to fit. In this case FIRST == LAST. If the user asked for WORST_FIT would you waste time looking for that when your algorithm has automatically already given you the PERFECT_FIT ? And if so, why ?

In this case where there are actually no lists, what do HEAD and TAIL mean ?

Also, I think that by allowing the user to specify that free areas not be coalesced you make it difficult for perfectly fine algorithms like the 'boundary tag' algorithm to work optimally.

Don · **Joined:** Wed Apr 27, 2016 12:52 pm **Posts:** 11

gerryg400 wrote:

Quote:

I elected to express the operation in terms of a "list" -- its easier for people to associate "order" and "ranking" (by some set of criteria) with a sequential entity. Also to imagine how you can look at such a structure "backwards" to get complementary results. (e.g., WORST_FIT on a list sorted BY_SIZE suggests you start looking from the "end" instead of the "beginning")

But what if there is not a list. What if you devise an algorithm where every free item is stored in a such a way that the correct size is immediately available and there is no opportunity to fit. In this case FIRST == LAST. If the user asked for WORST_FIT would you waste time looking for that when your algorithm has automatically already given you the PERFECT_FIT ?

Yes.

Quote:

And if so, why ?

How would you issue a request for the smallest fragment in the heap? Or, the largest? Do you want to expose the sizes of all fragments to the caller directly and let him choose?

YOU know what your code is going to do with the memory. YOU know what you've already extracted from the heap. Presumably, YOU know how you want to use the current/remaining contents of that block of memory. You're not dealing with an subsystem that has to be able to fit with "generic" applications but, rather, with very specific ones.

Why not create all buffers to be exactly a page size -- on the premise that you can move them more effectively and consistently via RPC than having to marshall itsy-bitsy individual arguments? It's far easier for me to move pages between nodes than it is to move a hundred bytes up and down the protocol stack; why not force all API's to exploit big blocks of data (like moving chunks of audio and video around)?

Quote:

In this case where there are actually no lists, what do HEAD and TAIL mean ?

Depends on how the developer has elected to place released fragments into the "list". If he's asked for them to be "inserted" by location, then "TAIL" means "the fragment occurring at the highest memory address". If he's asked for them to be inserted by size, then "HEAD" means "the smallest fragment". Can you come up with a better lexicon to convey these concepts?

Quote:

Also, I think that by allowing the user to specify that free areas not be coalesced you make it difficult for perfectly fine algorithms like the 'boundary tag' algorithm to work optimally.

Then why have a need for buffer pools? Why bother mapping logical fragments onto physical pages? Why write a special manager for each application -- if one size fits all?

[[Meaning no disrespect but I'm finding this web-based UI to be too dreadfully inefficient. I can get far more done via mailing lists so I'm going to bow out of this.]]

Good luck to you, all!

Brendan · **Posted:** Thu Apr 28, 2016 6:38 pm

Hi,

Don wrote:

Brendan wrote:

Don wrote:

Have you considered what is involved to handle each of the "options" I described?

Yes. For example, the "FIRST_FIT, LAST_FIT, BEST_FIT, WORST_FIT, EXACT_FIT, NEVER_FIT" menagerie implies that (internally) the memory manager is going to have to use an extremely stupid and slow "list of entries" approach to handle all the unnecessary options; and this will involve searching the list and can't/won't be "O(1)" for any/many of the important cases, and will therefore suck (and will probably suck so badly that it's completely unusable for "real-time" where you're meant to guarantee a worst case max. time for all operations and can't accept "searching an unknowable number of entries").

Actually, it is MORE efficient (in time and space) exactly because the (efficient) developer KNOWS how the heap will be used and will take measures to ensure it remains in a state conducive to his future requests.

No. Where justified (by the application's requirements), the developer will use your "swiss army knife" to pre-allocate memory to manage via. their own internal allocator/s because they know yours is a bloated and inappropriate mess.

Don wrote:

Consider how a buffer pool works; essentially, all you have is a list (regardless of whether implemented as such or as a bitmap, etc.) that allows you to locate AN available buffer -- in almost constant time (depends on how you represent "free").

How is that any different from:

Code:

(pointer, size) := allocate(HEAP, BUFFER_SIZE, EXACT_FIT, FIRST_FIT, TRIM, TAKE_ANY, AS_REQUESTED);

?

It's different because "allocate()" can't be optimised specifically for "BUFFER_SIZE" (and has to be designed to handle arbitrary sizes), can't be optimised specifically for EXACT_FIT alone (and has to be able to support NEVER_FIT), can't be optimised specifically for FIRST_FIT alone (and has to be able to suppoirt LAST_FIT, BEST_FIT, WORST_FIT), can't be optimised specifically for TRIM alone (and has to be able to support UNTRIMMED), etc. It's all of these "unnecessary/unused by the specific application" requirements that makes it impossible to implement an "allocate()" that isn't inferior for every specific application.

Don wrote:

Brendan wrote:

Don wrote:

Those that are more complicated (blocking in the allocator, opting for a lock on the heap itself -- or not) are required even if the simpler options are omitted/hardwired. They can't be efficiently implemented outside the memory manager.

The more "complicated" options are not required at all for most applications. If they can't be implemented efficiently inside an "application provided memory manager" (e.g. because applications can't use locks/mutexes/semaphores, or because your OS doesn't provide an efficient communication mechanism) then the OS will suck for a large number of things (that don't necessarily have anything to do with memory management).

What do you see the role of an operating system to be? Let's have the application developer write his own memory management routines, his own scheduler, his own file system, etc.?

"Operating systems" is too broad to have a single role. What do you think the role of *your* OS should be? How does making memory management suck (and then refusing to provide a way for applications to use custom designed alternative allocators that don't suck for the application's specific usage) help to fulfil that role?

Don wrote:

In MY applications, the complicated options ARE required. Perhaps you missed:

Quote:

I'm working in an embedded/appliance market on custom hardware. So, no constraints as to API's, porting legacy code, etc. The environment is real-time, multithreaded and object-based. The pieces of memory being managed are large-ish; the amount of memory available is significant.

Would you advocate each application sorting out how to handle its own deadline violations? Perhaps dedicate a hardware timer to each thread so IT could program the timer to notify it when its deadline has passed? Or, would you embody that as an OS service?

Would you advocate each thread and task/process communicate with its peers to come up with a mutually agreeable scheduling decision, NOW? Or, would you embody that as an OS service?

Or maybe you could ignore all the issues I've raised and go off on a random tangent that has no point? We are (were) talking about memory management. None of these things have anything to do with memory management.

Don wrote:

Would you insist that all memory consumers use a memory manager that imposes the overhead of a mutex on every access -- even if the memory manager is just being used by a single thread -- AT THIS TIME? Or, let each thread craft its own memory services??

I'm advocating that applications should be able to provide their own specialised memory manager (to replace your default memory manager where deemed necessary by the application developer); and shouldn't be forced to use a "jack of all trades, master at none" spaghetti monster of bloat. This includes having or not having the overhead of a mutex (depending on what the application developer actually requires), and it includes not putting up with the overhead of "if( MUTEX_WANTED ) { ... } else { ... }" branches needed to make "unnecessary for the specific application" run-time decisions.

Don wrote:

Brendan wrote:

Don wrote:

Consider how you would allow a thread to wait for a particular allocation request to be handled -- in the absence of sufficient resources in the free list -- spin on the failing malloc()? And, hope no other thread sneaks in to grab any memory that becomes available while the other thread is busy spinning?

If, for some completely unimaginable reason I actually wanted to wait for memory (on an embedded system where my application is likely the only application, on hardware that was supposed to have a "significant" amount of memory available, on an OS that is supposed to be real-time and therefore supposed to guarantee maximum worst case times), I would:

implement a memory manager that has a list of waiters (including how much memory they're waiting for and their thread ID)
when freeing memory; examine the list of waiters and if one or more of the waiters can be satisfied allocate the memory for them, and send a "here's the memory you were waiting for" message to the waiter
Have waiters blocked waiting to receive a message (including "wait for message with timeout" and a way to remove a thread from the list of waiters, if and only if the specific application actually does need the extra complexity this involves)

Your comments suggest you are thinking too small. I'm not designing a microwave oven. Or, a process control system that monitors a few dozen sensors and controls as many actuators. Possibly on a multimillion dollar piece of equipment.

Think scores of processors distributed over large areas (three-dimensional football fields). Tackling things like VoIP/video/audio delivery, location awareness, speech synthesis/recognition, security/access control, physical plant, etc. Think hundreds of applications and as many concurrent users.

Repeat this until you understand it:

I am writing an OS that will be used for an unknown but very wide variety of very different applications; therefore it's inconceivable for me to cater to the specific requirements of every possible application; therefore I am unable to implement a memory manager that is "ideal" for all of the radically different usages. My "swiss army knife" memory manager must suck by definition; I should give applications the ability to provide their own custom designed alternative that doesn't suck for their specific usage; and if I do provide applications with the ability to use their own alternatives I don't have any reason to (attempt to and then fail to) cover all usages in a single memory manager in the first place.

.

Don wrote:

And, unlike a desktop where the user can reboot, kill a job, hit reset, etc., I have to run, reliably, 24/7/365. If Joe User in cubicle #743 wants to "play pong", I have to ensure that:

I still have the resources (time, memory, I/O's, available power) to satisfy the GUARANTEED services that The System must provide to its users
Joe User can't munge anything that he's not allowed to munge
I can harvest Joe User's resources if other, "more pressing" demands materialize.

Um, what? You initially described the OS as "embedded with real time environment". Your memory manager is unsuitable for "real time"; and now Joe is using an "embedded OS" for playing ping pong on a system with "scores of CPUs"? Are you sure the OS isn't a general purpose OS designed for high availability servers (and is neither embedded nor real time)?

Don wrote:

Brendan wrote:

For a silly/simple example; lets say that (as part of my application) I want to log all packets sent and received over TCP/IP. I have a structure containing the details I want (dest IP address, etc) which includes a "next" field (because I'm planning to use a singly linked list of these structures) and this structure is exactly 123 bytes. For cache/TLB/data locality (performance) I want to ensure that the memory used for these structures is within the same range, so I allocate a memory pool specifically for these structures. I know (from CPU manufacturer info) that it's more efficient to align the structures on an 64-byte boundary (as this happens to be cache line size) so I'm looking at managing blocks of 128 bytes. When there are more than 5000 of these entries I want to append the oldest 2500 entries to a file on disk and free the memory the blocks used. I allocate a pool that's only large enough to allow 10000 entries to be allocated, and if that pool runs out of memory then I know that something must have gone wrong with disk IO so I terminate the application.

To get "O(1)" for everything that matters for this application (without any stupid "if(FIRST_FIT) { ....} else if(LAST_FIT) { ....}" branching/bloat); I create a pool (that is exactly "10000 * 128" bytes because that's all I need) and turn it into a singly linked list of free blocks (using the structure's "next" field that I already have). My application's "lockless" allocator (assuming 32-bit 80x86) looks like this:

Code:

count := 0
while (count < 10000) {
   (pointer, size) := allocate(HEAP, 128, EXACT_FIT, FIRST_FIT, TRIM, TAKE_TAIL, AS_REQUESTED);
   if (pointer != nil)
      result := release(HEAP, pointer, BY_LOCATION, AS_ALLOCATED)
}

creates your 10000 128 byte buffers and ensures they are "ordered" BY_LOCATION.

So I just implement my own memory manager (and use your "swiss army knife" initially to obtain memory for my application's memory manager to manage)?

Why can't I just get the memory (for the "10000 128 byte buffers ordered by location") directly from your "create_heap()" instead; and avoid calling your "spaghetti monster of bloat" 10000 times at application startup?

Don wrote:

The NEXT allocate request will yield the buffer having the lowest memory address -- and will provide it in constant time. The allocation after that will allocate the buffer having the next higher address -- in the exact same, constant time.

...and fairies will dance on the tip of a unicorn's horn each day at sunset! :roll:

Don wrote:

Brendan wrote:

I'd only free 2500 blocks at a time (and this is only ever done by the thread responsible for appending the data to the file on disk). This code would unlink the oldest 2500 entries, do the disk IO stuff, then merge the list of 2500 "now free" blocks with the allocator's list of free blocks in a single lockless "lock cmpxchg [nextBlock], ..." loop, then do a "lock sub allocatedBlocksMinus5000],2500 | (1 << 31)" to atomically adjust the number of allocated blocks and clear the "sending to disk in progress" flag at the same time.

Now see if you can tell me exactly how your "swiss army knife" would be used to implement this "application specific lockless allocator" better than the application's own custom designed allocator that I've described. Note that this is only one example (with relatively simple requirements), and an application may have/want many different allocators each specifically designed to be "as ideal as possible" for extremely different requirements/usage.

I counter (REAL example):

You avoid answering my question because (somewhere, possibly deep down in your subconscious) you know that you can't answer it without admitting that your "swiss army knife" is inferior to my "custom designed specifically for the application" allocator by many orders of magnitude.

Don wrote:

...you have live video coming off a camera on a node, "somewhere". You have to examine the video, in real time (i.e., it's an endless stream, even if you opt to buffer it somewhere, you still have to be able to process it at the same effective rate that it is being generated). You have to monitor for signs of motion in a certain portion of the image and signal an event when that is detected or persists for some period of time.

If you sense motion, you have to determine if a person is present (i.e., not just trees blowing in the breeze) and, if so, recognize his/her face. Based on who you do/don't recognize, you may have to unlock a door, connect a bidirectional audio stream between the guest and the receptionist, notify the boss that his wife has arrived, etc.

At the same time, you have to record all of this to persistent store.

And, present it for live review by an "attendant" located "in the guard shack".

Of course, Joe User shouldn't be able to eavesdrop on any of this stuff happening -- even if the resources to accomplish these tasks happen to be coming from "his" node!

And, all these "applications" are running on different device nodes scattered around the building. How many memory managers are you going to write to handle this "simple" set of applications? How much time are you going to devote to quantifying, documenting and validating each of their operations?

I'm going to implement none or more memory managers, if/where necessary to suit each specific application's specific requirements. For any/all the cases where I don't care (or care more about developer time) I'll just use whatever generic/default memory manager that the OS or the language happens to provide because I don't care about any of the details (and don't care about FIRST_FIT, LAST_FIT, ...).

Don wrote:

MIPS are dirt cheap. So is memory. Developer time is where the money gets wasted -- and user disappointment. "Gee, should I run each node at 300MHz? Or 600? 64MB of DRAM? or 256MB? One core or two? 100BaseTX or GBe?"

If I spend a few instruction fetches in a critical region... feh.

If someone has an application that only allocates a large amount of memory occasionally then the overhead is fairly irrelevant. If someone has a "many threaded" application where all threads are frequently allocating/freeing memory than the extra overhead can be a massive performance disaster.

You're writing an OS. You don't know what the application will be. You can't know if the overhead will be irrelevant or massive performance disaster (or something in between). You have to assume that it may be a massive performance disaster.

Don wrote:

Brendan wrote:

"OS designed specifically for the application" is an oxymoron. Either it's designed for one specific application and therefore it's part of one specific application and not an OS at all; or it's not designed specifically for one specific application.

Seems like an arrogant assumption. Do you think embedded devices are just "spaghetti code" simply because YOU can't load an application of your own choosing onto it? Is a math library no longer a math library because a single application is using it? Amusing when you consider there are probably two orders of magnitude MORE embedded systems in the world than all the desktop machines combined!

Is Android only an OS *if* a user opts to load some non-default application onto it? I.e., if he uses the stock phone, as is, how does that differ from using an OS in a "single" application?

I have no how idea any of this is supposed to relate to the "OS designed specifically for the application is an oxymoron" it's replying to.

Is there one specific application (and is the OS being specifically designed for this one specific application)? If yes; in which way do you consider it "not part of the application" when it's specifically designed for one specific application?

Is the OS being designed for a potentially very wide variety of very different applications? If yes; how do you design/optimise the OS for one specific application when there's a potentially very wide variety of very different applications?

Cheers,

Brendan

gerryg400 · **Posted:** Thu Apr 28, 2016 9:33 pm

Don wrote:

gerryg400 wrote:

Quote:

I elected to express the operation in terms of a "list" -- its easier for people to associate "order" and "ranking" (by some set of criteria) with a sequential entity. Also to imagine how you can look at such a structure "backwards" to get complementary results. (e.g., WORST_FIT on a list sorted BY_SIZE suggests you start looking from the "end" instead of the "beginning")

But what if there is not a list. What if you devise an algorithm where every free item is stored in a such a way that the correct size is immediately available and there is no opportunity to fit. In this case FIRST == LAST. If the user asked for WORST_FIT would you waste time looking for that when your algorithm has automatically already given you the PERFECT_FIT ?

Yes.

Actually I was asking what I thought was a rhetorical question. I didn't consider that the answer might actually be yes.

Don wrote:

Quote:

And if so, why ?

How would you issue a request for the smallest fragment in the heap? Or, the largest? Do you want to expose the sizes of all fragments to the caller directly and let him choose?

YOU know what your code is going to do with the memory. YOU know what you've already extracted from the heap. Presumably, YOU know how you want to use the current/remaining contents of that block of memory. You're not dealing with an subsystem that has to be able to fit with "generic" applications but, rather, with very specific ones.

Why not create all buffers to be exactly a page size -- on the premise that you can move them more effectively and consistently via RPC than having to marshall itsy-bitsy individual arguments? It's far easier for me to move pages between nodes than it is to move a hundred bytes up and down the protocol stack; why not force all API's to exploit big blocks of data (like moving chunks of audio and video around)?

I'm sorry I don't understand.

Don wrote:

Quote:

In this case where there are actually no lists, what do HEAD and TAIL mean ?

Depends on how the developer has elected to place released fragments into the "list". If he's asked for them to be "inserted" by location, then "TAIL" means "the fragment occurring at the highest memory address". If he's asked for them to be inserted by size, then "HEAD" means "the smallest fragment". Can you come up with a better lexicon to convey these concepts?

The problem is not lexical. It's quite simple really. You keep talking about 'lists' and 'order' and 'sorting'. Your interface is forcing you to implement an allocator that holds its free items in (perhaps notional) lists and sorts or orders them. Why ? No efficient allocator keeps sorted lists of free items. These flags make no sense for an efficient allocator. There is no 'insertion', there is no 'tail', there is no sorting by memory address or by size. If you are doing these things you are doing it wrong. Adding more and more items to your interface will never fix your design.

OSDev.org

Unified (userland) Memory Management I [long]

Who is online