Improve the allocation algorithm in PoolChunk

Motivation:

Depth-first search is not always efficient for buddy allocation.

Modification:

Employ a new, faster search algorithm that uses a different memoryMap layout.
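
The gist of the new layout and search is easier to see in isolation. Below is a
minimal, self-contained sketch of the idea, not the code in this commit: the class,
field, and helper names are illustrative, and the unusable-marking plus parent-update
steps are condensed into a single method.

    // memoryMap sketch: map[id] = shallowest depth at which the subtree rooted at id
    // still has a free node. One top-down walk replaces the old depth-first search.
    final class BuddySketch {
        private final int maxOrder;   // depth of the leaves (pages)
        private final byte[] map;     // heap-like array, ids 1 .. 2^(maxOrder+1) - 1
        private final byte unusable;  // maxOrder + 1 means "nothing free below this node"

        BuddySketch(int maxOrder) {
            this.maxOrder = maxOrder;
            unusable = (byte) (maxOrder + 1);
            map = new byte[1 << (maxOrder + 1)];
            for (int id = 1; id < map.length; id++) {
                map[id] = (byte) (31 - Integer.numberOfLeadingZeros(id)); // depth of id
            }
        }

        /** Allocate one node at depth d (size = chunkSize / 2^d); returns the node id or -1. */
        int allocateNode(int d) {
            if (map[1] > d) {
                return -1;                       // the whole chunk cannot satisfy the request
            }
            int id = 1;
            for (int depth = 0; depth < d; depth++) {
                id <<= 1;                        // go left first
                if (map[id] > d) {
                    id ^= 1;                     // left subtree cannot satisfy, use the right sibling
                }
            }
            map[id] = unusable;                  // mark the chosen node as fully allocated
            for (int p = id >>> 1; p >= 1; p >>>= 1) {
                map[p] = (byte) Math.min(map[2 * p], map[2 * p + 1]); // propagate state upwards
            }
            return id;
        }

        public static void main(String[] args) {
            BuddySketch chunk = new BuddySketch(2);    // 4 pages
            System.out.println(chunk.allocateNode(2)); // first page      -> 4
            System.out.println(chunk.allocateNode(1)); // half the chunk  -> 3
            System.out.println(chunk.allocateNode(2)); // second page     -> 5
            System.out.println(chunk.allocateNode(1)); // no half left    -> -1
        }
    }
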

Result:

With the thread-local cache disabled, we see a large performance
improvement, especially when the allocation size is as small as the
page size, which previously had the largest search space:

-- master head --
Benchmark                (size) Mode    Score  Error Units
pooledDirectAllocAndFree  8192 thrpt  215.392  1.565 ops/ms
pooledDirectAllocAndFree 16384 thrpt  594.625  2.154 ops/ms
pooledDirectAllocAndFree 65536 thrpt 1221.520 18.965 ops/ms
pooledHeapAllocAndFree    8192 thrpt  217.175  1.653 ops/ms
pooledHeapAllocAndFree   16384 thrpt  587.250 14.827 ops/ms
pooledHeapAllocAndFree   65536 thrpt 1217.023 44.963 ops/ms

-- changes --
Benchmark                (size) Mode    Score  Error Units
pooledDirectAllocAndFree  8192 thrpt 3656.744 94.093 ops/ms
pooledDirectAllocAndFree 16384 thrpt 4087.152 22.921 ops/ms
pooledDirectAllocAndFree 65536 thrpt 4058.814 29.276 ops/ms
pooledHeapAllocAndFree    8192 thrpt 3640.355 44.418 ops/ms
pooledHeapAllocAndFree   16384 thrpt 4030.206 24.365 ops/ms
pooledHeapAllocAndFree   65536 thrpt 4103.991 70.991 ops/ms
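
For context, throughput figures like these come from a JMH allocate-and-release
microbenchmark. A rough equivalent is sketched below; it is not the actual
netty-microbench class, and the allocator constructor arguments (caches disabled by
passing zero tiny/small/normal cache sizes) are an assumption about how "thread-local
cache disabled" was configured.

    import java.util.concurrent.TimeUnit;

    import org.openjdk.jmh.annotations.*;

    import io.netty.buffer.ByteBuf;
    import io.netty.buffer.PooledByteBufAllocator;

    @State(Scope.Benchmark)
    @BenchmarkMode(Mode.Throughput)
    @OutputTimeUnit(TimeUnit.MILLISECONDS)
    public class PooledAllocAndFreeBenchmark {

        @Param({ "8192", "16384", "65536" })
        public int size;

        // Thread-local caches disabled (tiny/small/normal cache sizes = 0) so that every
        // allocation actually exercises the PoolChunk search path.
        private final PooledByteBufAllocator allocator =
                new PooledByteBufAllocator(true, 1, 1, 8192, 11, 0, 0, 0);

        @Benchmark
        public void pooledDirectAllocAndFree() {
            ByteBuf buf = allocator.directBuffer(size);
            buf.release();
        }

        @Benchmark
        public void pooledHeapAllocAndFree() {
            ByteBuf buf = allocator.heapBuffer(size);
            buf.release();
        }
    }
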
commit 004ffbad90
parent 58f4b4b7d9
Author: Pavan Kumar (2014-06-18 19:57:20 -07:00), committed by Trustin Lee

@@ -16,25 +16,124 @@
package io.netty.buffer;

import io.netty.util.collection.IntObjectHashMap;
/**
* Description of algorithm for PageRun/PoolSubpage allocation from PoolChunk
*
* Notation: The following terms are important to understand the code
* > page - a page is the smallest unit of memory chunk that can be allocated
* > chunk - a chunk is a collection of pages
* > in this code chunkSize = 2^{maxOrder} * pageSize
*
* To begin we allocate a byte array of size = chunkSize
* Whenever a ByteBuf of given size needs to be created we search for the first position
* in the byte array that has enough empty space to accommodate the requested size and
* return a (long) handle that encodes this offset information, (this memory segment is then
* marked as reserved so it is always used by exactly one ByteBuf and no more)
*
* For simplicity all sizes are normalized according to PoolArena#normalizeCapacity method
* This ensures that when we request for memory segments of size >= pageSize the normalizedCapacity
* equals the next nearest power of 2
*
* To search for the first offset in chunk that has at least requested size available we construct a
* complete balanced binary tree and store it in an array (just like heaps) - memoryMap
*
* The tree looks like this (the size of each node being mentioned in the parenthesis)
*
* depth=0 1 node (chunkSize)
* depth=1 2 nodes (chunkSize/2)
* ..
* ..
* depth=d 2^d nodes (chunkSize/2^d)
* ..
* depth=maxOrder 2^maxOrder nodes (chunkSize/2^{maxOrder} = pageSize)
*
* depth=maxOrder is the last level and the leafs consist of pages
*
* With this tree available, searching in the chunk translates like this:
* To allocate a memory segment of size chunkSize/2^k we search for the first node (from left) at depth k
* which is unused
*
* Algorithm:
* ----------
* Encode the tree in memoryMap with the notation
* memoryMap[id] = x => in the subtree rooted at id, the first node that is free to be allocated
* is at depth x (counted from depth=0) i.e., at depths [depth_of_id, x), there is no node that is free
*
* As we allocate & free nodes, we update values stored in memoryMap so that the property is maintained
*
* Initialization -
* In the beginning we construct the memoryMap array by storing the depth of a node at each node
* i.e., memoryMap[id] = depth_of_id
*
* Observations:
* -------------
* 1) memoryMap[id] = depth_of_id => it is free / unallocated
* 2) memoryMap[id] > depth_of_id => at least one of its child nodes is allocated, so we cannot allocate it, but
* some of its children can still be allocated based on their availability
* 3) memoryMap[id] = maxOrder + 1 => the node is fully allocated & thus none of its children can be allocated, it
* is thus marked as unusable
*
* Algorithm: [allocateNode(d) => we want to find the first node (from left) at depth d that can be allocated]
* ----------
* 1) start at root (i.e., depth = 0 or id = 1)
* 2) if memoryMap[1] > d => cannot be allocated from this chunk
* 3) if left node value <= d: we can allocate from the left subtree, so move left and repeat until found
* 4) else try the right subtree
*
* Algorithm: [allocateRun(size)]
* ----------
* 1) Compute d = log_2(chunkSize/size)
* 2) Return allocateNode(d)
*
* Algorithm: [allocateSubpage(size)]
* ----------
* All subpages allocated are stored in a map at key = elemSize
* 1) if subpage at elemSize != null: try allocating from it.
* if it fails: allocateSubpageSimple
* 2) else: just allocateSubpageSimple
*
* Algorithm: [allocateSubpageSimple(size)]
* ----------
* 1) use allocateNode(maxOrder) to find an empty (i.e., unused) leaf (i.e., page)
* 2) use this handle to construct the PoolSubpage object or, if it already exists, just initialize it
* with the required normCapacity
* 3) store (insert / overwrite) the subpage in the elemSubpages map for easier access
*
* Note:
* -----
* In the implementation, to improve cache coherence,
* we store 2 pieces of information (i.e., 2 byte values) packed into a short value in memoryMap
*
* memoryMap[id] = (depth_of_id, x)
* where, as per the convention defined above,
* the second value (i.e., x) indicates that the first node which is free to be allocated is at depth x (from root)
*/
final class PoolChunk<T> {

    private static final int BYTE_LENGTH = 8;
    private static final int BYTE_MASK = 0xFF;
    private static final int INV_BYTE_MASK = ~BYTE_MASK;

    final PoolArena<T> arena;
    final T memory;
    final boolean unpooled;

    private final short[] memoryMap;
    private final PoolSubpage<T>[] subpages;
    private final IntObjectHashMap<PoolSubpage<T>> elemSubpages;
    /** Used to determine if the requested capacity is equal to or greater than pageSize. */
    private final int subpageOverflowMask;
    private final int pageSize;
    private final int pageShifts;
    private final int maxOrder;
    private final int chunkSize;
    private final int log2ChunkSize;
    private final int maxSubpageAllocs;
    /** Used to mark memory as unusable */
    private final byte unusable;

    private int freeBytes;
@@ -51,25 +150,31 @@ final class PoolChunk<T> {
        this.memory = memory;
        this.pageSize = pageSize;
        this.pageShifts = pageShifts;
        this.maxOrder = maxOrder;
        this.chunkSize = chunkSize;
        unusable = (byte) (maxOrder + 1);
        log2ChunkSize = Integer.SIZE - 1 - Integer.numberOfLeadingZeros(chunkSize);
        subpageOverflowMask = ~(pageSize - 1);
        freeBytes = chunkSize;

        assert maxOrder < 30 : "maxOrder should be < 30, but is : " + maxOrder;
        maxSubpageAllocs = 1 << maxOrder;

        // Generate the memory map.
        memoryMap = new short[maxSubpageAllocs << 1];
        int memoryMapIndex = 1;
        for (int d = 0; d <= maxOrder; ++d) { // move down the tree one level at a time
            short dd = (short) ((d << BYTE_LENGTH) | d);
            for (int p = 0; p < (1 << d); ++p) {
                // in each level traverse left to right and set the depth of subtree
                // that is completely free to be my depth since I am totally free to start with
                memoryMap[memoryMapIndex] = dd;
                memoryMapIndex += 1;
            }
        }

        subpages = newSubpageArray(maxSubpageAllocs);
        elemSubpages = new IntObjectHashMap<PoolSubpage<T>>(pageShifts);
    }

    /** Creates a special chunk that is not pooled. */
@@ -79,10 +184,14 @@ final class PoolChunk<T> {
        this.memory = memory;
        memoryMap = null;
        subpages = null;
        elemSubpages = null;
        subpageOverflowMask = 0;
        pageSize = 0;
        pageShifts = 0;
        maxOrder = 0;
        unusable = (byte) (maxOrder + 1);
        chunkSize = size;
        log2ChunkSize = Integer.SIZE - 1 - Integer.numberOfLeadingZeros(chunkSize);
        maxSubpageAllocs = 0;
    }
@@ -104,218 +213,141 @@ final class PoolChunk<T> {
    }

    long allocate(int normCapacity) {
        if ((normCapacity & subpageOverflowMask) != 0) { // >= pageSize
            return allocateRun(normCapacity);
        } else {
            return allocateSubpage(normCapacity);
        }
    }

    /**
     * Update method used by allocate
     * This is triggered only when a successor is allocated and all its predecessors
     * need to update their state
     * The minimal depth at which the subtree rooted at id has some free space is propagated upwards
     *
     * @param id id
     */
    private void updateParentsAlloc(int id) {
        while (id > 1) {
            int parentId = id >>> 1;
            byte mem1 = value(id);
            byte mem2 = value(id ^ 1);
            byte mem = mem1 < mem2 ? mem1 : mem2;
            setVal(parentId, mem);
            id = parentId;
        }
    }

    /**
     * Update method used by free
     * This needs to handle the special case when both children are completely free,
     * in which case the parent can be directly allocated on a request of size = child-size * 2
     *
     * @param id id
     */
    private void updateParentsFree(int id) {
        int logChild = depth(id) + 1;
        while (id > 1) {
            int parentId = id >>> 1;
            byte mem1 = value(id);
            byte mem2 = value(id ^ 1);
            byte mem = mem1 < mem2 ? mem1 : mem2;
            setVal(parentId, mem);

            logChild -= 1; // in first iteration equals log, subsequently reduce 1 from logChild as we traverse up
            if (mem1 == logChild && mem2 == logChild) {
                setVal(parentId, (byte) (logChild - 1));
            }
            id = parentId;
        }
    }

    private int allocateNode(int d) {
        int id = 1;
        byte mem = value(id);
        if (mem > d) { // unusable
            return -1;
        }
        while (mem < d || (id & (1 << d)) == 0) {
            id = id << 1;
            mem = value(id);
            if (mem > d) {
                id = id ^ 1;
                mem = value(id);
            }
        }
        setVal(id, unusable); // mark as unusable: maximum valid input is d = maxOrder
        updateParentsAlloc(id);
        return id;
    }

    private long allocateRun(int normCapacity) {
        int numPages = normCapacity >>> pageShifts;
        int d = maxOrder - (Integer.SIZE - 1 - Integer.numberOfLeadingZeros(numPages));
        int id = allocateNode(d);
        if (id < 0) {
            return id;
        }
        freeBytes -= runLength(id);
        return id;
    }

    private long allocateSubpage(int normCapacity) {
        PoolSubpage<T> subpage = elemSubpages.get(normCapacity);
        if (subpage != null) {
            long handle = subpage.allocate();
            if (handle >= 0) {
                return handle;
            }
            // if the subpage is full (i.e., handle < 0) then replace it in elemSubpages with a new subpage
        }
        return allocateSubpageSimple(normCapacity);
    }

    private long allocateSubpageSimple(int normCapacity) {
        int d = maxOrder; // subpages can only be allocated from pages, i.e., leaves
        int id = allocateNode(d);
        if (id < 0) {
            return id;
        }
        freeBytes -= pageSize;

        int subpageIdx = subpageIdx(id);
        PoolSubpage<T> subpage = subpages[subpageIdx];
        if (subpage == null) {
            subpage = new PoolSubpage<T>(this, id, runOffset(id), pageSize, normCapacity);
            subpages[subpageIdx] = subpage;
        } else {
            subpage.init(normCapacity);
        }
        elemSubpages.put(normCapacity, subpage); // store the subpage at its elemSize key
        return subpage.allocate();
    }

    void free(long handle) {
        int memoryMapIdx = (int) handle;
        int bitmapIdx = (int) (handle >>> 32);

        if (bitmapIdx != 0) { // free a subpage
            PoolSubpage<T> subpage = subpages[subpageIdx(memoryMapIdx)];
            assert subpage != null && subpage.doNotDestroy;
            if (subpage.free(bitmapIdx & 0x3FFFFFFF)) {
                return;
            }
        }
        freeBytes += runLength(memoryMapIdx);
        setVal(memoryMapIdx, depth(memoryMapIdx));
        updateParentsFree(memoryMapIdx);
    }

    void initBuf(PooledByteBuf<T> buf, long handle, int reqCapacity) {
        int memoryMapIdx = (int) handle;
        int bitmapIdx = (int) (handle >>> 32);
        if (bitmapIdx == 0) {
            byte val = value(memoryMapIdx);
            assert val == (maxOrder + 1) : String.valueOf(val);
            buf.init(this, handle, runOffset(memoryMapIdx), reqCapacity, runLength(memoryMapIdx));
        } else {
            initBufWithSubpage(buf, handle, bitmapIdx, reqCapacity);
        }
@@ -329,38 +361,45 @@ final class PoolChunk<T> {
        assert bitmapIdx != 0;

        int memoryMapIdx = (int) handle;

        PoolSubpage<T> subpage = subpages[subpageIdx(memoryMapIdx)];
        assert subpage.doNotDestroy;
        assert reqCapacity <= subpage.elemSize;

        buf.init(
            this, handle,
            runOffset(memoryMapIdx) + (bitmapIdx & 0x3FFFFFFF) * subpage.elemSize, reqCapacity, subpage.elemSize);
    }

    private byte value(int id) {
        return (byte) (memoryMap[id] & BYTE_MASK);
    }

    private void setVal(int id, byte val) {
        memoryMap[id] = (short) ((memoryMap[id] & INV_BYTE_MASK) | val);
    }

    private byte depth(int id) {
        short val = memoryMap[id];
        return (byte) (val >>> BYTE_LENGTH);
    }

    private int runLength(int id) {
        // represents the size in #bytes supported by node 'id' in the tree
        return 1 << (log2ChunkSize - depth(id));
    }

    private int runOffset(int id) {
        // represents the 0-based offset in #bytes from start of the byte-array chunk
        int shift = id - (1 << depth(id));
        return shift * runLength(id);
    }

    private int subpageIdx(int memoryMapIdx) {
        return memoryMapIdx - maxSubpageAllocs;
    }

    @Override
    public String toString() {
        StringBuilder buf = new StringBuilder();
        buf.append("Chunk(");