Motivation
Direct buffers are normally preferred when interfacing with raw
sockets. Currently netty will only return direct io buffers (for reading
from a channel) when a platform has unsafe. However, this is
inconsistent with the write-side (filterOutboundMessage) where a direct
byte buffer will be returned if pooling is enabled. This means that
environments without unsafe (and no manual netty configurations) end up
with many pooled heap byte buffers for reading, many pooled direct byte
buffers for writing, and jdk pooled byte buffers (for reading).
Modifications
This commit modifies the AbstractByteBufAllocator to return a direct
byte buffer for io handling when the platform has unsafe or direct byte
buffers are pooled.
Result:
Use direct buffers when direct buffers are pooled for IO.
Motivation:
PooledByteBufAllocator uses a PoolThreadCache per Thread that allocates / deallocates to minimize the performance overhead. This PoolThreadCache is trimmed after X allocations to free up buffers that are not allocated for a long time. This works out quite well when the app continues to allocate but fails if the app stops to allocate frequently (for whatever reason) and so a lot of memory is wasted and not given back to the arena / freed.
Modifications:
- Add a ThreadExecutorMap that offers multiple methods that wrap Runnable / ThreadFactory / Executor and allow to call ThreadExecutorMap.currentEventExecutor() to get the current executing EventExecutor for the calling Thread.
- Use these methods in the constructors of our EventExecutor implementations (which also covers the EventLoop implementations)
- Add io.netty.allocator.cacheTrimIntervalMillis system property which can be used to specify a fixed rate / interval on which we should try to trim the PoolThreadCache for a EventExecutor that allocates.
- Add PooledByteBufAllocator.trimCurrentThreadCache() to allow the user to trim the cache of the calling thread manually.
- Add testcases
- Introduce FastThreadLocal.getIfExists()
Result:
Allow to better / more frequently trim PoolThreadCache and so give back memory to the area / system.
Motivation:
We should run a CI job using J9 to ensure netty also works when using different JVMs.
Modifications:
- Adjust PooledByteBufAllocatorTest to be able to complete faster when using a JVM which takes longer when joining Threads (this seems to be the case with J9).
- Skip UDT tests on J9 as UDT is not supported there.
Result:
Be able to run CI against J9.
Motivation:
We need to update to a new checkstyle plugin to allow the usage of lambdas.
Modifications:
- Update to new plugin version.
- Fix checkstyle problems.
Result:
Be able to use checkstyle plugin which supports new Java syntax.
Motivation:
There is no guarantee that FastThreadLocal.onRemoval(...) is called if the FastThreadLocal is used by "non" FastThreacLocalThreads. This can lead to all sort of problems, like for example memory leaks as direct memory is not correctly cleaned up etc.
Beside this we use ThreadDeathWatcher to check if we need to release buffers back to the pool when thread local caches are collected. In the past ThreadDeathWatcher was used which will need to "wakeup" every second to check if the registered Threads are still alive. If we can ensure FastThreadLocal.onRemoval(...) is called we do not need this anymore.
Modifications:
- Introduce ObjectCleaner and use it to ensure FastThreadLocal.onRemoval(...) is always called when a Thread is collected.
- Deprecate ThreadDeathWatcher
- Add unit tests.
Result:
Consistent way of cleanup FastThreadLocals when a Thread is collected.
Motivation:
We dont need to use the ThreadDeathWatcher if we use a FastThreadLocalThread for which we wrap the Runnable and ensure we call FastThreadLocal.removeAll() once the Runnable completes.
Modifications:
- Dont use a ThreadDeathWatcher if we are sure we will call FastThreadLocal.removeAll()
- Add unit test.
Result:
Less overhead / running theads if you only allocate / deallocate from FastThreadLocalThreads.
Motivation:
`useCacheForAllThreads` may be false which disables memory caching
on non netty threads. Setting this argument or the system property
makes it impossible to use `PooledByteBufAllocator`.
Modifications:
Delayed the check of `freeSweepAllocationThreshold` in
`PoolThreadCache` to after it knows there will be any caches in
use. Additionally, check if the caches will have any data in them
(rather than allocating a 0-length array).
A test case is also added that fails without this change.
Results:
Fixes#7194
Motivation:
PR [#6460] added a way to access the used memory of an allocator. The used naming was not very good and how things were exposed are not consistent.
Modifications:
- Add a new ByteBufAllocatorMetric and ByteBufAllocatorMetricProvider interface
- Let the ByteBufAllocator implementations implement ByteBufAllocatorMetricProvider
- Move exposed stats / metric from PooledByteBufAllocator to PooledByteBufAllocatorMetric and mark old methods as `@Deprecated`.
Result:
More consistent way to expose metric / stats for ByteBufAllocator
Motivation:
Often its useful for the user to be able to get some stats about the memory allocated via an allocator.
Modifications:
- Allow to obtain the used heap and direct memory for an allocator
- Add test case
Result:
Fixes [#6341]
Motivation:
Commit 8dda984afe introduced a regression which lead to the situation that the allocator is not set when PooledByteBuf.initUnpooled(...) is called. Thus it was possible that PooledByteBuf.alloc() returns null or the wrong allocator if multiple PooledByteBufAllocator are used in an application.
Modifications:
- Correctly set the allocator
- Add test-case
Result:
Fixes [#6436].
Motivation:
When sun.misc.Unsafe is present we want to use *Unsafe*ByteBuf implementations. We missed to do so in PooledByteBufAllocator when the heapArena is null.
Modifications:
- Correctly use UnpooledUnsafeHeapByteBuf
- Add unit tests
Result:
Use most optimal ByteBuf implementation.
Motivation:
We should only try to calculate the direct memory offset when sun.misc.Unsafe is present as otherwise it will fail with an NPE as PlatformDependent.directBufferAddress(...) will throw it.
This problem was introduced by 66b9be3a46.
Modifications:
Use offset of 0 if no sun.misc.Unsafe is present.
Result:
PooledByteBufAllocator also works again when no sun.misc.Unsafe is present.
Motivation:
64-byte alignment is recommended by the Intel performance guide (https://software.intel.com/en-us/articles/practical-intel-avx-optimization-on-2nd-generation-intel-core-processors) for data-structures over 64 bytes.
Requiring padding to a multiple of 64 bytes allows for using SIMD instructions consistently in loops without additional conditional checks. This should allow for simpler and more efficient code.
Modification:
At the moment cache alignment must be setup manually. But probably it might be taken from the system. The original code was introduced by @normanmaurer https://github.com/netty/netty/pull/4726/files
Result:
Buffer alignment works better than miss-align cache.
Motivation:
We not had tests for ByteBufAllocator implementations in general.
Modifications:
Added ByteBufAllocatorTest, AbstractByteBufAllocatorTest and UnpooledByteBufAllocatorTest
Result:
More tests for allocator implementations.
Motivation:
PooledByteBufAllocatorTest uses an ArrayQueue but access it from multiple threads (not concurrently but still from different threads). This may leak to memory visibility issues.
Modifications:
- Use a concurrent queue
- Some cleanup
Result:
Non racy test code.
Motivation:
We had a few tests PooledByteBufAllocatorTests which used parkNanos(...) to give a resource enough time to get destroyed. This is race and may not be good enough.
Modifications:
Ensure the ThreadCache is really destroyed.
Result:
No more racy tests that depend on ThreadCaches.
Motivation:
Because of a bug we missed to include the first PoolSubpage when collection metrics.
Modifications:
- Correctly include all subpages
- Add unit test
Result:
Correctly include all subpages
Motivation:
Some tests in PooledByteBufAllocatorTest are blocking on a CountDownLatch. We should use a timeout on these tests so these will not block forever on a failure.
Modifications:
Add timeout param to @Test annotation
Result:
Have sane timeouts on tests.
Motivation:
PooledByteBufAllocatorTest.testNumThreadCachesWithNoDirrectArenas() had a race as it just used LockSupport.parkNanos(). We should better use a CountdownLatch and so be sure we really have init everything.
Modifications:
Replace LockSupport.parkNanos(...) with CountdownLatch usage
Result:
No more race in test.
Motivation:
When a PoolChunk needs to get moved to the previous PoolChunkList because of the minUsage / maxUsage constraints we always just moved it one level which is incorrect and so could lead to have PoolChunks in the wrong PoolChunkList (in respect to their minUsage / maxUsage settings). This then could have the effect that PoolChunks are not released / freed in a timely fashion and so.
Modifications:
- Correctly move PoolChunks between PoolChunkLists, which includes moving it multiple "levels".
- Add unit test
Result:
Correctlty move the PoolChunk to PoolChunkList when it is freed, even if its multiple layers.
Motivation:
The PoolChunkList.minUsage() and maxUsage() needs to take special action to translate Integer.MIN_VALUE / MAX_VALUE as these are used internal for tail and head of the linked-list structure.
Modifications:
- Correct the minUsage() and maxUsage() methods.
- Add unit test.
Result:
Correct metrics
Motivation:
My previous commit b88a980482 introduced a flawed unit test,
that executes an assertion in a different thread than the test thread.
If this assertion fails, the test doesn't fail.
Modifications:
Replace the assertion by a proper workaround.
Result:
More correct unit test
Motivation:
Circular assignment of arenas to thread caches can lead to less than optimal
mappings in cases where threads are (frequently) shutdown and started.
Example Scenario:
There are a total of 2 arenas. The first two threads performing an allocation
would lead to the following mapping:
Thread 0 -> Arena 0
Thread 1 -> Arena 1
Now, assume Thread 1 is shut down and another Thread 2 is started. The current
circular assignment algorithm would lead to the following mapping:
Thread 0 -> Arena 0
Thread 2 -> Arena 0
Ideally, we want Thread 2 to use Arena 1 though.
Presumably, this is not much of an issue for most Netty applications that do all
the allocations inside the eventloop, as eventloop threads are seldomly shut down
and restarted. However, applications that only use the netty-buffer package
or implement their own threading model outside the eventloop might suffer from
increased contention. For example, gRPC Java when using the blocking stub
performs some allocations outside the eventloop and within its own thread pool
that is dynamically sized depending on system load.
Modifications:
Implement a linear scan algorithm that assigns a new thread cache to the arena
that currently backs the fewest thread caches.
Result:
Closer to ideal mappings between thread caches and arenas. In order to always
get an ideal mapping, we would have to re-balance the mapping whenever a thread
dies. However, that's difficult because of deallocation.
Motivation:
Fix a race condition that was introduced by f18990a8a5 that could lead to a NPE when allocate from the PooledByteBufAllocator concurrently by many threads.
Modifications:
Correctly synchronize on the PoolSubPage head.
Result:
No more race.