Motivation:
Its not clear that the capacity is per thread.
Modifications:
Rename system property to make it more clear that the recycler capacity is per thread.
Result:
Less confusing.
Motivation:
To better restrict resource usage we should limit the number of WeakOrderQueue instances per Thread. Once this limit is reached object that are recycled from a different Thread then the allocation Thread are dropped on the floor.
Modifications:
Add new system property io.netty.recycler.maxDelayedQueuesPerThread and constructor that allows to limit the max number of WeakOrderQueue instances per Thread for Recycler instance. The default is 2 * cores (the same as the default number of EventLoop instances per EventLoopGroup).
Result:
Better way to restrict resource / memory usage per Recycler instance.
Motivation:
Some usages of findNextPositivePowerOfTwo assume that bounds checking is taken care of by this method. However bounds checking is not taken care of by findNextPositivePowerOfTwo and instead assert statements are used to imply the caller has checked the bounds. This can lead to unexpected non power of 2 return values if the caller is not careful and thus invalidate any logic which depends upon a power of 2.
Modifications:
- Add a safeFindNextPositivePowerOfTwo method which will do runtime bounds checks and always return a power of 2
Result:
Fixes https://github.com/netty/netty/issues/5601
Motivation:
At the moment the Recyler is very sensitive to allocation bursts which means that if there is a need for X objects for only one time these will most likely end up in the Recycler and sit there forever as the normal workload only need a subset of this number.
Modifications:
Add a ratio which sets how many objects should be pooled for each new allocation. This allows to slowly increase the number of objects in the Recycler while not be to sensitive for bursts.
Result:
Less unused objects in the Recycler if allocation rate sometimes bursts.
Motivation:
We used a very high number for DEFAULT_INITIAL_MAX_CAPACITY (over 200k) which is not very relastic and my lead to very surprising memory usage if allocations happen in bursts.
Modifications:
Use a more sane default value of 32k
Result:
Less possible memory usage by default
Motivation:
We use a shared capacity per Stack for all its WeakOrderQueue elements. These relations are stored in a WeakHashMap to allow dropping these if memory pressure arise. The problem is that we not "reclaim" previous reserved space when this happens. This can lead to a Stack which has not shared capacity left which then will lead to an AssertError when we try to allocate a new WeakOderQueue.
Modifications:
- Ensure we never throw an AssertError if we not have enough space left for a new WeakOrderQueue
- Ensure we reclaim space when WeakOrderQueue is collected.
Result:
No more AssertError possible when new WeakOrderQueue is created and also correctly reclaim space that was reserved from the shared capacity.
Motivation:
Commit afafadd3d7 introduced a change which stored the Stack in the WeakOrderQueue as field. This unfortunally had the effect that it was not removed from the WeakHashMap anymore as the Stack also is used as key.
Modifications:
Do not store a reference to the Stack in WeakOrderQueue.
Result:
WeakOrderQueue can be collected correctly again.
Motivation:
Currently, the recycler max capacity it's only enforced on the
thread-local stack which is used when the recycling happens on the
same thread that requested the object.
When the recycling happens in a different thread, then the objects
will be queued into a linked list (where each node holds N objects,
default=16). These objects are then transfered into the stack when
new objects are requested and the stack is empty.
The problem is that the queue doesn't have a max capacity and that
can lead to bad scenarios. Eg:
- Allocate 1M object from recycler
- Recycle all of them from different thread
- Recycler WeakOrderQueue will contain 1M objects
- Reference graph will be very long to traverse and GC timeseems to be negatively impacted
- Size of the queue will never shrink after this
Modifications:
Add some shared counter which is used to manage capacity limits when recycle from different thread then the allocation thread. We modify the counter whenever we allocate a new Link to reduce the overhead of increment / decrement it.
Result:
More predictable number of objects mantained in the recycler pool.
Motivation:
Under very unlikely (however possible) circumstances, Recycler may leak
references. This happens _only_ when the object was already recycled
at least once (which means it's got written to the stack) and then
taken out again, and never returned.
The "never returned" part may be the fault of the user (forgotten
`finally` clause) or the situation when Recycler drops the possibly
youngest item itself.
Modifications:
Nullify the item taken from the stack.
Result:
Reference is cleaned up. If the object is lost, it will be a subject for
GC. The rest of Stack / Recycler functionality remains unaffected.
Motivation:
Sometimes people may want to trade GC with memory overhead. For this it can be useful to allow to change the capacity of the array that is hold in the Link that is used by the Recycler internally.
Modifications:
Introduce a new system property , io.netty.recycler.linkCapacity which allows to change the capcity.
Result:
More flexible configuration of netty.
Motivation:
Recycler.recycle(...) should not be used anymore and be replaced by Handle.recycle().
Modifications:
Mark it as deprecated and update usage.
Result:
Correctly document deprecated api.
Motivation:
It should be possible to disable the Recycler with -Dio.netty.recycler.maxCapacity=0, but because of a typo this is not the case.
Modifications:
Replace <= with < to make it posible to disable the Recycler.
Result:
Correct behaviour when using -Dio.netty.recycler.maxCapacity=0
Motivation:
Fix a typo in the log message of the static initializer of Recycler.
Modifications:
Fix typo.
Result:
Correctly log system property io.netty.recycler.maxCapacity.
Motivation:
Sometimes it is useful to disable recycling completely if memory constraints are very tight.
Modifications:
Allow to use -Dio.netty.recycler.maxCapacity=0 to disable recycling completely.
Result:
It's possible to disable recycling now.
Related: #3166
Motivation:
When the recyclable object created at one thread is returned at the
other thread, it is stored in a WeakOrderedQueue.
The objects stored in the WeakOrderedQueue is added back to the stack by
WeakOrderedQueue.transfer() when the owner thread ran out of recyclable
objects.
However, WeakOrderedQueue.transfer() does not have any mechanism that
prevents the stack from growing beyond its maximum capacity.
Modifications:
- Make WeakOrderedQueue.transfer() increase the capacity of the stack
only up to its maximum
- Add tests for the cases where the recyclable object is returned at the
non-owner thread
- Fix a bug where Stack.scavengeSome() does not scavenge the objects
when it's the first time it ran out of objects and thus its cursor is
null.
- Overall clean-up of scavengeSome() and transfer()
Result:
The capacity of Stack never increases beyond its maximum.
Motivation:
This fixes bug #2848 which caused Recycler to become unbounded and cache infinite number of objects with maxCapacity that's not a power of two. This can result in general sluggishness of the application and OutOfMemoryError.
Modifications:
The test for maxCapacity has been moved out of test to check if the buffer has filled. The buffer is now also capped at maxCapacity and cannot grow over it as it jumps from one power of two to the other.
Additionally, a unit test was added to verify maxCapacity is honored even when it's not a power of two.
Result:
With these changes the user is able to use a custom maxCapacity number and not have it ignored. The unit test assures this bug will not repeat itself.
Motivation:
Fix some typos in Netty.
Modifications:
- Fix potentially dangerous use of non-short-circuit logic in Recycler.transfer(Stack<?>).
- Removed double 'the the' in javadoc of EmbeddedChannel.
- Write to log an exception message if we can not get SOMAXCONN in the NetUtil's static block.
Motivation:
Recycler is used in many places to reduce GC-pressure but is still not as fast as possible because of the internal datastructures used.
Modification:
- Rewrite Recycler to use a WeakOrderQueue which makes minimal guaranteer about order and visibility for max performance.
- Recycling of the same object multiple times without acquire it will fail.
- Introduce a RecyclableMpscLinkedQueueNode which can be used for MpscLinkedQueueNodes that use Recycler
These changes are based on @belliottsmith 's work that was part of #2504.
Result:
Huge increase in performance.
4.0 branch without this commit:
Benchmark (size) Mode Samples Score Score error Units
i.n.m.i.RecyclableArrayListBenchmark.recycleSameThread 00000 thrpt 20 116026994.130 2763381.305 ops/s
i.n.m.i.RecyclableArrayListBenchmark.recycleSameThread 00256 thrpt 20 110823170.627 3007221.464 ops/s
i.n.m.i.RecyclableArrayListBenchmark.recycleSameThread 01024 thrpt 20 118290272.413 7143962.304 ops/s
i.n.m.i.RecyclableArrayListBenchmark.recycleSameThread 04096 thrpt 20 120560396.523 6483323.228 ops/s
i.n.m.i.RecyclableArrayListBenchmark.recycleSameThread 16384 thrpt 20 114726607.428 2960013.108 ops/s
i.n.m.i.RecyclableArrayListBenchmark.recycleSameThread 65536 thrpt 20 119385917.899 3172913.684 ops/s
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 297.617 sec - in io.netty.microbench.internal.RecyclableArrayListBenchmark
4.0 branch with this commit:
Benchmark (size) Mode Samples Score Score error Units
i.n.m.i.RecyclableArrayListBenchmark.recycleSameThread 00000 thrpt 20 204158855.315 5031432.145 ops/s
i.n.m.i.RecyclableArrayListBenchmark.recycleSameThread 00256 thrpt 20 205179685.861 1934137.841 ops/s
i.n.m.i.RecyclableArrayListBenchmark.recycleSameThread 01024 thrpt 20 209906801.437 8007811.254 ops/s
i.n.m.i.RecyclableArrayListBenchmark.recycleSameThread 04096 thrpt 20 214288320.053 6413126.689 ops/s
i.n.m.i.RecyclableArrayListBenchmark.recycleSameThread 16384 thrpt 20 215940902.649 7837706.133 ops/s
i.n.m.i.RecyclableArrayListBenchmark.recycleSameThread 65536 thrpt 20 211141994.206 5017868.542 ops/s
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 297.648 sec - in io.netty.microbench.internal.RecyclableArrayListBenchmark
Motivation:
When Netty runs in a managed environment such as web application server,
Netty needs to provide an explicit way to remove the thread-local
variables it created to prevent class loader leaks.
FastThreadLocal uses different execution paths for storing a
thread-local variable depending on the type of the current thread.
It increases the complexity of thread-local removal.
Modifications:
- Moved FastThreadLocal and FastThreadLocalThread out of the internal
package so that a user can use it.
- FastThreadLocal now keeps track of all thread local variables it has
initialized, and calling FastThreadLocal.removeAll() will remove all
thread-local variables of the caller thread.
- Added FastThreadLocal.size() for diagnostics and tests
- Introduce InternalThreadLocalMap which is a mixture of hard-wired
thread local variable fields and extensible indexed variables
- FastThreadLocal now uses InternalThreadLocalMap to implement a
thread-local variable.
- Added ThreadDeathWatcher.unwatch() so that PooledByteBufAllocator
tells it to stop watching when its thread-local cache has been freed
by FastThreadLocal.removeAll().
- Added FastThreadLocalTest to ensure that removeAll() works
- Added microbenchmark for FastThreadLocal and JDK ThreadLocal
- Upgraded to JMH 0.9
Result:
- A user can remove all thread-local variables Netty created, as long as
he or she did not exit from the current thread. (Note that there's no
way to remove a thread-local variable from outside of the thread.)
- FastThreadLocal exposes more useful operations such as isSet() because
we always implement a thread local variable via InternalThreadLocalMap
instead of falling back to JDK ThreadLocal.
- FastThreadLocalBenchmark shows that this change improves the
performance of FastThreadLocal even more.
Motivation:
Provide a faster ThreadLocal implementation
Modification:
Add a "FastThreadLocal" which uses an EnumMap and a predefined fixed set of possible thread locals (all of the static instances created by netty) that is around 10-20% faster than standard ThreadLocal in my benchmarks (and can be seen having an effect in the direct PooledByteBufAllocator benchmark that uses the DEFAULT ByteBufAllocator which uses this FastThreadLocal, as opposed to normal instantiations that do not, and in the new RecyclableArrayList benchmark);
Result:
Improved performance
Motivation:
'io.netty.recycler.maxCapacity.default' is the only property for recycler's default maximum capacity, so having the 'default' suffix only increases the length of the property name.
Modifications:
Rename "io.netty.recycler.maxCapacity.default" to "io.netty.recycler.maxCapacity"
Result:
Shorter system property name. The future addition of system properties, such as io.netty.recycler.maxCapacity.outboundBuffer, are not confusing either.
Motivation:
- As reported recently [1], Recycler's thread-local object pool has unbounded capacity which is a potential problem.
- It accesses a hash table on each push and pop for debugging purposes. We don't really need it besides debugging Netty itself.
Modifications:
- Introduced the maxCapacity constructor parameter to Recycler. The default default maxCapacity is retrieved from the system property whose default is 256K, which should be plenty for most cases.
- Recycler.Stack.map is now created and accessed only when assertion is enabled for Recycler.
Result:
- Recycler does not grow infinitely anymore.
- If assertion is disabled, Recycler should be much faster.
[1] https://github.com/netty/netty/issues/1841
The API changes made so far turned out to increase the memory footprint
and consumption while our intention was actually decreasing them.
Memory consumption issue:
When there are many connections which does not exchange data frequently,
the old Netty 4 API spent a lot more memory than 3 because it always
allocates per-handler buffer for each connection unless otherwise
explicitly stated by a user. In a usual real world load, a client
doesn't always send requests without pausing, so the idea of having a
buffer whose life cycle if bound to the life cycle of a connection
didn't work as expected.
Memory footprint issue:
The old Netty 4 API decreased overall memory footprint by a great deal
in many cases. It was mainly because the old Netty 4 API did not
allocate a new buffer and event object for each read. Instead, it
created a new buffer for each handler in a pipeline. This works pretty
well as long as the number of handlers in a pipeline is only a few.
However, for a highly modular application with many handlers which
handles connections which lasts for relatively short period, it actually
makes the memory footprint issue much worse.
Changes:
All in all, this is about retaining all the good changes we made in 4 so
far such as better thread model and going back to the way how we dealt
with message events in 3.
To fix the memory consumption/footprint issue mentioned above, we made a
hard decision to break the backward compatibility again with the
following changes:
- Remove MessageBuf
- Merge Buf into ByteBuf
- Merge ChannelInboundByte/MessageHandler and ChannelStateHandler into ChannelInboundHandler
- Similar changes were made to the adapter classes
- Merge ChannelOutboundByte/MessageHandler and ChannelOperationHandler into ChannelOutboundHandler
- Similar changes were made to the adapter classes
- Introduce MessageList which is similar to `MessageEvent` in Netty 3
- Replace inboundBufferUpdated(ctx) with messageReceived(ctx, MessageList)
- Replace flush(ctx, promise) with write(ctx, MessageList, promise)
- Remove ByteToByteEncoder/Decoder/Codec
- Replaced by MessageToByteEncoder<ByteBuf>, ByteToMessageDecoder<ByteBuf>, and ByteMessageCodec<ByteBuf>
- Merge EmbeddedByteChannel and EmbeddedMessageChannel into EmbeddedChannel
- Add SimpleChannelInboundHandler which is sometimes more useful than
ChannelInboundHandlerAdapter
- Bring back Channel.isWritable() from Netty 3
- Add ChannelInboundHandler.channelWritabilityChanges() event
- Add RecvByteBufAllocator configuration property
- Similar to ReceiveBufferSizePredictor in Netty 3
- Some existing configuration properties such as
DatagramChannelConfig.receivePacketSize is gone now.
- Remove suspend/resumeIntermediaryDeallocation() in ByteBuf
This change would have been impossible without @normanmaurer's help. He
fixed, ported, and improved many parts of the changes.