Motivation:
The SingleThreadEventLoopTest allocated up to half a gigabyte of memory per run,
causing up to 50 GC runs in ~20 seconds.
Modification:
Charlie Hunt identified TLAB allocations to be the root cause of this excessive
memory usage [1]. By reusing the Executor in every test we can reduce the
memory usage by >50%.
Result:
Lower Memory Usage and fewer Garbage Collector runs.
Helps to resolve GitHub issue #2841.
[1] https://twitter.com/charlesjhunt/status/505351389317722112
Motivation:
Sometimes the SingleThreadEventLoopTest did fail on our CI. This was because of GC pressure produced by Thread.sleep(...) when interrupted as it creates a new InterruptedException all the time (and need to fill it).
Modifications:
Replace Thread.sleep(...) with LockSupport.parkNanos(...) to eliminate exception overhead.
Result:
SingleThreadEventLoopTest produce a lot less garbage.
Motivation:
To make it more easy to shutdown an EventExecutorGroup / EventLoopGroup we should let both of them extend AutoCloseable.
Modifications:
Let EventExecutorGroup extend AutoCloseable and impement it.
Result:
Easier shutdown of EventExecutorGroup and EventLoopGroup
Motivation:
Our last-minute change that made AbstractChannelHandlerContext implement PausableChannelEventExecutor broke the socksproxy example.
Modifications:
AbstractChannelHandlerContext does not inherit from PausableChannelEventExecutor anymore. Instead we'll allocate an extra object on demand.
Result:
AbstractChannelHandlerContext.executor().newPromise() returns the correct type.
Motivation:
We used the wrong EventExecutor to notify for bind failures if a late registration was done.
Modifications:
Use the correct EventExecutor to notify and only use the GlobelEventExecutor if the registration fails itself.
Result:
The correct Thread will do the notification.
When a ChannelOutboundBuffer contains ByteBufs followed by a FileRegion,
removeBytes() will fail with a ClassCastException. It should break the
loop instead.
f31c630c8c was causing
SocketGatheringWriteTest to fail because it does not take the case where
an empty buffer exists in a gathering write.
When there is an empty buffer in a gathering write, the number of
buffers returned by ChannelOutboundBuffer.nioBuffer() and the actual
number of write attemps can differ.
To remove the write requests correctly, a byte transport must use
ChannelOutboundBuffer.removeBytes()
Motivation:
Because of an incorrect logic in teh EmbeddedChannel constructor it is not possible to use EmbeddedChannel with a ChannelInitializer as constructor argument. This is because it adds the internal LastInboundHandler to its ChannelPipeline before it register itself to the EventLoop.
Modifications:
First register self to EventLoop before add LastInboundHandler to the ChannelPipeline.
Result:
It's now possible to use EmbeddedChannel with ChannelInitializer.
Motivation:
Due a regression NioSocketChannel.doWrite(...) will throw a ClassCastException if you do something like:
channel.write(bytebuf);
channel.write(fileregion);
channel.flush();
Modifications:
Correctly handle writing of different message types by using the correct message count while loop over them.
Result:
No more ClassCastException
Motivation:
The previous fix did disable the caching of ByteBuffers completely which can cause performance regressions. This fix makes sure we use nioBuffers() for all writes in NioSocketChannel and so prevent data-corruptions. This is still kind of a workaround which will be replaced by a more fundamental fix later.
Modifications:
- Revert 4059c9f354
- Use nioBuffers() for all writes to prevent data-corruption
Result:
No more data-corruption but still retain the original speed.
Motivation:
At the moment we expand the ByteBuffer[] when we have more then 1024 ByteBuffer to write and replace the stored instance in its FastThreadLocal. This is not needed and may even harm performance on linux as IOV_MAX is 1024 and so this may cause the JVM to do an array copy.
Modifications:
Just exit the nioBuffers() method if we can not fit more ByteBuffer in the array. This way we will pick them up on the next call.
Result:
Remove uncessary array copy and simplify the code.
Motivation:
We cache the ByteBuffers in ChannelOutboundBuffer.nioBuffers() for the Entries in the ChannelOutboundBuffer to reduce some overhead. The problem is this can lead to data-corruption if an incomplete write happens and next time we try to do a non-gathering write.
To fix this we should remove the caching which does not help a lot anyway and just make the code buggy.
Modifications:
Remove the caching of ByteBuffers.
Result:
No more data-corruption.
Motivation:
At the moment it's only possible for a user to set the RecvByteBufAllocator for a Channel but not access the Handle once it is assigned. This makes it hard to write more flexible implementations.
Modifications:
Add a new method to the Channel.Unsafe to allow access the the used Handle for the Channel. The RecvByteBufAllocator.Handle is created lazily.
Result:
It's possible to write more flexible implementatons that allow to adjust stuff on the fly for a Handle that is used by a Channel
Motivation:
Sometimes ChannelHandler need to queue writes to some point and then process these. We currently have no datastructure for this so the user will use an Queue or something like this. The problem is with this Channel.isWritable() will not work as expected and so the user risk to write to fast. That's exactly what happened in our SslHandler. For this purpose we need to add a special datastructure which will also take care of update the Channel and so be sure that Channel.isWritable() works as expected.
Modifications:
- Add PendingWriteQueue which can be used for this purpose
- Make use of PendingWriteQueue in SslHandler
Result:
It is now possible to queue writes in a ChannelHandler and still have Channel.isWritable() working as expected. This also fixes#2752.
Motivation:
After a channel is created it's usually assigned to an
EventLoop. During the lifetime of a Channel the
EventLoop is then responsible for processing all I/O
and compute tasks of the Channel.
For various reasons (e.g. load balancing) a user might
require the ability for a Channel to be assigned to
another EventLoop during its lifetime.
Modifications:
Introduce under the hood changes that ensure that Netty's
thread model is obeyed during and after the deregistration
of a channel.
Ensure that tasks (one time and periodic) are executed by
the right EventLoop at all times.
Result:
A Channel can be deregistered from one and re-registered with
another EventLoop.
Related issue: #2250
Motivation:
Prior to this commit, Netty's non blocking EventLoops
were each assigned a fixed thread by which all of the
EventLoop's I/O and handler logic would be performed.
While this is a fine approach for most users of Netty,
some advanced users require more flexibility in
scheduling the EventLoops.
Modifications:
Remove all direct usages of threads in MultithreadEventExecutorGroup,
SingleThreadEventExecutor et al., and introduce an Executor
abstraction instead.
The way to think about this change is, that each
iteration of an eventloop is now a task that gets scheduled
in a ForkJoinPool.
While the ForkJoinPool is the default, one also has the
ability to plug in his/her own Executor (aka thread pool)
into a EventLoop(Group).
Result:
Netty hands off thread management to a ForkJoinPool by default.
Users can also provide their own thread pool implementation and
get some control over scheduling Netty's EventLoops
Motivation:
We did various changes related to the ChannelOutboundBuffer in 4.0 branch. This commit port all of them over and so make sure our branches are synced in terms of these changes.
Related to [#2734], [#2709], [#2729], [#2710] and [#2693] .
Modification:
Port all changes that was done on the ChannelOutboundBuffer.
This includes the port of the following commits:
- 73dfd7c01b
- 997d8c32d2
- e282e504f1
- 5e5d1a58fd
- 8ee3575e72
- d6f0d12a86
- 16e50765d1
- 3f3e66c31a
Result:
- Less memory usage by ChannelOutboundBuffer
- Same code as in 4.0 branch
- Make it possible to use ChannelOutboundBuffer with Channel implementation that not extends AbstractChannel
Motivation:
The PID_MAX_LIMIT on 64bit linux systems is 4194304 and on osx it is 99998. At the moment we use 65535 as an upper-limit which is too small.
Modifications:
Use 4194304 as max possible value
Result:
No more false-positives when try to detect current pid.
Motivation:
We have some inconsistency when handling writes. Sometimes we call ChannelOutboundBuffer.progress(...) also for complete writes and sometimes not. We should call it always.
Modifications:
Correctly call ChannelOuboundBuffer.progress(...) for complete and incomplete writes.
Result:
Consistent behavior
Motivation:
While benchmarking the native transport with gathering writes I noticed that it is quite slow. This is due the fact that we need to do a lot of array copies to get the buffers into the iov array.
Modification:
Introduce a new class calles IovArray which allows to fill buffers directly in a iov array that can be passed over to JNI without any array copies. This gives a nice optimization in terms of speed when doing gathering writes.
Result:
Big performance improvement when doing gathering writes. See the included benchmark...
Before:
[nmaurer@xxx]~% wrk/wrk -H 'Host: localhost' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' -H 'Connection: keep-alive' -d 120 -c 256 -t 16 --pipeline 256 http://xxx:8080/plaintext
Running 2m test @ http://xxx:8080/plaintext
16 threads and 256 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 23.44ms 16.37ms 259.57ms 91.77%
Req/Sec 181.99k 31.69k 304.60k 78.12%
346544071 requests in 2.00m, 46.48GB read
Requests/sec: 2887885.09
Transfer/sec: 396.59MB
With this change:
[nmaurer@xxx]~% wrk/wrk -H 'Host: localhost' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' -H 'Connection: keep-alive' -d 120 -c 256 -t 16 --pipeline 256 http://xxx:8080/plaintext
Running 2m test @ http://xxx:8080/plaintext
16 threads and 256 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 21.93ms 16.33ms 305.73ms 92.34%
Req/Sec 194.56k 33.75k 309.33k 77.04%
369617503 requests in 2.00m, 49.57GB read
Requests/sec: 3080169.65
Transfer/sec: 423.00MB
Motivation:
ChannelOutboundBuffer is basically a circular array queue of its entry
objects. Once an entry is created in the array, it is never nulled out
to reduce the allocation cost.
However, because it is a circular queue, the array almost always ends up
with as many entry instances as the size of the array, regardless of the
number of pending writes.
At worst case, a channel might have only 1 pending writes at maximum
while creating 32 entry objects, where 32 is the initial capacity of the
array.
Modifications:
- Reduce the initial capacity of the circular array queue to 4.
- Make the initial capacity of the circular array queue configurable
Result:
We spend 4 times less memory for entry objects under certain
circumstances.
Motivation:
At the moment NioSocketChannelOutboundBuffer.nioBuffers() / EpollSocketChannelOutboundBuffer.memoryAddresses() returns null if something is contained in the ChannelOutboundBuffer which is not a ByteBuf. This is a problem for two reasons:
1 - In the javadocs we state that it will never return null
2 - We may do a not optimal write as there may be things that could be written via gathering writes
Modifications:
Change NioSocketChannelOutboundBuffer.nioBuffers() / EpollSocketChannelOutboundBuffer.memoryAddresses() to never return null but have it contain all ByteBuffer that were found before the non ByteBuf. This way we can do a gathering write and also conform to the javadocs.
Result:
Better speed and also correct implementation in terms of the api.
Motivation:
Now Netty has a few problems with null values.
Modifications:
- Check HAProxyProxiedProtocol in HAProxyMessage constructor and throw NPE if it is null.
If HAProxyProxiedProtocol is null we will set AddressFamily as null. So we will get NPE inside checkAddress(String, AddressFamily) and it won't be easy to understand why addrFamily is null.
- Check File in DiskFileUpload.toString().
If File is null we will get NPE when calling toString() method.
- Check Result<String> in MqttDecoder.decodeConnectionPayload(...).
If !mqttConnectVariableHeader.isWillFlag() || !mqttConnectVariableHeader.hasUserName() || !mqttConnectVariableHeader.hasPassword() we will get NPE when we will try to create new instance of MqttConnectPayload.
- Check Unsafe before calling unsafe.getClass() in PlatformDependent0 static block.
- Removed unnecessary null check in WebSocket08FrameEncoder.encode(...).
Because msg.content() can not return null.
- Removed unnecessary null check in DefaultStompFrame(StompCommand) constructor.
Because we have this check in the super class.
- Removed unnecessary null checks in ConcurrentHashMapV8.removeTreeNode(TreeNode<K,V>).
- Removed unnecessary null check in OioDatagramChannel.doReadMessages(List<Object>).
Because tmpPacket.getSocketAddress() always returns new SocketAddress instance.
- Removed unnecessary null check in OioServerSocketChannel.doReadMessages(List<Object>).
Because socket.accept() always returns new Socket instance.
- Pass Unpooled.buffer(0) instead of null inside CloseWebSocketFrame(boolean, int) constructor.
If we will pass null we will get NPE in super class constructor.
- Added throw new IllegalStateException in GlobalEventExecutor.awaitInactivity(long, TimeUnit) if it will be called before GlobalEventExecutor.execute(Runnable).
Because now we will get NPE. IllegalStateException will be better in this case.
- Fixed null check in OpenSslServerContext.setTicketKeys(byte[]).
Now we throw new NPE if byte[] is not null.
Result:
Added new null checks when it is necessary, removed unnecessary null checks and fixed some NPE problems.
Motivation:
Fix some typos in Netty.
Modifications:
- Fix potentially dangerous use of non-short-circuit logic in Recycler.transfer(Stack<?>).
- Removed double 'the the' in javadoc of EmbeddedChannel.
- Write to log an exception message if we can not get SOMAXCONN in the NetUtil's static block.
Motivation:
As a DatagramChannel supports to write to multiple remote peers we must not close the Channel once a IOException accours as this error may be only valid for one remote peer.
Modification:
Continue writing on IOException.
Result:
DatagramChannel can be used even after an IOException accours during writing.
Motivation:
Because of a missing return statement we may produce a NPE when try to fullfill the connect ChannelPromise when it was fullfilled before.
Modification:
Add missing return statement.
Result:
No more NPE.
Motivation:
When system is in short of entrophy, the initialization of
ThreadLocalRandom can take at most 3 seconds. The initialization occurs
when ThreadLocalRandom.current() is invoked first time, which might be
much later than the moment when the application has started. If we
start the initialization of ThreadLocalRandom as early as possible, we
can reduce the perceived time taken for the retrieval.
Modification:
Begin the initialization of ThreadLocalRandom in InternalLoggerFactory,
potentially one of the firstly initialized class in a Netty application.
Make DefaultChannelId retrieve the current process ID before retrieving
the current machine ID, because retrieval of a machine ID is more likely
to use ThreadLocalRandom.current().
Use a dummy channel ID for EmbeddedChannel, which prevents many unit
tests from creating a ThreadLocalRandom instance.
Result:
We gain extra 100ms at minimum for initialSeedUniquifier generation. If
an application has its own initialization that takes long enough time
and generates good amount of entrophy, it is very likely that we will
gain a lot more.
Motivation:
When a bind fails AbstractBootstrap will use the GlobalEventExecutor to notify the ChannelPromise. We should use the EventLoop of the Channel if possible.
Modification:
Use EventLoop of the Channel if possible to use the correct Thread to notify and so guaranteer the right order of events.
Result:
Use the correct EventLoop for notification
Motivation:
There is no way for a ChannelHandler to check if the passed in ChannelPromise for a write(...) call is a VoidChannelPromise. This is a problem as some handlers need to add listeners to the ChannelPromise which is not possible in the case of a VoidChannelPromise.
Modification:
- Introduce ChannelFuture.isVoid() which will return true if it is not possible to add listeners or wait on the result.
- Add ChannelPromise.unvoid() which allows to create a ChannelFuture out of a void ChannelFuture which supports all the operations.
Result:
It's now easy to write ChannelHandler implementations which also works when a void ChannelPromise is used.
Motivation:
We use the nanoTime of the scheduledTasks to calculate the milli-seconds to wait for a select operation to select something. Once these elapsed we check if there was something selected or some task is ready for processing. Unfortunally we not take into account scheduled tasks here so the selection loop will continue if only scheduled tasks are ready for processing. This will delay the execution of these tasks.
Modification:
- Check if a scheduled task is ready after selecting
- also make a tiny change in NioEventLoop to not trigger a rebuild if nothing was selected because the timeout was reached a few times in a row.
Result:
Execute scheduled tasks on time.
Motivation:
When a select rebuild was triggered the reference to the SelectionKey is not updated in AbstractNioChannel. This will cause a CancelledKeyException later.
Modification:
Correctly update SelectionKey reference after rebuild
Result:
Fix exception
Motivation:
Recycler is used in many places to reduce GC-pressure but is still not as fast as possible because of the internal datastructures used.
Modification:
- Rewrite Recycler to use a WeakOrderQueue which makes minimal guaranteer about order and visibility for max performance.
- Recycling of the same object multiple times without acquire it will fail.
- Introduce a RecyclableMpscLinkedQueueNode which can be used for MpscLinkedQueueNodes that use Recycler
These changes are based on @belliottsmith 's work that was part of #2504.
Result:
Huge increase in performance.
4.0 branch without this commit:
Benchmark (size) Mode Samples Score Score error Units
i.n.m.i.RecyclableArrayListBenchmark.recycleSameThread 00000 thrpt 20 116026994.130 2763381.305 ops/s
i.n.m.i.RecyclableArrayListBenchmark.recycleSameThread 00256 thrpt 20 110823170.627 3007221.464 ops/s
i.n.m.i.RecyclableArrayListBenchmark.recycleSameThread 01024 thrpt 20 118290272.413 7143962.304 ops/s
i.n.m.i.RecyclableArrayListBenchmark.recycleSameThread 04096 thrpt 20 120560396.523 6483323.228 ops/s
i.n.m.i.RecyclableArrayListBenchmark.recycleSameThread 16384 thrpt 20 114726607.428 2960013.108 ops/s
i.n.m.i.RecyclableArrayListBenchmark.recycleSameThread 65536 thrpt 20 119385917.899 3172913.684 ops/s
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 297.617 sec - in io.netty.microbench.internal.RecyclableArrayListBenchmark
4.0 branch with this commit:
Benchmark (size) Mode Samples Score Score error Units
i.n.m.i.RecyclableArrayListBenchmark.recycleSameThread 00000 thrpt 20 204158855.315 5031432.145 ops/s
i.n.m.i.RecyclableArrayListBenchmark.recycleSameThread 00256 thrpt 20 205179685.861 1934137.841 ops/s
i.n.m.i.RecyclableArrayListBenchmark.recycleSameThread 01024 thrpt 20 209906801.437 8007811.254 ops/s
i.n.m.i.RecyclableArrayListBenchmark.recycleSameThread 04096 thrpt 20 214288320.053 6413126.689 ops/s
i.n.m.i.RecyclableArrayListBenchmark.recycleSameThread 16384 thrpt 20 215940902.649 7837706.133 ops/s
i.n.m.i.RecyclableArrayListBenchmark.recycleSameThread 65536 thrpt 20 211141994.206 5017868.542 ops/s
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 297.648 sec - in io.netty.microbench.internal.RecyclableArrayListBenchmark
Motivation:
LocalServerChannel.doClose() calls LocalChannelRegistry.unregister(localAddress); without check if localAddress is null and so produce a NPE when pass null the used ConcurrentHashMapV8
Modification:
Check for localAddress != null before try to remove it from Map. Also added a unit test which showed the stacktrace of the error.
Result:
No more NPE during doClose().
Motivation:
At the moment AbstractBoostrap.bind(...) will always use the GlobalEventExecutor to notify the returned ChannelFuture if the registration is not done yet. This should only be done if the registration fails later. If it completes successful we should just notify with the EventLoop of the Channel.
Modification:
Use EventLoop of the Channel if possible to use the correct Thread to notify and so guaranteer the right order of events.
Result:
Use the correct EventLoop for notification
Motivation:
When Netty runs in a managed environment such as web application server,
Netty needs to provide an explicit way to remove the thread-local
variables it created to prevent class loader leaks.
FastThreadLocal uses different execution paths for storing a
thread-local variable depending on the type of the current thread.
It increases the complexity of thread-local removal.
Modifications:
- Moved FastThreadLocal and FastThreadLocalThread out of the internal
package so that a user can use it.
- FastThreadLocal now keeps track of all thread local variables it has
initialized, and calling FastThreadLocal.removeAll() will remove all
thread-local variables of the caller thread.
- Added FastThreadLocal.size() for diagnostics and tests
- Introduce InternalThreadLocalMap which is a mixture of hard-wired
thread local variable fields and extensible indexed variables
- FastThreadLocal now uses InternalThreadLocalMap to implement a
thread-local variable.
- Added ThreadDeathWatcher.unwatch() so that PooledByteBufAllocator
tells it to stop watching when its thread-local cache has been freed
by FastThreadLocal.removeAll().
- Added FastThreadLocalTest to ensure that removeAll() works
- Added microbenchmark for FastThreadLocal and JDK ThreadLocal
- Upgraded to JMH 0.9
Result:
- A user can remove all thread-local variables Netty created, as long as
he or she did not exit from the current thread. (Note that there's no
way to remove a thread-local variable from outside of the thread.)
- FastThreadLocal exposes more useful operations such as isSet() because
we always implement a thread local variable via InternalThreadLocalMap
instead of falling back to JDK ThreadLocal.
- FastThreadLocalBenchmark shows that this change improves the
performance of FastThreadLocal even more.
Motivation:
The code in ChannelOutboundBuffer can be simplified by using AtomicLongFieldUpdater.addAndGet(...)
Modification:
Replace our manual looping with AtomicLongFieldUpdater.addAndGet(...)
Result:
Cleaner code
Motivation:
If ChannelOutboundBuffer.addFlush() is called multiple times and flushed != unflushed it will still loop through all entries that are not flushed yet even if it is not needed anymore as these were marked uncancellable before.
Modifications:
Check if new messages were added since addFlush() was called and only if this was the case loop through all entries and try to mark the uncancellable.
Result:
Less overhead when ChannelOuboundBuffer.addFlush() is called multiple times without new messages been added.
Motivation:
Provide a faster ThreadLocal implementation
Modification:
Add a "FastThreadLocal" which uses an EnumMap and a predefined fixed set of possible thread locals (all of the static instances created by netty) that is around 10-20% faster than standard ThreadLocal in my benchmarks (and can be seen having an effect in the direct PooledByteBufAllocator benchmark that uses the DEFAULT ByteBufAllocator which uses this FastThreadLocal, as opposed to normal instantiations that do not, and in the new RecyclableArrayList benchmark);
Result:
Improved performance
Motivation:
Each of DefaultChannelPipeline instance creates an head and tail that wraps a handler. These are used to chain together other DefaultChannelHandlerContext that are created once a new ChannelHandler is added. There are a few things here that can be improved in terms of memory usage and initialization time.
Modification:
- Only generate the name for the tail and head one time as it will never change anyway
- Only compute the skipFlags for the tail and head one time as it will never change
- Rename DefaultChannelHandlerContext to AbstractChannelHandlerContext and make it abstract
- Create a new DefaultChannelHandlerContext that is used when a ChannelHandler is added to the DefaultChannelPipeline
- Rename TailHandler to TailContext and HeadHandler to HeadContext and let them extend AbstractChannelHandlerContext. This way we can save 2 object creations per DefaultChannelPipeline
Result:
- Less memory usage because we have 2 less objects per DefaultChannelPipeline
- Faster creation of DefaultChannelPipeline as we not need to compute the skipFlags and not need to generate the name for the head and tail
Motivation:
At the moment ChannelFlushPromiseNotifier.add(....) takes an int value for pendingDataSize, which may be too small as a user may need to use a long. This can for example be useful when a user writes a FileRegion etc. Beside this the notify* method names are kind of missleading as these should not contain *Future* because it is about ChannelPromises.
Modification:
Alter add(...) method to take a long for pendingDataSize. Rename all *Future* methods to have *Promise* in the method name to better reflect usage.
Result:
ChannelFlushPromiseNotifier can be used with bigger data.