Motivation:
UnpooledUnsafeDirectByteBuf.setBytes(int,ByteBuf,int,int) fails to use fast-path when src uses an array as backing storage. This is because the if else uses the wrong ByteBuf for its check.
Modifications:
- Use correct ByteBuf when check for array as backing storage
- Also eliminate unecessary check in UnpooledDirectByteBuf which always fails anyway
Result:
Faster setBytes(...) when src ByteBuf is backed by an array.
No more IndexOutOfBoundsException or data-corruption.
Motivation:
Provide a faster ThreadLocal implementation
Modification:
Add a "FastThreadLocal" which uses an EnumMap and a predefined fixed set of possible thread locals (all of the static instances created by netty) that is around 10-20% faster than standard ThreadLocal in my benchmarks (and can be seen having an effect in the direct PooledByteBufAllocator benchmark that uses the DEFAULT ByteBufAllocator which uses this FastThreadLocal, as opposed to normal instantiations that do not, and in the new RecyclableArrayList benchmark);
Result:
Improved performance
Motivation:
Our Unsafe*ByteBuf implementation always invert bytes when the native ByteOrder is LITTLE_ENDIAN (this is true on intel), even when the user calls order(ByteOrder.LITTLE_ENDIAN). This is not optimal for performance reasons as the user should be able to set the ByteOrder to LITTLE_ENDIAN and so write bytes without the extra inverting.
Modification:
- Introduce a new special SwappedByteBuf (called UnsafeDirectSwappedByteBuf) that is used by all the Unsafe*ByteBuf implementation and allows to write without inverting the bytes.
- Add benchmark
- Upgrade jmh to 0.8
Result:
The user is be able to get the max performance even on servers that have ByteOrder.LITTLE_ENDIAN as their native ByteOrder.
Motivation:
PooledByteBufAllocator's thread local cache and
ReferenceCountUtil.releaseLater() are in need of a way to run an
arbitrary logic when a certain thread is terminated.
Modifications:
- Add ThreadDeathWatcher, which spawns a low-priority daemon thread
that watches a list of threads periodically (every second) and
invokes the specified tasks when the associated threads are not alive
anymore
- Start-stop logic based on CAS operation proposed by @tea-dragon
- Add debug-level log messages to see if ThreadDeathWatcher works
Result:
- Fixes#2519 because we don't use GlobalEventExecutor anymore
- Cleaner code
Motivation:
If we make allocateRun/SubpageSimple() always try the left node first and make allocateRun/Subpage() always tries the right node first, it is more likely that allocateRun/Subpage() will find a node with ST_UNUSED sooner.
Modifications:
- Make allocateRunSimple() and allocateSubpageSimple() always try the left node first.
- Make allocateRun() and allocateSubpage() always try the right node first.
- Remove randome
Result:
We get the same performance without using random numbers.
Motivation:
We still have a room for improvement in PoolChunk.allocateRun() and
Subpage.allocate().
Modifications:
- Unroll the recursion in PoolChunk.allocateRun()
- Subpage.allocate() makes use of the 'nextAvail' value set by previous
free().
Result:
- PoolChunk.allocateRun() optimization yields 10%+ improvements in
allocation throughput for non-subpage allocations.
- Subpage.allocate() optimization makes the subpage allocations for
tiny buffers as fast as non-tiny buffers even when the pageSize is
huge (e.g. 1048576) because it doesn't need to perform a linear search
in most cases.
Motivation:
4 and 5 were diverged long time ago and we recently reverted some of the
early commits in master. We must make sure 4.1 and master are not very
different now.
Modification:
Fix found differences
Result:
4.1 and master got closer.
Motivation:
PoolArena's 'normalizeCapacity' function was micro-optimized some
time ago to remove a while loop. However, there was a change of
behavior in the function as a result. Capacities passed into it
that are already powers of 2 (and >= 512) are doubled in size. So
if I ask for a buffer with a capacity of 1024, I will get back one
that actually uses 2048 bytes (stored in maxLength).
Aligning to powers of two for book keeping ease is reasonable,
and if someone tries to expand a buffer, you might as well use some
of the previously wasted space. However, since this distinction
between 'easily expanded' and 'costly to expand' space is not
supported at all by the APIs, I cannot imagine this change to
doubling is desirable or intentional.
This is especially costly when using composite buffers. They
frequently allocate components with a capacity that is a power of
2, and they never attempt to expand components themselves. The end
result is that heavy use of pool-backed composite buffers wastes
almost half of the memory pool (the smaller / initial components are
<512 and so are not affected by the off-by-one bug).
Modifications:
Although I find it difficult to believe that such an optimization
is really helpful, I left it in and fixed the off-by-one issue by
decrementing the value at the start.
I also added a simple test to both attempt to verify that the
decrement fixes the issue without introducing any other change, and
to make it easy for a reviewer to test the existing behavior. PoolArena
does not seem to have much testing or testability support though so
the test is kind of a hack and will break for unrelated changes. I
suggest either removing it or factoring out the single non-static
portion of normalizeCapacity so that the fragile dummy PoolArena is
not required.
Result:
Pooled allocators will allocate less resources to the highly
inefficient and undocumented buffer section between length and
maxLength.
Composite buffers of non-trivial size that are backed by pooled
allocators will use about half as much memory.
Motivation:
At the moment we create new ThreadPoolCache whenever a Thread tries either allocate or release something on the PooledByteBufAllocator. When something is released we put it then in its ThreadPoolCache. The problem is we never check if a Thread is not alive anymore and so we may end up with memory that is never freed again if a user create many short living Threads that use the PooledByteBufAllocator.
Modifications:
Periodically check if the Thread is still alive that has a ThreadPoolCache assinged and if not free it.
Result:
Memory is freed up correctly even for short living Threads.
Motivation:
Remove the synchronization bottleneck in PoolArena and so speed up things
Modifications:
This implementation uses kind of the same technics as outlined in the jemalloc paper and jemalloc
blogpost https://www.facebook.com/notes/facebook-engineering/scalable-memory-allocation-using-jemalloc/480222803919.
At the moment we only cache for "known" Threads (that powers EventExecutors) and not for others to keep the overhead
minimal when need to free up unused buffers in the cache and free up cached buffers once the Thread completes. Here
we use multi-level caches for tiny, small and normal allocations. Huge allocations are not cached at all to keep the
memory usage at a sane level. All the different cache configurations can be adjusted via system properties or the constructor
directly where it makes sense.
Result:
Less conditions as most allocations can be served by the cache itself
Motivation:
6e8ba291cf introduced a regression in Android because Android does not have sun.nio.ch.DirectBuffer (see #2330.) I also found PlatformDependent0.freeDirectBuffer() and freeDirectBufferUnsafe() are pretty much same after the commit and the unsafe version should be removed.
Modifications:
- Do not use the pooled allocator in Android because it's too resource hungry for Androids.
- Merge PlatformDependent0.freeDirectBuffer() and freeDirectBufferUnsafe() into one method.
- Make the Unsafe unavailable when sun.nio.ch.DirectBuffer is unavailable. We could keep the Unsafe available and handle the sun.nio.ch.DirectBuffer case separately, but I don't want to complicate our code just because of that. All supported JDK versions have sun.nio.ch.DirectBuffer if the Unsafe is available.
Result:
Simpler code. Fixes Android support (#2330)
Motivation:
I was studying the code and thought this was simpler and easier to
understand.
Modifications:
Replaced the for loop and if conditions, with a simple implementation.
Result:
Code is easier to understand.
Motivation:
When starting with a read-only NIO buffer, wrapping it in a ByteBuf,
and then later retrieving a re-wrapped NIO buffer the limit was getting
too short.
Modifications:
Changed ReadOnlyByteBufferBuf.nioBuffer(int,int) to compute the
limit in the same manner as the internalNioBuffer method.
Result:
Round-trip conversion from NIO to ByteBuf to NIO will work reliably.
- Related: #2163
- Add ResourceLeakHint to allow a user to provide a meaningful information about the leak when touching it
- DefaultChannelHandlerContext now implements ResourceLeakHint to tell where the message is going.
- Cleaner resource leak report by excluding noisy stack trace elements
This implementation does not produce as much GC pressure as CompositeByteBuf and so is prefered,
for writing an array of ByteBufs. Be aware that FixedCompositeByteBuf is readonly.
When using this in a project that make heavy use of CompositeByteBuf for writes we was able to cut
down allocation to a half.
- Fixes#1810
- Add a new interface ChannelId and its default implementation which generates globally unique channel ID.
- Replace AbstractChannel.hashCode with ChannelId.hashCode() and ChannelId.shortValue()
- Add variants of ByteBuf.hexDump() which accept byte[] instead of ByteBuf.
- Remove the reference to ResourceLeak from the buffer implementations
and use wrappers instead:
- SimpleLeakAwareByteBuf and AdvancedLeakAwareByteBuf
- It is now allocator's responsibility to create a leak-aware buffer.
- Added AbstractByteBufAllocator.toLeakAwareBuffer() for easier
implementation
- Add WrappedByteBuf to reduce duplication between *LeakAwareByteBuf and
UnreleasableByteBuf
- Raise the level of leak reports to ERROR - because it will break the
app eventually
- Replace enabled/disabled property with the leak detection level
- Only print stack trace when level is ADVANCED or above to avoid user
confusion
- Add the 'leak' build profile, which enables highly detailed leak
reporting during the build
- Remove ResourceLeakException which is unsed anymore
- Fixes#2003 properly
- Instead of using 'bundle' packaging, use 'jar' packaging. This is
more robust because some strict build tools fail to retrieve the
artifacts from a Maven repository unless their packaging is not 'jar'.
- All artifacts now contain META-INF/io.netty.version.properties, which
provides the detailed information about the build and repository.
- Removed OSGi testsuite temporarily because it gives false errors
during split package test and examination.
- Add io.netty.util.Version for easy retrieval of version information
Beside this it also helps to reduce CPU usage as nioBufferCount() is quite expensive when used on CompositeByteBuf which are
nested and contains a lot of components
that are not assigned to the same EventLoop. In general get* operations should always be safe to be used from different Threads.
This aslo include unit tests that show the issue
This is needed because of otherwise the JDK itself will do an extra ByteBuffer copy with it's own pool implementation. Even worth it will be done
multiple times if the ByteBuffer is always only partial written. With this change the copy is done inside of netty using it's own allocator and
only be done one time in all cases.
- A user can create multiple duplicates of a buffer and access their internal NIO buffers. (e.g. write multiple duplicates to multiple channels assigned to different event loop.) Because the derived buffers' internalNioBuffer() simply delegates the call to the original buffer, all derived buffers and the original buffer's internalNioBuffer() will return the same buffer, which will lead to a race condition.
- Fixes#1739
- 5% improvement in throughput (HelloWorldServer example)
- Made CompositeByteBuf a concrete class (renamed from DefaultCompositeByteBuf) because there's no multiple inheritance in Java
Fixes#1536
- Fixes#1528
It's not really easy to provide a general-purpose abstraction for fast-yet-safe iteration. Instead of making forEachByte() less optimal, let's make it do what it does really well, and allow a user to implement potentially unsafe-yet-fast loop using unsafe operations.
* The problem with the release(..) calls here was that it would have called release on an unsupported message and then throw an exception. This exception will trigger ChannelOutboundBuffer.fail(..), which will also try to release the message again.
* Also use the same exception type for unsupported messages as in other channel impls.
- Related: #1378
- They now accept only one argument.
- A user who wants to use a buffer for more complex use cases, he or she can always access the buffer directly via memoryAddress() and array()
.. by avoiding the overly frequent removal of a subpage from a pool
This change makes sure that the unused subpage is not removed when there's no subpage left in the pool. If the last subpage is removed from the pool, it is very likely that the allocator will create a new subpage very soon again, so it's better not remove it.
- No need to have fine-grained lookup table because the buffer pool has
much more coarse capacities available
- No need to use a loop to normalize a buffer capacity
- Fixes#1445
- Add PlatformDependent.maxDirectMemory()
- Ensure the default number or arenas is decreased if the max memory of the VM is not large enough.
- Related issue: #1397
- Resource leak detection should be turned off and the maxCapacity has to be Integer.MAX_VALUE
- It's technically possible to pool PooledByteBufs with different maxCapacity, which will be addressed in another commit.
The API changes made so far turned out to increase the memory footprint
and consumption while our intention was actually decreasing them.
Memory consumption issue:
When there are many connections which does not exchange data frequently,
the old Netty 4 API spent a lot more memory than 3 because it always
allocates per-handler buffer for each connection unless otherwise
explicitly stated by a user. In a usual real world load, a client
doesn't always send requests without pausing, so the idea of having a
buffer whose life cycle if bound to the life cycle of a connection
didn't work as expected.
Memory footprint issue:
The old Netty 4 API decreased overall memory footprint by a great deal
in many cases. It was mainly because the old Netty 4 API did not
allocate a new buffer and event object for each read. Instead, it
created a new buffer for each handler in a pipeline. This works pretty
well as long as the number of handlers in a pipeline is only a few.
However, for a highly modular application with many handlers which
handles connections which lasts for relatively short period, it actually
makes the memory footprint issue much worse.
Changes:
All in all, this is about retaining all the good changes we made in 4 so
far such as better thread model and going back to the way how we dealt
with message events in 3.
To fix the memory consumption/footprint issue mentioned above, we made a
hard decision to break the backward compatibility again with the
following changes:
- Remove MessageBuf
- Merge Buf into ByteBuf
- Merge ChannelInboundByte/MessageHandler and ChannelStateHandler into ChannelInboundHandler
- Similar changes were made to the adapter classes
- Merge ChannelOutboundByte/MessageHandler and ChannelOperationHandler into ChannelOutboundHandler
- Similar changes were made to the adapter classes
- Introduce MessageList which is similar to `MessageEvent` in Netty 3
- Replace inboundBufferUpdated(ctx) with messageReceived(ctx, MessageList)
- Replace flush(ctx, promise) with write(ctx, MessageList, promise)
- Remove ByteToByteEncoder/Decoder/Codec
- Replaced by MessageToByteEncoder<ByteBuf>, ByteToMessageDecoder<ByteBuf>, and ByteMessageCodec<ByteBuf>
- Merge EmbeddedByteChannel and EmbeddedMessageChannel into EmbeddedChannel
- Add SimpleChannelInboundHandler which is sometimes more useful than
ChannelInboundHandlerAdapter
- Bring back Channel.isWritable() from Netty 3
- Add ChannelInboundHandler.channelWritabilityChanges() event
- Add RecvByteBufAllocator configuration property
- Similar to ReceiveBufferSizePredictor in Netty 3
- Some existing configuration properties such as
DatagramChannelConfig.receivePacketSize is gone now.
- Remove suspend/resumeIntermediaryDeallocation() in ByteBuf
This change would have been impossible without @normanmaurer's help. He
fixed, ported, and improved many parts of the changes.
- Fixes#1364
- Even if a user creates a duplicate/slice, lastAccessed was shared between the derived buffers and it's updated even by a read operation, which made multithread access impossible
- Fixes#1282 (not perfectly, but to the extent it's possible with the current API)
- Add AddressedEnvelope and DefaultAddressedEnvelope
- Make DatagramPacket extend DefaultAddressedEnvelope<ByteBuf, InetSocketAddress>
- Rename ByteBufHolder.data() to content() so that a message can implement both AddressedEnvelope and ByteBufHolder (DatagramPacket does) without introducing two getter methods for the content
- Datagram channel implementations now understand ByteBuf and ByteBufHolder as a message with unspecified remote address.
- Fixes#1315
If a user specifies the arena size of 0, the pool is now disabled
instead of raising an IllegalArgumentException. Using this, you can
disable only heap or direct buffer pool easily. Once disabled,
PooledByteBufAllocator will delegate the allocation request to
UnpooledByteBufAllocator.
- Fixes#1229
- Primarily written by @normanmaurer and revised by @trustin
This commit removes the notion of unfolding from the codec framework
completely. Unfolding was introduced in Netty 3.x to work around the
shortcoming of the codec framework where encode() and decode() did not
allow generating multiple messages.
Such a shortcoming can be fixed by changing the signature of encode()
and decode() instead of introducing an obscure workaround like
unfolding. Therefore, we changed the signature of them in 4.0.
The change is simple, but backward-incompatible. encode() and decode()
do not return anything. Instead, the codec framework will pass a
MessageBuf<Object> so encode() and decode() can add the generated
messages into the MessageBuf.
This pull request provides a framework for exchanging a very large
stream between handlers, typically between a decoder and an inbound
handler (or between a handler that writes a message and an encoder that
encodes that message).
For example, an HTTP decoder, previously, generates multiple
micro-messages to decode an HTTP message (i.e. HttpRequest +
HttpChunks). With the streaming API, The HTTP decoder can simply
generate a single HTTP message whose content is a Stream. And then the
inbound handler can consume the Stream via the buffer you created when
you begin to read the stream. If you create a buffer whose capacity is
bounded, you can handle a very large stream without allocating a lot of
memory. If you just want to wait until the whole content is ready, you
can also do that with an unbounded buffer.
The streaming API also supports a limited form of communication between
a producer (i.e. decoder) and a consumer. A producer can abort the
stream if the stream is not valid anymore. A consumer can choose to
reject or discard the stream, where rejection is for unrecoverable
failure and discard is for recoverable failure.
P.S. Special thanks to @jpinner for the initial input.
- Make PoolSubpage a linked list node in the pool
- Now that a subpage is added to and removed from the pool correctly, allocating a subpage from the pool became vastly simpler.
- Rename directbyDefault to preferDirect
- Add a system property 'io.netty.prederDirect' to allow a user from changing the preference on launch-time
- Merge UnpooledByteBufAllocator.DEFAULT_BY_* to DEFAULT
- Rename ChannelHandlerAdapter to ChannelDuplexHandler
- Add ChannelHandlerAdapter that implements only ChannelHandler
- Rename CombinedChannelHandler to CombinedChannelDuplexHandler and
improve runtime validation
- Remove ChannelInbound/OutboundHandlerAdapter which are not useful
- Make ChannelOutboundByteHandlerAdapter similar to
ChannelInboundByteHandlerAdapter
- Make the tail and head handler of DefaultChannelPipeline accept both
bytes and messages. ChannelHandlerContext.hasNext*() were removed
because they always return true now.
- Removed various unnecessary null checks.
- Correct method/field names:
inboundBufferSuspended -> channelReadSuspended
- Move common methods from ByteBuf to Buf
- Rename ensureWritableBytes() to ensureWritable()
- Rename readable() to isReadable()
- Rename writable() to isWritable()
- Add isReadable(int) and isWritable(int)
- Add AbstractMessageBuf
- Rewrite DefaultMessageBuf and QueueBackedMessageBuf
- based on Josh Bloch's public domain ArrayDeque impl
- Rename message types for clarity
- HttpMessage -> FullHttpMessage
- HttpHeader -> HttpMessage
- HttpRequest -> FullHttpRequest
- HttpResponse -> FulllHttpResponse
- HttpRequestHeader -> HttpRequest
- HttpResponseHeader -> HttpResponse
- HttpContent now extends ByteBufHolder; no more content() method
- Make HttpHeaders abstract, make its header access methods public, and
add DefaultHttpHeaders
- Header accessor methods in HttpMessage and LastHttpContent are
replaced with HttpMessage.headers() and
LastHttpContent.trailingHeaders(). Both methods return HttpHeaders.
- Remove setters wherever possible and remove 'get' prefix
- Instead of calling setContent(), a user can either specify the content
when constructing a message or write content into the buffer.
(e.g. m.content().writeBytes(...))
- Overall cleanup & fixes
Now that we are going to use buffer pooling by default, it is obvious
that a user will forget to call .free() and report memory leak. In this
case, we should have a tool to determine if it is a bug in our allocator
implementation or in the user's code.
This pull request adds a system property flag called
'io.netty.resourceLeakDetection'. If set, when a user forgets to call
.free(), the ResourceLeakDetector will detect it and log a message with
detailed stack trace to tell where the leaked buffer has been allocated.
Because obtaining stack trace is an expensive operation, I used sampling
technique. Allocation is recorded only for every 113th allocation. I
chose 113 because it's a prime number.
In production, a user might not want to enable this option due to
potential performance impact. If a user does not specify the
'-Dio.netty.resourceLeakDetection' option leak detection is disabled.
Even if the leak detection is enabled, the overhead should be less than
5% because only ~1% of allocations are monitored.
I also replaced SharedResourceMisuseDetector with ResourceLeakDetector.
- Add PooledUnsafeDirectByteBuf, a variant of PooledDirectByteBuf, which
accesses its underlying direct ByteBuffer using sun.misc.Unsafe.
- To decouple Netty from sun.misc.*, sun.misc.Unsafe is accessed via
PlatformDependent.
- This change solely introduces about 8+% improvement in direct memory
access according to the tests conducted as described in #918
- Rename capacity variables to reqCapacity or normCapacity to distinguish if its the request capacity or the normalized capacity
- Do not reallocate on ByteBuf.capacity(int) if reallocation is unnecessary; just update the index range.
- Revert the workaround in DefaultChannelHandlerContext