Motiviation:
Sometimes it is useful to dump the status of the PooledByteBufAllocator and log it. Doing this is currently a bit cumbersome as the user needs to basically iterate through all the metrics and compose the String. we would better provide an easy way to do this.
Modification:
Add dumpStats() method.
Result:
Easier to get a view into the status of the allocator.
Motivation:
PoolChunkList.allocate(...) should return false without the need to walk all the contained PoolChunks when the requested capacity is larger then the capacity that can be allocated out of the PoolChunks in respect to the minUsage() and maxUsage() of the PoolChunkList.
Modifications:
Precompute the maximal capacity that can be allocated out of the PoolChunks that are contained in the PoolChunkList and use this to fast return from the allocate(...) method if an allocation capacity larger then that is requested.
Result:
Faster detection of allocations that can not be handled by the PoolChunkList and so faster allocations in general via the PoolArena.
Motivation:
To better understand how much memory is used by Netty for ByteBufs it is useful to understand how many bytes are currently active (allocated) per PoolArena.
Modifications:
- Add PoolArenaMetric.numActiveBytes()
Result:
The user is able to get better insight into the PooledByteBufAllocator.
Motivation:
To make it easier to understand PoolChunk and PoolArena we should cleanup duplicated code.
Modifications:
- Move reused code into methods
- Use Math.max(...)
Result:
Cleaner code and easier to understand.
Motivation:
When doing a normal allocation in PoolArena we also tried to allocate out of the PoolChunkList that only contains completely full PoolChunks. This makes no sense as these are full anyway so an allocation will never work here and just gives a perf hit as we need to walk the whole list of PoolChunks in the list.
Modifications:
Not try to allocate from PoolChunkList that only contains full PoolChunks
Result:
Faster allocation times when a new PoolChunk must be created.
Motivation:
We should better use Math utilities as these are intrinsics. This is a cleanup for ea3ffb8536.
Modifications:
Use Math utilities.
Result:
Cleaner code and use of intrinsics.
Motivation:
When a PoolChunk needs to get moved to the previous PoolChunkList because of the minUsage / maxUsage constraints we always just moved it one level which is incorrect and so could lead to have PoolChunks in the wrong PoolChunkList (in respect to their minUsage / maxUsage settings). This then could have the effect that PoolChunks are not released / freed in a timely fashion and so.
Modifications:
- Correctly move PoolChunks between PoolChunkLists, which includes moving it multiple "levels".
- Add unit test
Result:
Correctlty move the PoolChunk to PoolChunkList when it is freed, even if its multiple layers.
Motivation:
The PoolChunkList.minUsage() and maxUsage() needs to take special action to translate Integer.MIN_VALUE / MAX_VALUE as these are used internal for tail and head of the linked-list structure.
Modifications:
- Correct the minUsage() and maxUsage() methods.
- Add unit test.
Result:
Correct metrics
Motivation:
Sometimes it is useful to allow to disable the leak detection of buffers if the UnpooledByteBufAllocator is used. This is for example true if the app wants to leak buffers into user code but not want to put the burden on the user to always release the buffer.
Modifications:
Add another constructor to UnpooledByteBufAllocator that allows to completely disable leak-detection for all buffers that are allocator out of the UnpooledByteBufAllocator.
Result:
It's possible to disable leak-detection when the UnpooledByteBufAllocator is used.
Motivation:
We should only increment the metric for the huge / normal allocation after it is done. Also we should only decrement once deallocate.
Modifications:
- Move increment after the allocation.
- Fix deallocation metric and move it after deallocation
Result:
More correct metrics.
Motivation:
PoolThreadCache includes the wrong value when throwing a IllegalArgumentException because of freeSweepAllocationThreshold.
Modifications:
Use the correct value.
Result:
Correct exception message.
Motivation:
The method setBytes creates temporary heap buffer when source buffer is read-only.
But this temporary buffer is not used correctly and may lead to data corruption.
This problem occurs when target buffer is pooled and temporary buffer
arrayOffset() is not zero.
Modifications:
Use correct arrayOffset when calling PlatformDependent.copyMemory.
Unit test was added to test this case.
Result:
Setting buffer content works correctly when target is pooled buffer and source
is read-only ByteBuffer.
Motivation:
We also need to add synchronization when access fields to ensure we see the latest updates.
Modifications:
Add synchronization when read fields that are written concurrently.
Result:
Ensure correct visibility of updated.
Motivation:
See #1811
Modifications:
Add LineEncoder and LineSeparator
Result:
The user can use LineEncoder to write a String with a line separator automatically
Motivation:
We had some double spacing in the methods which should be removed to keep things consistent.
Modifications:
Remove redundant spaces.
Result:
Cleaner / consistent coding style.
Motivation:
My previous commit b88a980482 introduced a flawed unit test,
that executes an assertion in a different thread than the test thread.
If this assertion fails, the test doesn't fail.
Modifications:
Replace the assertion by a proper workaround.
Result:
More correct unit test
Motivation:
Circular assignment of arenas to thread caches can lead to less than optimal
mappings in cases where threads are (frequently) shutdown and started.
Example Scenario:
There are a total of 2 arenas. The first two threads performing an allocation
would lead to the following mapping:
Thread 0 -> Arena 0
Thread 1 -> Arena 1
Now, assume Thread 1 is shut down and another Thread 2 is started. The current
circular assignment algorithm would lead to the following mapping:
Thread 0 -> Arena 0
Thread 2 -> Arena 0
Ideally, we want Thread 2 to use Arena 1 though.
Presumably, this is not much of an issue for most Netty applications that do all
the allocations inside the eventloop, as eventloop threads are seldomly shut down
and restarted. However, applications that only use the netty-buffer package
or implement their own threading model outside the eventloop might suffer from
increased contention. For example, gRPC Java when using the blocking stub
performs some allocations outside the eventloop and within its own thread pool
that is dynamically sized depending on system load.
Modifications:
Implement a linear scan algorithm that assigns a new thread cache to the arena
that currently backs the fewest thread caches.
Result:
Closer to ideal mappings between thread caches and arenas. In order to always
get an ideal mapping, we would have to re-balance the mapping whenever a thread
dies. However, that's difficult because of deallocation.
Motivation:
The statistic counters PoolArena.(allocationsTiny|allocationsSmall) are
not protected by a per arena lock, but by a per size class lock. Thus,
two concurrent allocations of different size (class) could lead to a
race and ultimately to wrong statistics.
Modifications:
Use a thread-safe LongCounter instead of a plain long data type.
Result:
Fewer data races.
Motivation:
See #3321
Modifications:
1. Add CharsetUtil.encoder/decoder() methods
2. Deprecate CharsetUtil.getEncoder/getDecoder() methods
Result:
Users can use new CharsetUtil.encoder/decoder() to specify error actions
Motivation:
Utility methods in ByteBufUtil to writeUtf8 and writeAscii expect a buffer to already be allocated. If the user does not have a buffer allocated they have to know details of the encoding in order to know the size of the buffer to allocate.
Modifications:
- Add writeUtf8 and writeAscii which take a ByteBufAllocator and allocate a ByteBuf of the correct size for the user
Result:
ByteBufUtil methods which are easier to use if the user doesn't already have a ByteBuf.
Motivation:
[#4842] introduced 4 new methods but missed to implement advanced leak detection for these.
Modifications:
Correctly implement advanced leak detection for these methods.
Result:
Advanced leak detection works for all methods as expected.
Motivation
See ##3229
Modifications:
Add methods with position independent FileChannel calls to ByteBuf and its subclasses.
Results:
The user can use these new methods to read/write ByteBuff without updating FileChannel's position.
Motivation:
f750d6e36c added support for surrogates in the writeUtf8 conversion. However exceptions are thrown if invalid input is detected, but the JDK (and slow path of writeUtf8) uses a replacement character and does not throw. We should behave the same way.
Modificiations:
- Don't throw in ByteBufUtil.writeUtf8, and instead use a replacement character consistent with the JDK
Result:
ByteBufUtil.writeUtf8 behavior is consistent with the JDK UTF_8 conversion.
Motivation:
We missed to take the byte[] into account when try to access the bytes and so produce a segfault.
Modifications:
Correctly pass the byte[] in.
Result:
No more segfault.
Motivation:
The current interface for CompositeByteBuf.addComponent is not clear under what conditions ownership is transferred when addComponent is called. There should be a well defined behavior so that users can ensure that no leaks occur.
Modifications:
- CompositeByteBuf.addComponent should always assume reference count ownership
Result:
Users that call CompositeByteBuf.addComponent do not have to independently check if the buffer's ownership has been transferred and if not independently release the buffer.
Fixes https://github.com/netty/netty/issues/4760
Motivation:
CompositeByteBuf only implemented simple resource leak detection and how it was implemented was completly different to the way it was for ByteBuf. The other problem was that slice(), duplicate() and others would not return a resource leak enabled buffer.
Modifications:
- Proper implementation for all level of resource leak detection for CompositeByteBuf
Result:
Proper resource leak detection for CompositeByteBuf.
Motivation:
We missed reporting *LE operations when AdvancedLeakAwareByteBuf was used. This could lead to incomplete access reports.
Modifications:
Correctly record access for *LE operations.
Result:
Correct leak reports.
Motivation:
AdvancedLeakAwareByteBuf.forEachByteDesc(...) called recordLeakNonRefCountingOperation() two times which resulted in incorrect leak detection reports.
Modifications:
Remove duplicated call to recordLeakNonRefCountingOperation()
Result:
Correct leak detection results
Motivation:
There are a few buffer leaks related to how Unpooled.wrapped and Base64.encode is used.
Modifications:
- Fix usages of Bas64.encode to correct leaks
- Clarify interface of Unpooled.wrapped* to ensure reference count ownership is clearly defined.
Result:
Reference count code is more clearly defined and less leaks are possible.
Motivation:
Javadoc reports errors about invalid docs.
Modifications:
Fix some errors reported by javadoc.
Result:
A lot of javadoc errors are fixed by this patch.
Motivation:
There are some wrong links and tags in javadoc.
Modifications:
Fix the wrong links and tags in javadoc.
Result:
These links will work correctly in javadoc.
Motivation:
UTF-16 can not represent the full range of Unicode characters, and thus has the concept of Surrogate Pair (http://unicode.org/glossary/#surrogate_pair) where 2 16-bit code units can be used to represent the missing characters. ByteBufUtil.writeUtf8 is currently does not support this and is thus incomplete.
Modifications:
- Add support for surrogate pairs in ByteBufUtil.writeUtf8
Result:
ByteBufUtil.writeUtf8 now supports surrogate pairs and is correctly converting to UTF-8.
Motivation:
We missed to check if the dst is ready only before using unsafe to copy data into it which lead to data-corruption. We need to ensure we respect ready only ByteBuffer.
Modifications:
- Correctly check if the dst is ready only before copy data into it in UnsafeByteBufUtil
- Also make it work for buffers that are not direct and not have an array
Result:
No more data corruption possible if the dst buffer is readonly and unsafe buffer implementation is used.
Motivation:
Initialisation of the ByteBufUtil class, a class frequently used is
delayed because a significant number of String operations is performed to
fill a HEXDUMP_ROWPREFIXES array. This array also sticks to the Strings
forever.
It is quite likely that applications never use the hexdump facility.
Modification:
Moved the static initialisation and references to a static inner class.
This delays initialisation (and memory usage) until actually needed.
The API is kept as is.
Result:
Faster startup time, less memory usage for most netty using applications.
As discussed in #3209, this PR adds Little Endian accessors
to ByteBuf and descendants.
Corresponding accessors were added to UnsafeByteBufUtil,
HeapByteBufferUtil to avoid calling `reverseBytes`.
Deprecate `order()`, `order(buf)` and `SwappedByteBuf`.
Motivation:
The method setBytes did not work correctly because read-only ByteBuffer
does not allow access to its underlying array.
Modifications:
New case was added for ByteBuffer's that are not direct and do not have an array.
These must be handled by copying the data into a temporary array. Unit test was
added to test this case.
Result:
It is now possible to use read-only ByteBuffer as the source
for the setBytes method.
Motivation:
The HTTP/2 RFC (https://tools.ietf.org/html/rfc7540#section-8.1.2) indicates that header names consist of ASCII characters. We currently use ByteString to represent HTTP/2 header names. The HTTP/2 RFC (https://tools.ietf.org/html/rfc7540#section-10.3) also eludes to header values inheriting the same validity characteristics as HTTP/1.x. Using AsciiString for the value type of HTTP/2 headers would allow for re-use of predefined HTTP/1.x values, and make comparisons more intuitive. The Headers<T> interface could also be expanded to allow for easier use of header types which do not have the same Key and Value type.
Motivation:
- Change Headers<T> to Headers<K, V>
- Change Http2Headers<ByteString> to Http2Headers<CharSequence, CharSequence>
- Remove ByteString. Having AsciiString extend ByteString complicates equality comparisons when the hash code algorithm is no longer shared.
Result:
Http2Header types are more representative of the HTTP/2 RFC, and relationship between HTTP/2 header name/values more directly relates to HTTP/1.x header names/values.
Motivation:
Modulo operations are slow, we can use bitwise operation to detect if resource leak detection must be done while sampling.
Modifications:
- Ensure the interval is a power of two
- Use bitwise operation for sampling
- Add benchmark.
Result:
Faster sampling.
Motivation:
Fix a race condition that was introduced by f18990a8a5 that could lead to a NPE when allocate from the PooledByteBufAllocator concurrently by many threads.
Modifications:
Correctly synchronize on the PoolSubPage head.
Result:
No more race.
Motiviation:
We have a lot of duplicated code which makes it hard to maintain.
Modification:
Move shared code to UnsafeByteBufUtil and use it in the implementations.
Result:
Less duplicated code and so easier to maintain.
Motiviation:
We have a lot of duplicated code which makes it hard to maintain.
Modification:
Move shared code to HeapByteBufUtil and use it in the implementations.
Result:
Less duplicated code and so easier to maintain.
Motivation:
sun.misc.Unsafe allows us to handle heap ByteBuf in a more efficient matter. We should use special ByteBuf implementation when sun.misc.Unsafe can be used to increase performance.
Modifications:
- Add PooledUnsafeHeapByteBuf and UnpooledUnsafeHeapByteBuf that are used when sun.misc.Unsafe is ready to use.
- Add UnsafeHeapSwappedByteBuf
Result:
Better performance when using heap buffers and sun.misc.Unsafe is ready to use.
Motivation:
We had a bug in our implemention which double "reversed" bytes on systems which not support unaligned access.
Modifications:
- Correctly only reverse bytes if needed.
- Share code between unsafe implementations.
Result:
No more data-corruption on sytems without unaligned access.
Motivation:
When moving bytes between a PooledUnsafeDirectByteBuf or an UnpooledUnsafeDirectByteBuf
and a ByteBuffer, a temp ByteBuffer is allocated and will need to be GCed. This is a
common case since a ByteBuffer is always needed when reading/writing on a file,
for example.
Modifications:
Use PlatformDependent.copyMemory() to avoid the need for the temp ByteBuffer
Result:
No temp ByteBuffer allocated and GCed.
Motivation:
SlicedByteBuf did double reference count checking for various bulk operations, which affects performance.
Modifications:
- Add package private method to AbstractByteBuf that can be used to check indexes without check the reference count
- Use this new method in the bulk operation os SlicedByteBuf as the reference count checks take place on the wrapped buffer anyway
- Fix test-case to not try to read data that is out of the bounds of the buffer.
Result:
Better performance on bulk operations when using SlicedByteBuf (and sub-classes)
Motivation:
Some of the tests in the buffer module contained unused code. Some of the tests also used unnecessary inheritance which could be avoided to simplify code.
Modifications:
Cleanup the test cases.
Result:
Cleaner code, less cruft.
Motivation:
We need to always return a real slice even when the requested length is 0. This is needed as otherwise we not correctly share the reference count and so may leak a buffer if the user call release() on the returned slice and expect it to decrement the reference count of the "parent" buffer.
Modifications:
- Always return a real slice
- Add unit test for the bug.
Result:
No more leak possible when a user requests a slice of length 0 of a SlicedByteBuf.
Motivation:
SlicedByteBuf can be used for any ByteBuf implementations and so can not do any optimizations that could be done
when AbstractByteBuf is sliced.
Modifications:
- Add SlicedAbstractByteBuf that can eliminate range and reference count checks for _get* and _set* methods.
Result:
Faster SlicedByteBuf implementations for AbstractByteBuf sub-classes.
Motivation:
DuplicatedByteBuf can be used for any ByteBuf implementations and so can not do any optimizations that could be done
when AbstractByteBuf is duplicated.
Modifications:
- Add DuplicatedAbstractByteBuf that can eliminate range and reference count checks for _get* and _set* methods.
Result:
Faster DuplicatedByteBuf implementations for AbstractByteBuf sub-classes.
Motivation:
Calling AbstractByteBuf.toString(..., Charset) is used quite frequently by users but produce a lot of GC.
Modification:
- Use a FastThreadLocal to store the CharBuffer that are needed for decoding.
- Use internalNioBuffer(...) when possible
Result:
Less object creation / Less GC
Motiviation:
Checking reference count on every access on a ByteBuf can have some big performance overhead depending on how the access pattern is. If the user is sure that there are no reference count errors on his side it should be possible to disable the check and so gain the max performance.
Modification:
- Add io.netty.buffer.bytebuf.checkAccessible system property which allows to disable the checks. Enabled by default.
- Add microbenchmark
Result:
Increased performance for operations on the ByteBuf.
Motivation:
We should minimize and optimize bound checks as much as possible to get the most out of performance.
Modifications:
- Use bitwise operations to remove branching
- Remove branches when possible
Result:
Better performance for various operations.
Motivation:
ByteBufUtil.writeUtf8(...) / writeUsAscii(...) can use a fast-path when writing into AbstractByteBuf. We should try to unwrap WrappedByteBuf implementations so
we are able to do the same on wrapped AbstractByteBuf instances.
Modifications:
- Try to unwrap WrappedByteBuf to use the fast-path
Result:
Faster writing of utf8 and usascii for WrappedByteBuf instances.
Motivation:
As toString() is often used while logging we need to ensure this produces no exception.
Modifications:
Ensure we never throw an IllegalReferenceCountException.
Result:
Be able to log without produce exceptions.
Motivation:
The logic in ByteBufUtilTest.ByteBufUtilTest is wrong. It is attempting to ensure at least 1 byte is different in the ranges that will be subsequently compared, but does so before the copy operation.
Modifications:
- Move the code which ensures there is a difference to after the copy
- Simplify the logic which ensures there is a difference
Result:
Unit test now operates as designed.
Motivation:
ByteBufUtilTest.notEqualsBufferSubsections is testing non-equality but just uses random numbers to assume they will not be equal. Even after the random bytes are generated we should check they are infact not equal so the test has no chance of failing when it should not.
Modifications:
- Loop through bytes in notEqualsBufferSubsections after they are randomly generated to ensure there is atleast 1 difference.
Result:
More reliable unit tests.
Motivation:
We need to ensure all markers are reset when doing an allocation via the PooledByteBufAllocator. This was not the always the case.
Modifications:
Move all logic that needs to get executed when reuse a PooledByteBuf into one place and call it.
Result:
Correct behavior
Motivation:
When AsciiString is used we can optimize the write operation done by ByteBufUtil.writeUsAscii(...)
Modifications:
Sepcial handle AsciiString.
Result:
Faster writing of AsciiString.
Motivation:
The configurable property value recently added was not logged like others properties.
Modifications:
Added debug log with effective value applied.
Result:
Consistent with other properties
Motivation:
Leak detector, when it detects a leak, will print the last 5 stack
traces that touched the ByteBuf. In some cases that might not be enough
to identify the root cause of the leak.
Also, sometimes users might not be interested in tracing all the
operations on the buffer, but just the ones that are affecting the
reference count.
Modifications:
Added command line properties to override default values:
* Allow to configure max number of stack traces to collect
* Allow to only record retain/release operation on buffers
Result:
Users can increase the number of stack traces to debug buffer leaks
with lot of retain/release operations.
Motivation:
Currently the "derived" buffer will only ever be recycled if the release call is made on the "derived" object, and the "wrapped" buffer ends up being "fully released" (aka refcount goes to 0). From my experience this is not the common use case and thus the "derived" buffers will not be recycled.
Modifications:
- revert https://github.com/netty/netty/pull/3788
Result:
Less complexity, and less code to create new objects in majority of cases.
Motivation:
Even though MemoryRegionCache$Entry instances are allocated through a recycler they are not properly recycled,
leaving a lot of instances to be GCed along with Recycler$DefaultHandle objects.
Fixes#4071
Modification:
Recycle Entry when done using it.
Result:
Less GCed objects.
Motiviation:
The current read loops don't fascilitate reading a maximum amount of bytes. This capability is useful to have more fine grain control over how much data is injested.
Modifications:
- Add a setMaxBytesPerRead(int) and getMaxBytesPerRead() to ChannelConfig
- Add a setMaxBytesPerIndividualRead(int) and getMaxBytesPerIndividualRead to ChannelConfig
- Add methods to RecvByteBufAllocator so that a pluggable scheme can be used to control the behavior of the read loop.
- Modify read loop for all transport types to respect the new RecvByteBufAllocator API
Result:
The ability to control how many bytes are read for each read operation/loop, and a more extensible read loop.
Motivation:
As we modify the position of the passed in ByteBuffer's this methods are not thread-safe.
Modifications:
Duplicate the input ByteBuffers before copy the content to byte[].
Result:
Unpooled.copiedBuffer(ByteBuffer) and copiedBuffer(ByteBuffer...) is now thread-safe.
Motivation:
The javadoc of ByteBuf contained some out-dated code.
Modifications:
Update code example in javadoc to use netty 4+ API
Result:
Correct javadocs
Motivation:
FixedCompositeByteBuf does not properly implement a number of methods for
copying its content to direct buffers and output streams
Modifications:
Replace improper use of capacity() with readableBytes() when computing offesets during writes
Result:
Copying works correctly
Motivation:
At the moment we use 1 * cores as default mimimum for pool arenas. This can easily lead to conditions as we use 2 * cores as default for EventLoop's when using NIO or EPOLL. If we choose a smaller number we will run into hotspots as allocation and deallocation needs to be synchronized on the PoolArena.
Modifications:
Change the default number of arenas to 2 * cores.
Result:
Less conditions when using the default settings.
Motivation:
Dumping the content of a ByteBuf in a hex format is very useful.
Modifications:
Move code into ByteBufUtil so its easy to reuse.
Result:
Easy to reuse dumping code.
Motivation:
PoolThreadCache did only cache allocations if the allocation and deallocation Thread were the same. This is not optimal as often people write from differen thread then the actual EventLoop thread.
Modification:
- Add MpscArrayQueue which was forked from jctools and lightly modified.
- Use MpscArrayQueue for caches and always add buffer back to the cache that belongs to the allocation thread.
Result:
ThreadPoolCache is now also usable and so gives performance improvements when allocation and deallocation thread are different.
Performance when using same thread for allocation and deallocation is noticable worse then before.
Motivation:
Currently we hold a lock on the PoolArena when we allocate / free PoolSubpages, which is wasteful as this also affects "normal" allocations. The same is true vice-verse.
Modifications:
Ensure we synchronize on the head of the PoolSubPages pool. This is done per size and so it is possible to concurrently allocate / deallocate PoolSubPages with different sizes, and also normal allocations.
Result:
Less condition and so faster allocation/deallocation.
Before this commit:
xxx:~/wrk $ ./wrk -H 'Connection: keep-alive' -d 120 -c 256 -t 16 -s scripts/pipeline-many.lua http://xxx:8080/plaintext
Running 2m test @ http://xxx:8080/plaintext
16 threads and 256 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 17.61ms 29.52ms 689.73ms 97.27%
Req/Sec 278.93k 41.97k 351.04k 84.83%
530527460 requests in 2.00m, 71.64GB read
Requests/sec: 4422226.13
Transfer/sec: 611.52MB
After this commit:
xxx:~/wrk $ ./wrk -H 'Connection: keep-alive' -d 120 -c 256 -t 16 -s scripts/pipeline-many.lua http://xxx:8080/plaintext
Running 2m test @ http://xxx:8080/plaintext
16 threads and 256 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 15.85ms 24.50ms 681.61ms 97.42%
Req/Sec 287.14k 38.39k 360.33k 85.88%
547902773 requests in 2.00m, 73.99GB read
Requests/sec: 4567066.11
Transfer/sec: 631.55MB
This is reproducable every time.
Motiviation:
At the moment we sometimes hold the lock on the PoolArena during destroy a PoolChunk. This is not needed.
Modification:
- Ensure we not hold the lock during destroy a PoolChunk
- Move all synchronized usage in PoolArena
- Cleanup
Result:
Less condition.
Motivation:
The PooledByteBufAllocator is more or less a black-box atm. We need to expose some metrics to allow the user to get a better idea how to tune it.
Modifications:
- Expose different metrics via PooledByteBufAllocator
- Add *Metrics interfaces
Result:
It is now easy to gather metrics and detail about the PooledByteBufAllocator and so get a better understanding about resource-usage etc.
Motivation:
At the moment when calling slice(...) or duplicate(...) on a Pooled*ByteBuf a new SlicedByteBuf or DuplicatedByteBuf. This can create a lot of GC.
Modifications:
Add PooledSlicedByteBuf and PooledDuplicatedByteBuf which will be used when a PooledByteBuf is used.
Result:
Less GC.
Motivation:
From the javadocs of ByteBuf.duplicate() it is not clear if the reader and writer marks will be duplicated.
Modifications:
Add sentence to clarify that marks will not be duplicated.
Result:
Clear semantics.
Motivation:
When allocate a PooledByteBuf we need to ensure to also reset the markers for the readerIndex and writerIndex.
Modifications:
- Correct reset the markers
- Add test-case for it
Result:
Correctly reset markers.
Motiviation:
When tried to allocate tiny and small sized and failed to serve these out of the PoolSubPage we exit the synchronization
block just to enter it again when call allocateNormal(...).
Modification:
Not exit the synchronized block until allocateNormal(...) is done.
Result:
Better performance.
Motivation:
The way of firstIndexOf and lastIndexOf iterating the ByteBuf is similar to forEachByte and forEachByteDesc, but have many range checks.
Modifications:
Use forEachByte and a IndexOfProcessor to find occurrence.
Result:
eliminate range checks
Motivation:
The ByteString class currently assumes the underlying array will be a complete representation of data. This is limiting as it does not allow a subsection of another array to be used. The forces copy operations to take place to compensate for the lack of API support.
Modifications:
- add arrayOffset method to ByteString
- modify all ByteString and AsciiString methods that loop over or index into the underlying array to use this offset
- update all code that uses ByteString.array to ensure it accounts for the offset
- add unit tests to test the implementation respects the offset
Result:
ByteString and AsciiString can represent a sub region of a byte[].
Motivation:
CompositeByteBuf.iterator() currently creates a new ArrayList and fill it with the ByteBufs, which is more expensive then it needs to be.
Modifications:
- Use special Iterator implementation
Result:
Less overhead when calling iterator()
Motivation:
The usage and code within AsciiString has exceeded the original design scope for this class. Its usage as a binary string is confusing and on the verge of violating interface assumptions in some spots.
Modifications:
- ByteString will be created as a base class to AsciiString. All of the generic byte handling processing will live in ByteString and all the special character encoding will live in AsciiString.
Results:
The AsciiString interface will be clarified. Users of AsciiString can now be clear of the limitations the class imposes while users of the ByteString class don't have to live with those limitations.
Motivation:
When create NormalMemoryRegionCache for PoolThreadCache, we overbooked
cache array size. This means unnecessary overhead for thread local cache
as we will create multi cache enties for each element in cache array.
Modifications:
change:
int arraySize = Math.max(1, max / area.pageSize);
to:
int arraySize = Math.max(1, log2(max / area.pageSize) + 1);
Result:
Now arraySize won't introduce unnecessary overhead.
Changes to be committed:
modified: buffer/src/main/java/io/netty/buffer/PoolThreadCache.java
Motivation:
CompositeByteBuf has an iterator() method but fails to implement Iterable
Modifications:
Let CompositeByteBuf implement Iterable<ByteBuf>
Result:
Easier usage
Motivation:
We missed to dereference the chunk and tmpNioBuf when calling deallocate(). This means the GC can not collect these as we still hold a reference while have the PooledByteBuf in the recycler stack.
Modifications:
Dereference chunk and tmpNioBuf.
Result:
GC can collect things.
Motiviation:
At the moment we use FIFO for the PoolThreadCache which is sub-optimal as this may reduce the changes to have the cached memory actual still in the cpu-cache.
Modification:
- Change to use LIFO as this increase the chance to be able to serve buffers from the cpu-cache
Results:
Faster allocation out of the ThreadLocal cache.
Before the commit:
[xxx wrk]$ ./wrk -H 'Connection: keep-alive' -d 120 -c 256 -t 16 -s scripts/pipeline-many.lua http://xxx:8080/plaintext
Running 2m test @ http://xxx:8080/plaintext
16 threads and 256 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 14.69ms 10.06ms 131.43ms 80.10%
Req/Sec 283.89k 40.37k 433.69k 66.81%
533859742 requests in 2.00m, 72.09GB read
Requests/sec: 4449510.51
Transfer/sec: 615.29MB
After the commit:
[xxx wrk]$ ./wrk -H 'Connection: keep-alive' -d 120 -c 256 -t 16 -s scripts/pipeline-many.lua http://xxx:8080/plaintext
Running 2m test @ http://xxx:8080/plaintext
16 threads and 256 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 16.38ms 26.32ms 734.06ms 97.38%
Req/Sec 283.86k 39.31k 361.69k 83.38%
540836511 requests in 2.00m, 73.04GB read
Requests/sec: 4508150.18
Transfer/sec: 623.40MB
Motivation:
The DefaultHttp2ConnectionDecoder class is calling verifyPrefaceReceived() for almost every frame event at all times.
The Http2ConnectionHandler class is calling readClientPrefaceString() on every decode event.
Modifications:
- DefaultHttp2ConnectionDecoder should not have to continuously call verifyPrefaceReceived() because it transitions boolean state 1 time for each connection.
- Http2ConnectionHandler should not have to continuously call readClientPrefaceString() because it transitions boolean state 1 time for each connection.
Result:
- Less conditional checks for the mainstream usage of the connection.
Motivation:
`Unpooled` javadoc's mentioned the generation of hex dump and swapping an integer's byte order,
which are actually provided by `ByteBufUtil`.
Modifications:
Sentence moved to `ByteBufUtil` javadoc.
Result:
`Unpooled` javadoc is correct.
Motivation:
At the moment we have two problems:
- CompositeByteBuf.addComponent(...) will not add the supplied buffer to the CompositeByteBuf if its empty, which means it will not be released on CompositeByteBuf.release() call. This is a problem as a user will expect everything added will be released (the user not know we not added it).
- CompositeByteBuf.addComponents(...) will either add no buffers if none is readable and so has the same problem as addComponent(...) or directly release the ByteBuf if at least one ByteBuf is readable. Again this gives inconsistent handling and may lead to memory leaks.
Modifications:
- Always add the buffer to the CompositeByteBuf and so release it on release call.
Result:
Consistent handling and no buffer leaks.
Motivation:
When a CompositeByteBuf is empty (i.e. has no component), its internal
memory access operations do not always behave as expected.
Modifications:
Check if the nunmber of components is zero. If so, return an empty
array or an empty NIO buffer, etc.
Result:
More robustness
- Ensure an EmptyByteBuf has an array, an NIO buffer, and a memory
address at the same time
- Add an assertion that checks if EMPTY_BUFFER is an EmptyByteBuf,
just in case we make a mistake in the future
Motivation:
We expose no methods in ByteBuf to directly write a CharSequence into it. This leads to have the user either convert the CharSequence first to a byte array or use CharsetEncoder. Both cases have some overheads and we can do a lot better for well known Charsets like UTF-8 and ASCII.
Modifications:
Add ByteBufUtil.writeAscii(...) and ByteBufUtil.writeUtf8(...) which can do the task in an optimized way. This is especially true if the passed in ByteBuf extends AbstractByteBuf which is true for all of our implementations which not wrap another ByteBuf.
Result:
Writing an ASCII and UTF-8 CharSequence into a AbstractByteBuf is a lot faster then what the user could do by himself as we can make use of some package private methods and so eliminate reference and range checks. When the Charseq is not ASCII or UTF-8 we can still do a very good job and are on par in most of the cases with what the user would do.
The following benchmark shows the improvements:
Result: 2456866.966 ?(99.9%) 59066.370 ops/s [Average]
Statistics: (min, avg, max) = (2297025.189, 2456866.966, 2586003.225), stdev = 78851.914
Confidence interval (99.9%): [2397800.596, 2515933.336]
Benchmark Mode Samples Score Score error Units
i.n.m.b.ByteBufUtilBenchmark.writeAscii thrpt 50 9398165.238 131503.098 ops/s
i.n.m.b.ByteBufUtilBenchmark.writeAsciiString thrpt 50 9695177.968 176684.821 ops/s
i.n.m.b.ByteBufUtilBenchmark.writeAsciiStringViaArray thrpt 50 4788597.415 83181.549 ops/s
i.n.m.b.ByteBufUtilBenchmark.writeAsciiStringViaArrayWrapped thrpt 50 4722297.435 98984.491 ops/s
i.n.m.b.ByteBufUtilBenchmark.writeAsciiStringWrapped thrpt 50 4028689.762 66192.505 ops/s
i.n.m.b.ByteBufUtilBenchmark.writeAsciiViaArray thrpt 50 3234841.565 91308.009 ops/s
i.n.m.b.ByteBufUtilBenchmark.writeAsciiViaArrayWrapped thrpt 50 3311387.474 39018.933 ops/s
i.n.m.b.ByteBufUtilBenchmark.writeAsciiWrapped thrpt 50 3379764.250 66735.415 ops/s
i.n.m.b.ByteBufUtilBenchmark.writeUtf8 thrpt 50 5671116.821 101760.081 ops/s
i.n.m.b.ByteBufUtilBenchmark.writeUtf8String thrpt 50 5682733.440 111874.084 ops/s
i.n.m.b.ByteBufUtilBenchmark.writeUtf8StringViaArray thrpt 50 3564548.995 55709.512 ops/s
i.n.m.b.ByteBufUtilBenchmark.writeUtf8StringViaArrayWrapped thrpt 50 3621053.671 47632.820 ops/s
i.n.m.b.ByteBufUtilBenchmark.writeUtf8StringWrapped thrpt 50 2634029.071 52304.876 ops/s
i.n.m.b.ByteBufUtilBenchmark.writeUtf8ViaArray thrpt 50 3397049.332 57784.119 ops/s
i.n.m.b.ByteBufUtilBenchmark.writeUtf8ViaArrayWrapped thrpt 50 3318685.262 35869.562 ops/s
i.n.m.b.ByteBufUtilBenchmark.writeUtf8Wrapped thrpt 50 2473791.249 46423.114 ops/s
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1,387.417 sec - in io.netty.microbench.buffer.ByteBufUtilBenchmark
Results :
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0
Results :
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0
The *ViaArray* benchmarks are basically doing a toString().getBytes(Charset) which the others are using ByteBufUtil.write*(...).
Motivation:
CompositeByteBuf.nioBuffers(...) returns an empty ByteBuffer array if the specified length is 0. This is not consistent with other ByteBuf implementations which return an ByteBuffer array of size 1 with an empty ByteBuffer included.
Modifications:
Make CompositeByteBuf.nioBuffers(...) consistent with other ByteBuf implementations.
Result:
Consistent and correct behaviour of nioBufffers(...)
Motivation:
When calling slice(...) on a ByteBuf the returned ByteBuf should be the slice of a ByteBuf and shares it's reference count. This is important as it is perfect legal to use buf.slice(...).release() and have both, the slice and the original ByteBuf released. At the moment this is only the case if the requested slice size is > 0. This makes the behavior inconsistent and so may lead to a memory leak.
Modifications:
- Never return Unpooled.EMPTY_BUFFER when calling slice(...).
- Adding test case for buffer.slice(...).release() and buffer.duplicate(...).release()
Result:
Consistent behaviour and so no more leaks possible.
Motivation:
Before we missed to check if a buffer was released before we return the backing byte array or memoryaddress. This could lead to JVM crashes when someone tried various bulk operations on the Unsafe*ByteBuf implementations.
Modifications:
Always check if the buffer is released before all to return the byte array and memoryaddress.
Result:
No more JVM crashes because of released buffers when doing bulk operations on Unsafe*ByteBuf implementations.
Motivation:
Found performance issues via FindBugs and PMD.
Modifications:
- Removed unnecessary boxing/unboxing operations in DefaultTextHeaders.convertToInt(CharSequence) and DefaultTextHeaders.convertToLong(CharSequence). A boxed primitive is created from a string, just to extract the unboxed primitive value.
- Added a static modifier for DefaultHttp2Connection.ParentChangedEvent class. This class is an inner class, but does not use its embedded reference to the object which created it. This reference makes the instances of the class larger, and may keep the reference to the creator object alive longer than necessary.
- Added a static compiled Pattern to avoid compile it each time it is used when we need to replace some part of authority.
- Improved using of StringBuilders.
Result:
Performance improvements.
Motivation:
We introduced a PoolThreadCache which is used in our PooledByteBufAllocator to reduce the synchronization overhead on PoolArenas when allocate / deallocate PooledByteBuf instances. This cache is used for both the allocation path and deallocation path by:
- Look for cached memory in the PoolThreadCache for the Thread that tries to allocate a new PooledByteBuf and if one is found return it.
- Add the memory that is used by a PooledByteBuf to the PoolThreadCache of the Thread that release the PooledByteBuf
This works out very well when all allocation / deallocation is done in the EventLoop as the EventLoop will be used for read and write. On the otherside this can lead to surprising side-effects if the user allocate from outside the EventLoop and and pass the ByteBuf over for writing. The problem here is that the memory will be added to the PoolThreadCache that did the actual write on the underlying transport and not on the Thread that previously allocated the buffer.
Modifications:
Don't cache if different Threads are used for allocating/deallocating
Result:
Less confusing behavior for users that allocate PooledByteBufs from outside the EventLoop.
Motivation:
When MemoryRegionCache.trim() is called, some unused cache entries will be freed (started from head). However, in MeoryRegionCache.trim() the head is not updated, which make entry list's head point to an entry whose chunk is null now and following allocate of MeoryRegionCache will return false immediately.
In other word, cache is no longer usable once trim happen.
Modifications:
Update head to correct idx after free entries in trim().
Result:
MemoryRegionCache behaves correctly even after calling trim().
Motivation:
We received a bug-report that the ByteBuf.refCnt() does sometimes not show the correct value when release() and refCnt() is called from different Threads.
Modifications:
Add test-case which shows that all is working like expected
Result:
Test-case added which shows everything is ok.
Related issue: #2028
Motivation:
Some copiedBuffer() methods in Unpooled allocated a direct buffer. An
allocation of a direct buffer is an expensive operation, and thus should
be avoided for unpooled buffers.
Modifications:
- Use heap buffers in all copiedBuffer() methods
Result:
Unpooled.copiedBuffers() are less expensive now.
Motivation:
While porting some changes from 4.0 to 4.1 and master branch I changed the default allocator from pooled to unpooled by mistake. This should be reverted. The guilty commit is 4a3ef90381.
Thanks to @blucas for spotting this.
Modifications:
Revert changes related to allocator.
Result:
Use the correct default allocator again.
Motivation:
We did various changes related to the ChannelOutboundBuffer in 4.0 branch. This commit port all of them over and so make sure our branches are synced in terms of these changes.
Related to [#2734], [#2709], [#2729], [#2710] and [#2693] .
Modification:
Port all changes that was done on the ChannelOutboundBuffer.
This includes the port of the following commits:
- 73dfd7c01b
- 997d8c32d2
- e282e504f1
- 5e5d1a58fd
- 8ee3575e72
- d6f0d12a86
- 16e50765d1
- 3f3e66c31a
Result:
- Less memory usage by ChannelOutboundBuffer
- Same code as in 4.0 branch
- Make it possible to use ChannelOutboundBuffer with Channel implementation that not extends AbstractChannel
Modifications:
- Added a static modifier for CompositeByteBuf.Component.
This class is an inner class, but does not use its embedded reference to the object which created it. This reference makes the instances of the class larger, and may keep the reference to the creator object alive longer than necessary.
- Removed unnecessary boxing/unboxing operations in HttpResponseDecoder, RtspResponseDecoder, PerMessageDeflateClientExtensionHandshaker and PerMessageDeflateServerExtensionHandshaker
A boxed primitive is created from a String, just to extract the unboxed primitive value.
- Removed unnecessary 3 times calculations in DiskAttribute.addContent(...).
- Removed unnecessary checks if file exists before call mkdirs() in NativeLibraryLoader and PlatformDependent.
Because the method mkdirs() has this check inside.
- Removed unnecessary `instanceof AsciiString` check in StompSubframeAggregator.contentLength(StompHeadersSubframe) and StompSubframeDecoder.getContentLength(StompHeaders, long).
Because StompHeaders.get(CharSequence) always returns java.lang.String.
Motivation:
I introduced ensureAccessible() class as part of 6c47cc9711 in some places. Unfortunally I also added some where these are not needed and so caused a performance regression.
Modification:
Remove calls where not needed.
Result:
Fixed performance regression.
Motivation:
I introduced range checks as part of 6c47cc9711 in some places. Unfortunally I also added some where these are not needed and so caused a performance regression.
Modification:
Remove range checks where not needed
Result:
Fixed performance regression.
Motivation:
CompositeByteBuf.deallocate generates unnecessary GC pressure when using the 'foreach' loop, as a 'foreach' loop creates an iterator when looping.
Modification:
Convert 'foreach' loop into regular 'for' loop.
Result:
Less GC pressure (and possibly more throughput) as the 'for' loop does not create an iterator
Motivation:
AbstractByteBufTest.testInternalBuffer() uses writeByte() operations to
populate the sample data. Usually, this isn't a problem, but it starts
to take a lot of time when the resource leak detection level gets
higher.
In our CI machine, testInternalBuffer() takes more than 30 minutes,
causing the build timeout when the 'leak' profile is active (paranoid
level resource detection.)
Modification:
Populate the sample data using ThreadLocalRandom.nextBytes() instead of
using millions of writeByte() operations.
Result:
Test runs much faster when leak detection level is high.
Motivation:
Because of how we use reference counting we need to check for the reference count before each operation that touches the underlying memory. This is especially true as we use sun.misc.Cleaner.clean() to release the memory ASAP when possible. Because of this the user may cause a SEGFAULT if an operation is called that tries to access the backing memory after it was released.
Modification:
Correctly check the reference count on all methods that access the underlying memory or expose it via a ByteBuffer.
Result:
Safer usage of ByteBuf
- Using short[] for memoryMap did not improve performance. Reverting
back to the original dual-byte[] structure in favor of simplicity.
- Optimize allocateRun() which yields small performence improvement
- Use local variable when member fields are accessed more than once
Motivation:
Depth-first search is not always efficient for buddy allocation.
Modification:
Employ a new faster search algorithm with different memoryMap layout.
Result:
With thread-local cache disabled, we see a lot of performance
improvment, especially when the size of the allocation is as small as
the page size, which had the largest search space previously.