Motivation:
The CompositeByteBuf discardReadBytes / discardReadComponents methods are currently quite inefficient, including when there are no read components to discard. We would like to call the latter more frequently in ByteToMessageDecoder#COMPOSITE_CUMULATOR.
In the same context it would be beneficial to perform a "shallow copy" of a composite buffer (for example when it has a refcount > 1) to avoid having to allocate and copy the contained bytes just to obtain an "independent" cumulation.
Modifications:
- Optimize discardReadBytes() and discardReadComponents() implementations (start at first comp rather than performing a binary search for the readerIndex).
- New addFlattenedComponents(boolean,ByteBuf) method which performs a shallow copy if the provided buffer is also composite and avoids adding any empty buffers, plus unit test.
- Other minor optimizations to avoid unnecessary checks.
Results:
discardReadXX methods are faster, composite buffers can be easily appended without deepening the buffer "tree" or retaining unused components.
Motivation:
We synchronize on the chunk.arena when produce the String returned by PoolSubpage.toString() which may raise a NPE when chunk == null. Chunk == null for the head of the linked-list and so a NPE may raised by a debugger. This NPE can never happen in real code tho as we never access toString() of the head.
Modifications:
Add null checks and so fix the possible NPE
Result:
No NPE when using a debugger and inspect the PooledByteBufAllocator.
Motivation:
In MemoryRegionCache.Entry we use the Recycler to reduce GC pressure and churn. The problem is that these will also be recycled when the PoolThreadCache is collected and finalize() is called. This then can have the effect that we try to load class but the WebApp is already stoped.
This will produce an stacktrace like this on Tomcat:
```
19-Mar-2019 15:53:21.351 INFO [Finalizer] org.apache.catalina.loader.WebappClassLoaderBase.checkStateForResourceLoading Illegal access: this web application instance has been stopped already. Could not load [java.util.WeakHashMap]. The following stack trace is thrown for debugging purposes as well as to attempt to terminate the thread which caused the illegal access.
java.lang.IllegalStateException: Illegal access: this web application instance has been stopped already. Could not load [java.util.WeakHashMap]. The following stack trace is thrown for debugging purposes as well as to attempt to terminate the thread which caused the illegal access.
at org.apache.catalina.loader.WebappClassLoaderBase.checkStateForResourceLoading(WebappClassLoaderBase.java:1383)
at org.apache.catalina.loader.WebappClassLoaderBase.checkStateForClassLoading(WebappClassLoaderBase.java:1371)
at org.apache.catalina.loader.WebappClassLoaderBase.loadClass(WebappClassLoaderBase.java:1224)
at org.apache.catalina.loader.WebappClassLoaderBase.loadClass(WebappClassLoaderBase.java:1186)
at io.netty.util.Recycler$3.initialValue(Recycler.java:233)
at io.netty.util.Recycler$3.initialValue(Recycler.java:230)
at io.netty.util.concurrent.FastThreadLocal.initialize(FastThreadLocal.java:188)
at io.netty.util.concurrent.FastThreadLocal.get(FastThreadLocal.java:142)
at io.netty.util.Recycler$Stack.pushLater(Recycler.java:624)
at io.netty.util.Recycler$Stack.push(Recycler.java:597)
at io.netty.util.Recycler$DefaultHandle.recycle(Recycler.java:225)
at io.netty.buffer.PoolThreadCache$MemoryRegionCache$Entry.recycle(PoolThreadCache.java:478)
at io.netty.buffer.PoolThreadCache$MemoryRegionCache.freeEntry(PoolThreadCache.java:459)
at io.netty.buffer.PoolThreadCache$MemoryRegionCache.free(PoolThreadCache.java:430)
at io.netty.buffer.PoolThreadCache$MemoryRegionCache.free(PoolThreadCache.java:422)
at io.netty.buffer.PoolThreadCache.free(PoolThreadCache.java:279)
at io.netty.buffer.PoolThreadCache.free(PoolThreadCache.java:270)
at io.netty.buffer.PoolThreadCache.free(PoolThreadCache.java:241)
at io.netty.buffer.PoolThreadCache.finalize(PoolThreadCache.java:230)
at java.lang.System$2.invokeFinalize(System.java:1270)
at java.lang.ref.Finalizer.runFinalizer(Finalizer.java:102)
at java.lang.ref.Finalizer.access$100(Finalizer.java:34)
at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:217)
```
Beside this we also need to ensure we not try to lazy load SizeClass when the finalizer is used as it may not be present anymore if the ClassLoader is already destroyed.
This would produce an error like:
```
20-Mar-2019 11:26:35.254 INFO [Finalizer] org.apache.catalina.loader.WebappClassLoaderBase.checkStateForResourceLoading Illegal access: this web application instance has been stopped already. Could not load [io.netty.buffer.PoolArena$1]. The following stack trace is thrown for debugging purposes as well as to attempt to terminate the thread which caused the illegal access.
java.lang.IllegalStateException: Illegal access: this web application instance has been stopped already. Could not load [io.netty.buffer.PoolArena$1]. The following stack trace is thrown for debugging purposes as well as to attempt to terminate the thread which caused the illegal access.
at org.apache.catalina.loader.WebappClassLoaderBase.checkStateForResourceLoading(WebappClassLoaderBase.java:1383)
at org.apache.catalina.loader.WebappClassLoaderBase.checkStateForClassLoading(WebappClassLoaderBase.java:1371)
at org.apache.catalina.loader.WebappClassLoaderBase.loadClass(WebappClassLoaderBase.java:1224)
at org.apache.catalina.loader.WebappClassLoaderBase.loadClass(WebappClassLoaderBase.java:1186)
at io.netty.buffer.PoolArena.freeChunk(PoolArena.java:287)
at io.netty.buffer.PoolThreadCache$MemoryRegionCache.freeEntry(PoolThreadCache.java:464)
at io.netty.buffer.PoolThreadCache$MemoryRegionCache.free(PoolThreadCache.java:429)
at io.netty.buffer.PoolThreadCache$MemoryRegionCache.free(PoolThreadCache.java:421)
at io.netty.buffer.PoolThreadCache.free(PoolThreadCache.java:278)
at io.netty.buffer.PoolThreadCache.free(PoolThreadCache.java:269)
at io.netty.buffer.PoolThreadCache.free(PoolThreadCache.java:240)
at io.netty.buffer.PoolThreadCache.finalize(PoolThreadCache.java:229)
at java.lang.System$2.invokeFinalize(System.java:1270)
at java.lang.ref.Finalizer.runFinalizer(Finalizer.java:102)
at java.lang.ref.Finalizer.access$100(Finalizer.java:34)
at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:217)
```
Modifications:
- Only try to put the Entry back into the Recycler if the PoolThredCache is not destroyed because of the finalizer.
- Only try to access SizeClass if not triggered by finalizer.
Result:
No IllegalStateException anymoe when a webapp is reloaded in Tomcat that uses netty and uses the PooledByteBufAllocator.
Motivation:
The special case fixed in #8497 also requires that we keep a derived slice when trimming components in place, as done by the capacity(int) and discardReadBytes() methods.
Modifications:
Ensure that we keep a ref to trimmed components' original retained slice in capacity(int) and discardReadBytes() methods, so that it is released properly when the they are later freed. Add unit test which fails prior to the fix.
Result:
Edge case leak is eliminated.
Motivation:
PooledByteBufAllocator uses a PoolThreadCache per Thread that allocates / deallocates to minimize the performance overhead. This PoolThreadCache is trimmed after X allocations to free up buffers that are not allocated for a long time. This works out quite well when the app continues to allocate but fails if the app stops to allocate frequently (for whatever reason) and so a lot of memory is wasted and not given back to the arena / freed.
Modifications:
- Add a ThreadExecutorMap that offers multiple methods that wrap Runnable / ThreadFactory / Executor and allow to call ThreadExecutorMap.currentEventExecutor() to get the current executing EventExecutor for the calling Thread.
- Use these methods in the constructors of our EventExecutor implementations (which also covers the EventLoop implementations)
- Add io.netty.allocator.cacheTrimIntervalMillis system property which can be used to specify a fixed rate / interval on which we should try to trim the PoolThreadCache for a EventExecutor that allocates.
- Add PooledByteBufAllocator.trimCurrentThreadCache() to allow the user to trim the cache of the calling thread manually.
- Add testcases
- Introduce FastThreadLocal.getIfExists()
Result:
Allow to better / more frequently trim PoolThreadCache and so give back memory to the area / system.
Motivation:
We should run a CI job using J9 to ensure netty also works when using different JVMs.
Modifications:
- Adjust PooledByteBufAllocatorTest to be able to complete faster when using a JVM which takes longer when joining Threads (this seems to be the case with J9).
- Skip UDT tests on J9 as UDT is not supported there.
Result:
Be able to run CI against J9.
Motivation
There's some miscellaneous cleanup/simplification of CompositeByteBuf
which would help make the code a bit clearer.
Modifications
- Simplify web of constructors and addComponents methods, reducing
duplication of logic
- Rename `Component.freeIfNecessary()` method to just `free()`, which is
less confusing (see #8641)
- Make loop in addComponents0(...) method more verbose/readable (see
https://github.com/netty/netty/pull/8437#discussion_r232124414)
- Simplify addition/subtraction in setBytes(...) methods
Result
Smaller/clearer code
Motivation:
We have a utility method to check for > 0 and >0 arguments. We should use it.
Modification:
use checkPositive/checkPositiveOrZero instead of if statement.
Result:
Re-use utility method.
Motivation:
We need to update to a new checkstyle plugin to allow the usage of lambdas.
Modifications:
- Update to new plugin version.
- Fix checkstyle problems.
Result:
Be able to use checkstyle plugin which supports new Java syntax.
Motivation
In #8758, @doom369 reported an infinite loop bug in CompositeByteBuf
which was introduced in #8437.
This is the same small fix for that, along with fixes for two other bugs
found while re-inspecting the changes and adding unit tests.
Modification
- Replace recursive call to toComponentIndex with toComponentIndex0 as
intended
- Add missed "lastAccessed" racy cache invalidation in capacity(int)
method
- Fix incorrect determination of initial offset in non-zero cIndex case
of updateComponentOffsets method
- New unit tests for previously uncovered methods
Results
Fewer bugs.
Motivation:
The javadocs stating `IndexOutOfBoundsException` is thrown were
different from what `ByteBuf` actually did. We want to ensure the
Javadocs represent reality.
Modifications:
Updated javadocs on `write*`, `ensureWriteable`, `capacity`, and
`maxCapacity` methods.
Results:
Javadocs more closely match actual behaviour.
Motivation:
In versions of Netty prior to 4.1.31.Final, a CompositeByteBuf could be
created with any size (including potentially nonsensical negative
values). This behavior changed in e7737b993, which introduced a bounds
check to only allow for a component size greater than one. This broke
some existing use cases that attempted to create a byte buf with a
single component.
Modifications:
Lower the bounds check on numComponents to include the single component
case, but still throw an exception for anything less than one.
Add unit tests for the case of numComponents being less than, equal to,
and greater than this lower bound.
Result:
Return to the behavior of 4.1.30.Final, allowing one component, but
still include an explicit check against a lower bound.
Note that while creating a CompositeByteBuf with a single component is
in some ways a contradiction of the term "composite", this patch caters
for existing uses while excluding the clearly nonsensical case of asking
for a CompositeByteBuf with zero or fewer components.
Fixes#8613.
Motivation:
Often a temporary ByteBuffer is used which can be cached to reduce the GC pressure.
Modifications:
Cache the ByteBuffer in the PoolThreadCache as well.
Result:
Less GC.
Motivation
#8563 highlighted race conditions introduced by the prior optimistic
update optimization in 83a19d5650. These
were known at the time but considered acceptable given the perf
benefit in high contention scenarios.
This PR proposes a modified approach which provides roughly half the
gains but stronger concurrency semantics. Race conditions still exist
but their scope is narrowed to much less likely cases (releases
coinciding with retain overflow), and even in those
cases certain guarantees are still assured. Once release() returns true,
all subsequent release/retains are guaranteed to throw, and in
particular deallocate will be called at most once.
Modifications
- Use even numbers internally (including -ve) for live refcounts
- "Final" release changes to odd number (equivalent to refcount 0)
- Retain still uses faster getAndAdd, release uses CAS loop
- First CAS attempt uses non-volatile read
- Thread.yield() after a failed CAS provides a net gain
Result
More (though not completely) robust concurrency semantics for ref
counting; increased latency under high contention, but still roughly
twice as fast as the original logic. Bench results to follow
Motivation:
ByteBuf is used everywhere so we should try hard to be able to make things inlinable. During benchmarks it showed that writeCharSequence(...) fails to inline writeUtf8 because it is too big even if its hots.
Modifications:
Move less common code-path to extra method to allow inlining.
Result:
Be able to inline writeUtf8 in most cases.
Motivation:
Often a temporary ByteBuffer is used which can be cached to reduce the GC pressure.
Modifications:
Add a Deque per PoolChunk which will be used for caching.
Result:
Less GC.
Motivation:
When we create new chunk with memory aligned, the offset of direct memory should be
'alignment - address & (alignment - 1)', not just 'address & (alignment - 1)'.
Modification:
Change offset calculating formula to offset = alignment - address & (alignment - 1) in PoolArena.DirectArena#offsetCacheLine and add a unit test to assert that.
Result:
Correctly calculate offset.
Motivation:
ByteBuf.retainedSlice() and similar methods produce sliced buffers with
an independent refcount to the buffer that they wrap.
One of the optimizations in 10539f4dc7 was
to use the ref to the unwrapped buffer object for added slices, but this
did not take into account the above special case when later releasing.
Thanks to @rkapsi for discovering this via #8495.
Modifications:
Since a reference to the slice is still kept in the Component class,
just changed Component.freeIfNecessary() to release the slice in
preference to the unwrapped buf.
Also added a unit test which reproduces the bug.
Result:
Fixes#8495
Motivation:
Two similar bugs were introduced by myself in separate recent PRs #8393
and #8464, while optimizing the assignment/handling of temporary arrays
in ByteBufUtil and UnsafeByteBufUtil.
The temp arrays allocated for buffering data written to an OutputStream
are incorrectly sized to the full length of the data to copy rather than
being capped at WRITE_CHUNK_SIZE.
Unfortunately one of these is in the 4.1.31.Final release, I'm really
sorry and will be more careful in future.
This kind of thing is tricky to cover in unit tests.
Modifications:
Revert the temp array allocations back to their original sizes.
Avoid making duplicate calls to ByteBuf.capacity() in a couple of places
in ByteBufUtil (unrelated thing I noticed, can remove it from this PR if
desired!)
Result:
Temporary byte arrays will be reverted to their originally intended
sizes.
Motivation:
#8388 introduced a reusable ThreadLocal<byte[]> for use in
decodeString(...). It can be used in more places in the buffer package
to avoid temporary allocations of small arrays.
Modifications:
Encapsulate use of the ThreadLocal in a static package-private
ByteBufUtil.threadLocalTempArray(int) method, and make use of it from a
handful of new places including ByteBufUtil.readBytes(...).
Result:
Fewer short-lived small byte array allocations.
Motivation:
CompositeByteBuf is a powerful and versatile abstraction, allowing for
manipulation of large data without copying bytes. There is still a
non-negligible cost to reading/writing however relative to "singular"
ByteBufs, and this can be mostly eliminated with some rework of the
internals.
My use case is message modification/transformation while zero-copy
proxying. For example replacing a string within a large message with one
of a different length
Modifications:
- No longer slice added buffers and unwrap added slices
- Components store target buf offset relative to position in
composite buf
- Less allocations, object footprint, pointer indirection, offset
arithmetic
- Use Component[] rather than ArrayList<Component>
- Avoid pointer indirection and duplicate bounds check, more
efficient backing array growth
- Facilitates optimization when doing bulk-inserts - inserting n
ByteBufs behind m is now O(m + n) instead of O(mn)
- Avoid unnecessary casting and method call indirection via superclass
- Eliminate some duplicate range/ref checks via non-checking versions of
toComponentIndex and findComponent
- Add simple fast-path for toComponentIndex(0); add racy cache of
last-accessed Component to findComponent(int)
- Override forEachByte0(...) and forEachByteDesc0(...) methods
- Make use of RecyclableArrayList in nioBuffers(int, int) (in line with
FasterCompositeByteBuf impl)
- Modify addComponents0(boolean,int,Iterable) to use the Iterable
directly rather than copy to an array first (and possibly to an
ArrayList before that)
- Optimize addComponents0(boolean,int,ByteBuf[],int) to not perform
repeated array insertions and avoid second loop for offset updates
- Simplify other logic in various places, in particular the general
pattern used where a sub-range is iterated over
- Add benchmarks to demonstrate some improvements
While refactoring I also came across a couple of clear bugs. They are
fixed in these changes but I will open another PR with unit tests and
fixes to the current version.
Result:
Much faster creation, manipulation, and access; many fewer allocations
and smaller footprint. Benchmark results to follow.
Motivation:
Unpooled.wrap(byte[]...) and Unpooled.wrap(ByteBuffer...) currently
allocate/copy an intermediate ByteBuf ArrayList and array, which can be
avoided.
Modifications:
- Define new internal ByteWrapper interface and add a CompositeByteBuf
constructor which takes a ByteWrapper with an array of the type that it
wraps, and modify the appropriate Unpooled.wrap(...) methods to take
advantage of it
- Tidy up other constructors in CompositeByteBuf to remove duplication
and misleading len arg (which is really an end offset into provided
array)
Result:
Less allocation/copying when wrapping byte[] and ByteBuffer arrays,
tidier code.
Motivation:
I came across two bugs:
- Components removed due to capacity reduction aren't released
- Offsets aren't set correctly on empty components that are added
between existing components
Modifications:
Add unit tests which expose these bugs, fix them.
Result:
Bugs are fixed
Motivation:
There are currently many more places where this could be used which were
possibly not considered when the method was added.
If https://github.com/netty/netty/pull/8388 is included in its current
form, a number of these places could additionally make use of the same
BYTE_ARRAYS threadlocal.
There's also a couple of adjacent places where an optimistically-pooled
heap buffer is used for temp byte storage which could use the
threadlocal too in preference to allocating a temp heap bytebuf wrapper.
For example
https://github.com/netty/netty/blob/4.1/buffer/src/main/java/io/netty/buffer/ByteBufUtil.java#L1417.
Modifications:
Replace new byte[] with PlatformDependent.allocateUninitializedArray()
where appropriate; make use of ByteBufUtil.getBytes() in some places
which currently perform the equivalent logic, including avoiding copy of
backing array if possible (although would be rare).
Result:
Further potential speed-up with java9+ and appropriate compile flags.
Many of these places could be on latency-sensitive code paths.
* Optimize AbstractByteBuf.getCharSequence() in US_ASCII case
Motivation:
Inspired by https://github.com/netty/netty/pull/8388, I noticed this
simple optimization to avoid char[] allocation (also suggested in a TODO
here).
Modifications:
Return an AsciiString from AbstractByteBuf.getCharSequence() if
requested charset is US_ASCII or ISO_8859_1 (latter thanks to
@Scottmitch's suggestion). Also tweak unit tests not to require Strings
and include a new benchmark to demonstrate the speedup.
Result:
Speed-up of AbstractByteBuf.getCharSequence() in ascii and iso 8859/1
cases
Motivation:
CompositeByteBuf.decompose(...) did not correctly slice the content and so produced an incorrect representation of the data.
Modifications:
- Rewrote implementation to fix bug and also improved it to reduce GC
- Add unit tests.
Result:
Fixes https://github.com/netty/netty/issues/8400.
Motivation:
While looking at the nice optimization done in
https://github.com/netty/netty/pull/8347 I couldn't help noticing the
logic could be simplified further. Apologies if this is just my OCD and
inappropriate!
Modifications:
Reduce amount of code used for ByteBufInputStream.readLine()
Result:
Slightly smaller and simpler code
Motivation:
Avoid creating any StringBuilder instance if
ByteBufInputStream::readLine isn't used
Modifications:
The StringBuilder instance is lazy allocated on demand and
are added new test case branches to address the increased
complexity of ByteBufInputStream::readLine
Result:
Reduced GC activity if ByteBufInputStream::readLine isn't used
Motivation:
We should just directly init the refCnt to 1 and not use the AtomicIntegerFieldUpdater.
Modifications:
Just assing directly to 1.
Result:
Cleaner code and possible a bit faster as the JVM / JIT may be able to optimize the first store easily.
Motiviation:
At the moment whenever ensureAccessible() is called in our ByteBuf implementations (which is basically on each operation) we will do a volatile read. That per-se is not such a bad thing but the problem here is that it will also reduce the the optimizations that the compiler / jit can do. For example as these are volatile it can not eliminate multiple loads of it when inline the methods of ByteBuf which happens quite frequently because most of them a quite small and very hot. That is especially true for all the methods that act on primitives.
It gets even worse as people often call a lot of these after each other in the same method or even use method chaining here.
The idea of the change is basically just ue a non-volatile read for the ensureAccessible() check as its a best-effort implementation to detect acting on already released buffers anyway as even with a volatile read it could happen that the user will release it in another thread before we actual access the buffer after the reference check.
Modifications:
- Try to do a non-volatile read using sun.misc.Unsafe if we can use it.
- Add a benchmark
Result:
Big performance win when multiple ByteBuf methods are called from a method.
With the change:
UnsafeByteBufBenchmark.setGetLongUnsafeByteBuf thrpt 20 281395842,128 ± 5050792,296 ops/s
Before the change:
UnsafeByteBufBenchmark.setGetLongUnsafeByteBuf thrpt 20 217419832,801 ± 5080579,030 ops/s
Motivation:
The JVM isn't always able to hoist out/reduce bounds checking (due to ref counting operations etc etc) hence making it configurable could improve performances for most CPU intensive use cases.
Modifications:
Each AbstractByteBuf bounds check has been tested against a new static final configuration property similar to checkAccessible ie io.netty.buffer.bytebuf.checkBounds.
Result:
Any user could disable ByteBuf bounds checking in order to get extra performances.
* Allow to use native transports when sun.misc.Unsafe is not present on the system
Motivation:
We should be able to use the native transports (epoll / kqueue) even when sun.misc.Unsafe is not present on the system. This is especially important as Java11 will be released soon and does not allow access to it by default.
Modifications:
- Correctly disable usage of sun.misc.Unsafe when -PnoUnsafe is used while running the build
- Correctly increment metric when UnpooledDirectByteBuf is allocated. This was uncovered once -PnoUnsafe usage was fixed.
- Implement fallbacks in all our native transport code for when sun.misc.Unsafe is not present.
Result:
Fixes https://github.com/netty/netty/issues/8229.
In nioBuffer(int,int) in CompositeByteBuf , we create a sub-array of nioBuffers for the components that are in range, then concatenate all the components in range into a single bigger buffer.
However, if the call to nioBuffers() returned only one sub-buffer, then we are copying it to a newly-allocated buffer "merged" for no reason.
Motivation:
Profiler for Spark shows a lot of time spent in put() method inside nioBuffer(), while usually no copy of data is required.
Modification:
This change skips this last step and just returns a duplicate of the single buffer returned by the call to nioBuffers(), which will in most implementation not copy the data
Result:
No copy when the source is only 1 buffer