Motivation:
We did not correctly take the position into account when wrapping a ByteBuffer via ReadOnlyUnsafeDirectByteBuf as we obtained the memory address from the original ByteBuffer and not the slice we take.
Modifications:
- Correctly use the slice to obtain memory address.
- Add test case.
Result:
Fixes [#7565].
Motivation:
There is no guarantee that FastThreadLocal.onRemoval(...) is called if the FastThreadLocal is used by "non" FastThreacLocalThreads. This can lead to all sort of problems, like for example memory leaks as direct memory is not correctly cleaned up etc.
Beside this we use ThreadDeathWatcher to check if we need to release buffers back to the pool when thread local caches are collected. In the past ThreadDeathWatcher was used which will need to "wakeup" every second to check if the registered Threads are still alive. If we can ensure FastThreadLocal.onRemoval(...) is called we do not need this anymore.
Modifications:
- Introduce ObjectCleaner and use it to ensure FastThreadLocal.onRemoval(...) is always called when a Thread is collected.
- Deprecate ThreadDeathWatcher
- Add unit tests.
Result:
Consistent way of cleanup FastThreadLocals when a Thread is collected.
Motivation:
We used subList in CompositeByteBuf to remove ranges of elements from the internal storage. Beside this we also used an foreach loop in a few cases which will crate an Iterator.
Modifications:
- Use our own sub-class of ArrayList which exposes removeRange(...). This allows to remove a range of elements without an extra allocation.
- Use an old style for loop to iterate over the elements to reduce object allocations.
Result:
Less allocations.
Automatic-Module-Name entry provides a stable JDK9 module name, when Netty is used in a modular JDK9 applications. More info: http://blog.joda.org/2017/05/java-se-9-jpms-automatic-modules.html
When Netty migrates to JDK9 in the future, the entry can be replaced by actual module-info descriptor.
Modification:
The POM-s are configured to put the correct module names to the manifest.
Result:
Fixes#7218.
Motivation:
We dont need to use the ThreadDeathWatcher if we use a FastThreadLocalThread for which we wrap the Runnable and ensure we call FastThreadLocal.removeAll() once the Runnable completes.
Modifications:
- Dont use a ThreadDeathWatcher if we are sure we will call FastThreadLocal.removeAll()
- Add unit test.
Result:
Less overhead / running theads if you only allocate / deallocate from FastThreadLocalThreads.
Motivation:
AbstractByteBuf#readSlice relied upon the bounds checking of the slice operation in order to detect index out of bounds conditions. However the slice bounds checking operation allows for the slice to go beyond the writer index, and this is out of bounds for a read operation.
Modifications:
- AbstractByteBuf#readSlice and AbstractByteBuf#readRetainedSlice should ensure the desired amount of bytes are readable before taking a slice
Result:
No reading of undefined data in AbstractByteBuf#readSlice and AbstractByteBuf#readRetainedSlice.
Motivation:
When calling CompositeBytebuf.copy() and copy(...) we currently use Unpooled to allocate the buffer. This is not really correct and may produce more GC then needed. We should use the allocator that was used when creating the CompositeByteBuf to allocate the new buffer which may be for example the PooledByteBufAllocator.
Modifications:
- Use alloc() to allocate the new buffer.
- Add tests
- Fix tests that depend on the copy to be backed by an byte-array without checking hasArray() first.
Result:
Fixes [#7393].
Motivation:
Even if it's a super micro-optimization (most JVM could optimize such
cases in runtime), in theory (and according to some perf tests) it
may help a bit. It also makes a code more clear and allows you to
access such methods in the test scope directly, without instance of
the class.
Modifications:
Add 'static' modifier for all methods, where it possible. Mostly in
test scope.
Result:
Cleaner code with proper 'static' modifiers.
Motivation:
Javadoc of the `ByteBufUtil#copy(AsciiString, int, ByteBuf, int, int)` is incorrect.
Modifications:
Fix it.
Result:
The description of the `#copy` method is not misleading.
Motivation:
In the `ByteBufOutputStream` we can use an appropriate methods of `ByteBuf`
to reduce calls of virtual methods and do not copying converting logic.
Modifications:
- Use an appropriate methods of `ByteBuf`
- Remove redundant conversions (int -> byte, int -> char).
- Use `ByteBuf#writeCharSequence` in the `writeBytes(String)'.
Result:
Less code duplication. A `writeBytes(String)` method is faster.
No unnecessary conversions. More consistent and cleaner code.
Configuring this is tough because there is split between highly shared (and accessed) objects and lightly accessed objects.
Modification:
There are a number of changes here. In relative order of importance:
API / Functionality changes:
* Max records and max sample records are gone. Only "target" records, the number of records tries to retain is exposed.
* Records are sampled based on the number of already stored records. The likelihood of recording a new sample is `2^(-n)`, where `n` is the number of currently stored elements.
* Records are stored in a concurrent stack structure rather than a list. This avoids a head and tail. Since the stack is only read once, there is no need to maintain head and tail pointers
* The properties of this imply that the very first and very last access are always recorded. When deciding to sample, the top element is replaced rather than pushed.
* Samples that happen between the first and last accesses now have a chance of being recorded. Previously only the final few were kept.
* Sampling is no longer deterministic. Previously, a deterministic access pattern meant that you could conceivably always miss some access points.
* Sampling has a linear ramp for low values and and exponentially backs off roughly equal to 2^n. This means that for 1,000,000 accesses, about 20 will actually be kept. I have an elegant proof for this which is too large to fit in this commit message.
Code changes:
* All locks are gone. Because sampling rarely needs to do a write, there is almost 0 contention. The dropped records counter is slightly contentious, but this could be removed or changed to a LongAdder. This was not done because of memory concerns.
* Stack trace exclusion is done outside of RLD. Classes can opt to remove some of their methods.
* Stack trace exclusion is faster, since it uses String.equals, often getting a pointer compare due to interning. Previously it used contains()
* Leak printing is outputted fairly differently. I tried to preserve as much of the original formatting as possible, but some things didn't make sense to keep.
Result:
More useful leak reporting.
Faster:
```
Before:
Benchmark (recordTimes) Mode Cnt Score Error Units
ResourceLeakDetectorRecordBenchmark.record 8 thrpt 20 136293.404 ± 7669.454 ops/s
ResourceLeakDetectorRecordBenchmark.record 16 thrpt 20 72805.720 ± 3710.864 ops/s
ResourceLeakDetectorRecordBenchmark.recordWithHint 8 thrpt 20 139131.215 ± 4882.751 ops/s
ResourceLeakDetectorRecordBenchmark.recordWithHint 16 thrpt 20 74146.313 ± 4999.246 ops/s
After:
Benchmark (recordTimes) Mode Cnt Score Error Units
ResourceLeakDetectorRecordBenchmark.record 8 thrpt 20 155281.969 ± 5301.399 ops/s
ResourceLeakDetectorRecordBenchmark.record 16 thrpt 20 77866.239 ± 3821.054 ops/s
ResourceLeakDetectorRecordBenchmark.recordWithHint 8 thrpt 20 153360.036 ± 8611.353 ops/s
ResourceLeakDetectorRecordBenchmark.recordWithHint 16 thrpt 20 78670.804 ± 2399.149 ops/s
```
Motivation:
Highly retained and released objects have contention on their ref
count. Currently, the ref count is updated using compareAndSet
with care to make sure the count doesn't overflow, double free, or
revive the object.
Profiling has shown that a non trivial (~1%) of CPU time on gRPC
latency benchmarks is from the ref count updating.
Modification:
Rather than pessimistically assuming the ref count will be invalid,
optimistically update it assuming it will be. If the update was
wrong, then use the slow path to revert the change and throw an
execption. Most of the time, the ref counts are correct.
This changes from using compareAndSet to getAndAdd, which emits a
different CPU instruction on x86 (CMPXCHG to XADD). Because the
CPU knows it will modifiy the memory, it can avoid contention.
On a highly contended machine, this can be about 2x faster.
There is a downside to the new approach. The ref counters can
temporarily enter invalid states if over retained or over released.
The code does handle these overflow and underflow scenarios, but it
is possible that another concurrent access may push the failure to
a different location. For example:
Time 1 Thread 1: obj.retain(INT_MAX - 1)
Time 2 Thread 1: obj.retain(2)
Time 2 Thread 2: obj.retain(1)
Previously Thread 2 would always succeed and Thread 1 would always
fail on the second access. Now, thread 2 could fail while thread 1
is rolling back its change.
====
There are a few reasons why I think this is okay:
1. Buggy code is going to have bugs. An exception _is_ going to be
thrown. This just causes the other threads to notice the state
is messed up and stop early.
2. If high retention counts are a use case, then ref count should
be a long rather than an int.
3. The critical section is greatly reduced compared to the previous
version, so the likelihood of this happening is lower
4. On error, the code always rollsback the change atomically, so
there is no possibility of corruption.
Result:
Faster refcounting
```
BEFORE:
Benchmark (delay) Mode Cnt Score Error Units
AbstractReferenceCountedByteBufBenchmark.retainRelease_contended 1 sample 2901361 804.579 ± 1.835 ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_contended 10 sample 3038729 785.376 ± 16.471 ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_contended 100 sample 2899401 817.392 ± 6.668 ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_contended 1000 sample 3650566 2077.700 ± 0.600 ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_contended 10000 sample 3005467 19949.334 ± 4.243 ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_uncontended 1 sample 456091 48.610 ± 1.162 ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_uncontended 10 sample 732051 62.599 ± 0.815 ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_uncontended 100 sample 778925 228.629 ± 1.205 ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_uncontended 1000 sample 633682 2002.987 ± 2.856 ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_uncontended 10000 sample 506442 19735.345 ± 12.312 ns/op
AFTER:
Benchmark (delay) Mode Cnt Score Error Units
AbstractReferenceCountedByteBufBenchmark.retainRelease_contended 1 sample 3761980 383.436 ± 1.315 ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_contended 10 sample 3667304 474.429 ± 1.101 ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_contended 100 sample 3039374 479.267 ± 0.435 ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_contended 1000 sample 3709210 2044.603 ± 0.989 ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_contended 10000 sample 3011591 19904.227 ± 18.025 ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_uncontended 1 sample 494975 52.269 ± 8.345 ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_uncontended 10 sample 771094 62.290 ± 0.795 ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_uncontended 100 sample 763230 235.044 ± 1.552 ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_uncontended 1000 sample 634037 2006.578 ± 3.574 ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_uncontended 10000 sample 506284 19742.605 ± 13.729 ns/op
```
Motivation:
Most, but not all defaults are statically exposed on
PooledByteBufAllocator. This makes it cumbersome to make a custom
allocator where most of the defaults remain the same.
Modification:
Expose useCacheForAllThreads, and Direct preferred. The latter is
needed because it is under the internal package, and public code
should probably not depend on it.
Result:
More customizeable allocators
Motivation:
The constrcutors a protected atm but the classes are public. We should make the constructors public as well to make it easier to write your own ByteBufAllocator.
Modifications:
Change constructors to be public and add some javadocs.
Result:
Easier to create own ByteBufAllocator.
Motivation:
`useCacheForAllThreads` may be false which disables memory caching
on non netty threads. Setting this argument or the system property
makes it impossible to use `PooledByteBufAllocator`.
Modifications:
Delayed the check of `freeSweepAllocationThreshold` in
`PoolThreadCache` to after it knows there will be any caches in
use. Additionally, check if the caches will have any data in them
(rather than allocating a 0-length array).
A test case is also added that fails without this change.
Results:
Fixes#7194
Motivation:
When the user want to have the direct memory explicitly managed by the GC (just as java.nio does) it is useful to be able to construct an UnpooledByteBufAllocator that allows this without the chances to see any memory leak.
Modifications:
Allow to explicitly disable the usage of reflection to construct direct ByteBufs and so be sure these will be collected by GC.
Result:
More flexible way to use the UnpooledByteBufAllocator.
Motivation:
The documentation for field updates says:
> Note that the guarantees of the {@code compareAndSet}
> method in this class are weaker than in other atomic classes.
> Because this class cannot ensure that all uses of the field
> are appropriate for purposes of atomic access, it can
> guarantee atomicity only with respect to other invocations of
> {@code compareAndSet} and {@code set} on the same updater.
This implies that volatiles shouldn't use normal assignment; the
updater should set them.
Modifications:
Use setter for field updaters that make use of compareAndSet.
Result:
Concurrency compliant code
Motivation:
In ReadOnlyByteBufferBuf.copy(...) we just allocated a ByteBuffer directly and wrapped it. This way it was not possible for us to free the direct memory that was used by the copy without the GC.
Modifications:
- Ensure we use the allocator when create the copy and so be able to release direct memory in a timely manner
- Add unit test
- Depending on if the to be copied buffer is direct or heap based we also allocate the same type on copy.
Result:
Fixes [#7103].
Motivation:
`ByteBuf` does not have the little endian variant of float/double access methods.
Modifications:
Add support for little endian floats and doubles into `ByteBuf`.
Result:
`ByteBuf` has get/read/set/writeFloatLE() and get/read/set/writeDoubleLE() methods. Fixes [#6576].
Motivation:
Missing return in ByteBufUtil#writeAscii causes endless loop
Modifications:
Add return after write finished
Result:
ByteBufUtil#writeAscii is ok
Motivation:
ByteBuf#ensureWritable(int,boolean) returns an int indicating the status of the resize operation. For buffers that are unmodifiable or cannot be resized this method shouldn't throw but just return 1.
ByteBuf#ensureWriteable(int) should throw unmodifiable buffers.
Modifications:
- ReadOnlyByteBuf should be updated as described above.
- Add a unit test to SslHandler which verifies the read only buffer can be tolerated in the aggregation algorithm.
Result:
Fixes https://github.com/netty/netty/issues/7002.
Motivation:
We need to ensure we not allow calling set/writeCharsequence on an released ByteBuf.
Modifications:
Add test-cases
Result:
Proves fix of [#6951].
Motivation:
AbstractByteBuf.setCharSequence(...) must not expand the buffer if not enough writable space is present in the buffer to be consistent with all the other set operations.
Modifications:
- Ensure we only exand the buffer on writeCharSequence(...) but not on setCharSequence(...)
- Add unit tests.
Result:
Consistent and correct behavior.
Motivation:
AbstractByteBuf.ensureWritable(...) should check if buffer was released and if so throw an IllegalReferenceCountException
Modifications:
Ensure we throw in all cases.
Result:
More consistent and correct behaviour
Motivation:
It would be easier to find where is missing release call in several retain release calls on a ByteBuf
Modifications:
Remove final modifier on SimpleLeakAwareByteBuf and SimpleLeakAwareByteBuf release function and override it to record release in AdvancedLeakAwareByteBuf and AdvancedLeakAwareCompositeByteBuf
Result:
Release will be recorded when enable detailed leak detection
Motivation:
Each call to SSL_write may introduce about ~100 bytes of overhead. The OpenSslEngine (based upon OpenSSL) is not able to do gathering writes so this means each wrap operation will incur the ~100 byte overhead. This commit attempts to increase goodput by aggregating the plaintext in chunks of <a href="https://tools.ietf.org/html/rfc5246#section-6.2">2^14</a>. If many small chunks are written this can increase goodput, decrease the amount of calls to SSL_write, and decrease overall encryption operations.
Modifications:
- Introduce SslHandlerCoalescingBufferQueue in SslHandler which will aggregate up to 2^14 chunks of plaintext by default
- Introduce SslHandler#setWrapDataSize to control how much data should be aggregated for each write. Aggregation can be disabled by setting this value to <= 0.
Result:
Better goodput when using SslHandler and the OpenSslEngine.
Motivation:
1. Some encoders used a `ByteBuf#writeBytes` to write short constant byte array (2-3 bytes). This can be replaced with more faster `ByteBuf#writeShort` or `ByteBuf#writeMedium` which do not access the memory.
2. Two chained calls of the `ByteBuf#setByte` with constants can be replaced with one `ByteBuf#setShort` to reduce index checks.
3. The signature of method `HttpHeadersEncoder#encoderHeader` has an unnecessary `throws`.
Modifications:
1. Use `ByteBuf#writeShort` or `ByteBuf#writeMedium` instead of `ByteBuf#writeBytes` for the constants.
2. Use `ByteBuf#setShort` instead of chained call of the `ByteBuf#setByte` with constants.
3. Remove an unnecessary `throws` from `HttpHeadersEncoder#encoderHeader`.
Result:
A bit faster writes constants into buffers.
Motivation:
We should also use realloc when shrink the buffer to eliminate extra allocations / memory copies when possible.
Modifications:
Use realloc for expanding and shrinking when possible.
Result:
Less memory copies and allocations
Motivation:
Methods `ByteBufUtil#writeUtf8` and `ByteBufUtil#writeAscii` contains a check `ByteBuf#ensureWritable` before the calling `ByteBuf#writeBytes`. But the `ByteBuf#writeBytes` also do a such check inside.
Modifications:
Make checks more targeted.
Result:
Less redundant method calls.
Motivation:
1. `ByteBuf` contains methods to writing `CharSequence` which optimized for UTF-8 and ASCII encodings. We can also apply optimization for ISO-8859-1.
2. In many places appropriate methods are not used.
Modifications:
1. Apply optimization for ISO-8859-1 encoding in the `ByteBuf#setCharSequence` realizations.
2. Apply appropriate methods for writing `CharSequences` into buffers.
Result:
Reduce overhead from string-to-bytes conversion.
Motivation:
PR #6811 introduced a public utility methods to decode hex dump and its parts, but they are not visible from netty-common.
Modifications:
1. Move the `decodeHexByte`, `decodeHexDump` and `decodeHexNibble` methods into `StringUtils`.
2. Apply these methods where applicable.
3. Remove similar methods from other locations (e.g. `HpackHex` test class).
Result:
Less code duplication.
Motivation:
We should allow to access the memoryAddress of the wrapped ByteBuf when using ReadOnlyByteBuf for peformance reasons. If a user act on a memoryAddress its his responsible anyway to do nothing "stupid".
Modifications:
Delegate to wrapped ByteBuf.
Result:
Less performance overhead for various operations and also when writing to a native transport (which needs the memoryAddress).
Motivations:
1. There are duplicated implementations of decoding hex strings. #6797
2. ByteBufUtil.HexUtil.decodeHexDump does not handle substring start
index properly and does not decode hex byte rigorously.
Modifications:
1. Function decodeHexByte is moved from QueryStringDecoder into ByteBufUtil.
2. ByteBufUtil.HexUtil.decodeHexDump is changed to use decodeHexByte.
3. Tests are Updated accordingly.
Result:
Fixed#6797 and made hex decoding functions more robust.