Motivation:
https://github.com/netty/netty/pull/10267 introduced a change that reduced the fragmentation. Unfortunally it also introduced a regression when it comes to caching of normal allocations. This can have a negative performance impact depending on the allocation sizes.
Modifications:
- Fix algorithm to calculate the array size for normal allocation caches
- Correctly calculate indeox for normal caches
- Add unit test
Result:
Fixes https://github.com/netty/netty/issues/10805
Fix issue #10508 where PARANOID mode slow down about 1000 times compared to ADVANCED.
Also fix a rare issue when internal buffer was growing over a limit, it was partially discarded
using `discardReadBytes()` which causes bad changes within previously discovered HttpData.
Reasons were:
Too many `readByte()` method calls while other ways exist (such as keep in memory the last scan position when trying to find a delimiter or using `bytesBefore(firstByte)` instead of looping externally).
Changes done:
- major change on way buffer are parsed: instead of read byte per byte until found delimiter, try to find the delimiter using `bytesBefore()` and keep the last unfound position to skeep already parsed parts (algorithms are the same but implementation of scan are different)
- Change the condition to discard read bytes when refCnt is at most 1.
Observations using Async-Profiler:
==================================
1) Without optimizations, most of the time (more than 95%) is through `readByte()` method within `loadDataMultipartStandard` method.
2) With using `bytesBefore(byte)` instead of `readByte()` to find various delimiter, the `loadDataMultipartStandard` method is going down to 19 to 33% depending on the test used. the `readByte()` method or equivalent `getByte(pos)` method are going down to 15% (from 95%).
Times are confirming those profiling:
- With optimizations, in SIMPLE mode about 82% better, in ADVANCED mode about 79% better and in PARANOID mode about 99% better (most of the duplicate read accesses are removed or make internally through `bytesBefore(byte)` method)
A benchmark is added to show the behavior of the various cases (one big item, such as File upload, and many items) and various level of detection (Disabled, Simple, Advanced, Paranoid). This benchmark is intend to alert if new implementations make too many differences (such as the previous version where about PARANOID gives about 1000 times slower than other levels, while it is now about at most 10 times).
Extract of Benchmark run:
=========================
Run complete. Total time: 00:13:27
Benchmark Mode Cnt Score Error Units
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderBigAdvancedLevel thrpt 6 2,248 ± 0,198 ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderBigDisabledLevel thrpt 6 2,067 ± 1,219 ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderBigParanoidLevel thrpt 6 1,109 ± 0,038 ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderBigSimpleLevel thrpt 6 2,326 ± 0,314 ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderHighAdvancedLevel thrpt 6 1,444 ± 0,226 ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderHighDisabledLevel thrpt 6 1,462 ± 0,642 ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderHighParanoidLevel thrpt 6 0,159 ± 0,003 ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderHighSimpleLevel thrpt 6 1,522 ± 0,049 ops/ms
Motivation:
HTTP is a plaintext protocol which means that someone may be able
to eavesdrop the data. To prevent this, HTTPS should be used whenever
possible. However, maintaining using https:// in all URLs may be
difficult. The nohttp tool can help here. The tool scans all the files
in a repository and reports where http:// is used.
Modifications:
- Added nohttp (via checkstyle) into the build process.
- Suppressed findings for the websites
that don't support HTTPS or that are not reachable
Result:
- Prevent using HTTP in the future.
- Encourage users to use HTTPS when they follow the links they found in
the code.
Motivation:
DefaultAttributeMap::attr has a blocking behaviour on lookup of an existing attribute:
it can be made non-blocking.
Modification:
Replace the existing fixed bucket table using a locked intrusive linked list
with an hand-rolled copy-on-write ordered single array
Result:
Non blocking behaviour for the lookup happy path
Motivation:
writeUtf8 can suffer from inlining issues and/or megamorphic call-sites on the hot path due to ByteBuf hierarchy
Modifications:
Duplicate and specialize the code paths to reduce the need of polymorphic calls
Result:
Performance are more stable in user code
Reduce garbage on MQTT encoding
Motivation:
MQTT encoding and decoding is doing unnecessary object allocation in a number of places:
- MqttEncoder create many byte[] to encode Strings into UTF-8 bytes
- MqttProperties uses Integer keys instead of int
- Some enums valueOf create unnecessary arrays on the hot paths
- MqttDecoder was using unecessary Result<T>
Modification:
- ByteBufUtil::utf8Bytes and ByteBufUtil::reserveAndWriteUtf8 allows to perform the same operation GC-free
- MqttProperties uses a primitive key map
- Implemented GC free const table lookup/switch valueOf
- Use some bit-tricks to pack 2 ints into a single primitive long to store both result and numberOfBytesConsumed and use byte[].length to compute numberOfByteConsumed on fly. These changes allowed to save creating Result<T>.
Result:
Significantly less garbage produced in MQTT encoding/decoding
Motivation:
We have found out that ByteBufUtil.indexOf can be inefficient for substring search on
ByteBuf, both in terms of algorithm complexity (worst case O(needle.readableBytes *
haystack.readableBytes)), and in constant factor (esp. on Composite buffers).
With implementation of more performant search algorithms we have seen improvements on
the order of magnitude.
Modifications:
This change introduces three search algorithms:
1. Knuth Morris Pratt - classical textbook algorithm, a good default choice.
2. Bit mask based algorithm - stable performance on any input, but limited to maximum
search substring (the needle) length of 64 bytes.
3. Aho–Corasick - worse performance and higher memory consumption than [1] and [2], but
it supports multiple substring (the needles) search simultaneously, by inspecting every
byte of the haystack only once.
Each algorithm processes every byte of underlying buffer only once, they are implemented
as ByteProcessor.
Result:
Efficient search algorithms with linear time complexity available in Netty (I will share
benchmark results in a comment on a PR).
Motivation:
PoolChunk.usage() method has non-trivial computations. It is used currently in hot path methods invoked when an allocation and de-allocation are happened.
The idea is to replace usage() output comparison against percent thresholds by Chunk.freeBytes plain comparison against absolute thresholds. In such way the majority of computations from the threshold conditions are moved to init logic.
Modifications:
Replace PoolChunk.usage() conditions in PoolChunkList with equivalent conditions for PoolChunk.freeBytes()
Result:
Improve performance of allocation and de-allocation of ByteBuf from normal size cache pool
Motivation:
decodeHexNibble can be a lot faster using a lookup table
Modifications:
decodeHexNibble is made faster by using a lookup table
Result:
decodeHexNibble is faster
Motivation:
Currently, characters are appended to the encoded string char-by-char even when no encoding is needed. We can instead separate out codepath that appends the entire string in one go for better `StringBuilder` allocation performance.
Modification:
Only go into char-by-char loop when finding a character that requires encoding.
Result:
The results aren't so clear with noise on my hot laptop - the biggest impact is on long strings, both to reduce resizes of the buffer and also to reduce complexity of the loop. I don't think there's a significant downside though for the cases that hit the slow path.
After
```
Benchmark Mode Cnt Score Error Units
QueryStringEncoderBenchmark.longAscii thrpt 6 1.406 ± 0.069 ops/us
QueryStringEncoderBenchmark.longAsciiFirst thrpt 6 0.046 ± 0.001 ops/us
QueryStringEncoderBenchmark.longUtf8 thrpt 6 0.046 ± 0.001 ops/us
QueryStringEncoderBenchmark.shortAscii thrpt 6 15.781 ± 0.949 ops/us
QueryStringEncoderBenchmark.shortAsciiFirst thrpt 6 3.171 ± 0.232 ops/us
QueryStringEncoderBenchmark.shortUtf8 thrpt 6 3.900 ± 0.667 ops/us
```
Before
```
Benchmark Mode Cnt Score Error Units
QueryStringEncoderBenchmark.longAscii thrpt 6 0.444 ± 0.072 ops/us
QueryStringEncoderBenchmark.longAsciiFirst thrpt 6 0.043 ± 0.002 ops/us
QueryStringEncoderBenchmark.longUtf8 thrpt 6 0.047 ± 0.001 ops/us
QueryStringEncoderBenchmark.shortAscii thrpt 6 16.503 ± 1.015 ops/us
QueryStringEncoderBenchmark.shortAsciiFirst thrpt 6 3.316 ± 0.154 ops/us
QueryStringEncoderBenchmark.shortUtf8 thrpt 6 3.776 ± 0.956 ops/us
```
Motivation:
In Java, it is almost always at least slower to use `ByteBuffer` than `byte[]` without pooling or I/O. `QueryStringDecoder` can use `byte[]` with arguably simpler code.
Modification:
Replace `ByteBuffer` / `CharsetDecoder` with `byte[]` and `new String`
Result:
After
```
Benchmark Mode Cnt Score Error Units
QueryStringDecoderBenchmark.noDecoding thrpt 6 5.612 ± 2.639 ops/us
QueryStringDecoderBenchmark.onlyDecoding thrpt 6 1.393 ± 0.067 ops/us
QueryStringDecoderBenchmark.mixedDecoding thrpt 6 1.223 ± 0.048 ops/us
```
Before
```
Benchmark Mode Cnt Score Error Units
QueryStringDecoderBenchmark.noDecoding thrpt 6 6.123 ± 0.250 ops/us
QueryStringDecoderBenchmark.onlyDecoding thrpt 6 0.922 ± 0.159 ops/us
QueryStringDecoderBenchmark.mixedDecoding thrpt 6 1.032 ± 0.178 ops/us
```
I notice #6781 switched from an array to `ByteBuffer` but I can't find any motivation for that in the PR. Unit tests pass fine with an array and we get a reasonable speed bump.
Motivation
Currently when future tasks are scheduled via EventExecutors from a
different thread, at least two allocations are performed - the
ScheduledFutureTask wrapping the to-be-run task, and a Runnable wrapping
the action to add to the scheduled task priority queue. The latter can
be avoided by incorporating this logic into the former.
Modification
- When scheduling or cancelling a future task from outside the event
loop, enqueue the task itself rather than wrapping in a Runnable
- Have ScheduledFutureTask#run first verify the task's deadline has
passed and if not add or remove it from the scheduledTaskQueue depending
on its cancellation state
- Add new outside-event-loop benchmarks to ScheduleFutureTaskBenchmark
Result
Fewer allocations when scheduling/cancelling future tasks
Motivation
Currently a static AtomicLong is used to allocate a unique id whenever a
task is scheduled to any event loop. This could be a source of
contention if delayed tasks are scheduled at a high frequency and can be
easily avoided by having a non-volatile id counter per queue.
Modifications
- Replace static AtomicLong ScheduledFutureTask#nextTaskId with a long
field in AbstractScheduledExecutorService
- Set ScheduledFutureTask#id based on this when adding the task to the
queue (in event loop) instead of at construction time
- Add simple benchmark
Result
Less contention / cache-miss possibility when scheduling future tasks
Before:
Benchmark (num) Mode Cnt Score Error Units
scheduleLots 100000 thrpt 20 346.008 ± 21.931 ops/s
Benchmark (num) Mode Cnt Score Error Units
scheduleLots 100000 thrpt 20 654.824 ± 22.064 ops/s
Motivation:
The previous used maxHeaderListSize was too low which resulted in exceptions during the benchmark run:
```
io.netty.handler.codec.http2.Http2Exception: Header size exceeded max allowed size (8192)
at io.netty.handler.codec.http2.Http2Exception.connectionError(Http2Exception.java:103)
at io.netty.handler.codec.http2.Http2Exception.headerListSizeError(Http2Exception.java:188)
at io.netty.handler.codec.http2.Http2CodecUtil.headerListSizeExceeded(Http2CodecUtil.java:231)
at io.netty.handler.codec.http2.HpackDecoder$Http2HeadersSink.finish(HpackDecoder.java:545)
at io.netty.handler.codec.http2.HpackDecoder.decode(HpackDecoder.java:132)
at io.netty.handler.codec.http2.HpackDecoderBenchmark.decode(HpackDecoderBenchmark.java:85)
at io.netty.handler.codec.http2.generated.HpackDecoderBenchmark_decode_jmhTest.decode_thrpt_jmhStub(HpackDecoderBenchmark_decode_jmhTest.java:120)
at io.netty.handler.codec.http2.generated.HpackDecoderBenchmark_decode_jmhTest.decode_Throughput(HpackDecoderBenchmark_decode_jmhTest.java:83)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.openjdk.jmh.runner.BenchmarkHandler$BenchmarkTask.call(BenchmarkHandler.java:453)
at org.openjdk.jmh.runner.BenchmarkHandler$BenchmarkTask.call(BenchmarkHandler.java:437)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
```
Also we should ensure we only use ascii for header names.
Modifications:
Just use Integer.MAX_VALUE as limit
Result:
Be able to run benchmark without exceptions
Motivation:
Some methods that either override others or are implemented as part of implementation an interface did miss the `@Override` annotation
Modifications:
Add missing `@Override`s
Result:
Code cleanup
Motivation:
SpotJMHBugs reports that accumulating a value as a way of eliding dead code
elimination may be inadvisable, as discussed in
`JMHSample_34_SafeLooping::measureWrong_2`. Change the test so that it consumes
the response with `Blackhole::consume` instead.
Modifications:
- Replace addition of results with explicit `blackhole.consume()` call
Result:
Tests work as before, but with different benchmark numbers.
Motivation:
Some JMH benchmarks need additional explanations to motivate
specific code choices.
Modifications:
Introduced comment to explai why calling BlackHole::consume
in a loop is not always the right choice for some benchmark.
Result:
The relevant method shows a comment that warn about changing
the code to introduce BlackHole::consume in the loop.
Motivation:
Resolve the issue highlighted by SpotJMHBugs that the creation of the RecyclableArrayList may be elided by the JIT since the result isn't consumed or returned.
Modifications:
Return the result of `list.recycle()` so that the list isn't elided.
Result:
The JMH benchmark shows a change in performance indicating that the prior results of this may be unsound.
Motivation:
The wakeup logic in EpollEventLoop is overly complex
Modification:
* Simplify the race to wakeup the loop
* Dont let the event loop wake up itself (it's already awake!)
* Make event loop check if there are any more tasks after preparing to
sleep. There is small window where the non-eventloop writers can issue
eventfd writes here, but that is okay.
Result:
Cleaner wakeup logic.
Benchmarks:
```
BEFORE
Benchmark Mode Cnt Score Error Units
EpollSocketChannelBenchmark.executeMulti thrpt 20 408381.411 ± 2857.498 ops/s
EpollSocketChannelBenchmark.executeSingle thrpt 20 157022.360 ± 1240.573 ops/s
EpollSocketChannelBenchmark.pingPong thrpt 20 60571.704 ± 331.125 ops/s
Benchmark Mode Cnt Score Error Units
EpollSocketChannelBenchmark.executeMulti thrpt 20 440546.953 ± 1652.823 ops/s
EpollSocketChannelBenchmark.executeSingle thrpt 20 168114.751 ± 1176.609 ops/s
EpollSocketChannelBenchmark.pingPong thrpt 20 61231.878 ± 520.108 ops/s
```
Motivation
Pipeline handlers are free to "take control" of input buffers if they have singular refcount - in particular to mutate their raw data if non-readonly via discarding of read bytes, etc.
However there are various places (primarily unit tests) where a wrapped byte-array buffer is passed in and the wrapped array is assumed not to change (used after the wrapped buffer is passed to EmbeddedChannel.writeInbound()). This invalid assumption could result in unexpected errors, such as those exposed by #8931.
Modifications
Anywhere that the data passed to writeInbound() might be used again, ensure that either:
- A copy is used rather than wrapping a shared byte array, or
- The buffer is otherwise protected from modification by making it read-only
For the tests, copying is preferred since it still allows the "mutating" optimizations to be exercised.
Results
Avoid possible errors when pipeline assumes it has full control of input buffer.
Motivation:
Results are just wrong for small delays.
Modifications:
Switching to AvarageTime avoid to rely on OS nanoTime granularity.
Result:
Uncontended low delay results are not reliable
Motivation:
Invoking ChannelHandlers is not free and can result in some overhead when the ChannelPipeline becomes very long. This is especially true if most handlers will just forward the call to the next handler in the pipeline. When the user extends Channel*HandlerAdapter we can easily detect if can just skip the handler and invoke the next handler in the pipeline directly. This reduce the overhead of dispatch but also reduce the call-stack in many cases.
This backports https://github.com/netty/netty/pull/8723 and https://github.com/netty/netty/pull/8987 to 4.1
Modifications:
Detect if we can skip the handler when walking the pipeline.
Result:
Reduce overhead for long pipelines.
Benchmark (extraHandlers) Mode Cnt Score Error Units
DefaultChannelPipelineBenchmark.propagateEventOld 4 thrpt 10 267313.031 ± 9131.140 ops/s
DefaultChannelPipelineBenchmark.propagateEvent 4 thrpt 10 824825.673 ± 12727.594 ops/s
Motivation:
Just was looking through code and found 1 interesting place DateFormatter.tryParseMonth that was not very effective, so I decided to optimize it a bit.
Modification:
Changed DateFormatter.tryParseMonth method. Instead of invocation regionMatch() for every month - compare chars one by one.
Result:
DateFormatter.parseHttpDate method performance improved from ~3% to ~15%.
Benchmark (DATE_STRING) Mode Cnt Score Error Units
DateFormatter2Benchmark.parseHttpHeaderDateFormatter Sun, 27 Jan 2016 19:18:46 GMT thrpt 6 4142781.221 ± 82155.002 ops/s
DateFormatter2Benchmark.parseHttpHeaderDateFormatter Sun, 27 Dec 2016 19:18:46 GMT thrpt 6 3781810.558 ± 38679.061 ops/s
DateFormatter2Benchmark.parseHttpHeaderDateFormatterNew Sun, 27 Jan 2016 19:18:46 GMT thrpt 6 4372569.705 ± 30257.537 ops/s
DateFormatter2Benchmark.parseHttpHeaderDateFormatterNew Sun, 27 Dec 2016 19:18:46 GMT thrpt 6 4339785.100 ± 57542.660 ops/s
Motivation:
We need to update to a new checkstyle plugin to allow the usage of lambdas.
Modifications:
- Update to new plugin version.
- Fix checkstyle problems.
Result:
Be able to use checkstyle plugin which supports new Java syntax.
Motivation:
Netty executors doesn't have yet any means to compare with each others
nor to compare with the j.u.c. executors
Modifications:
A new benchmark measuring execute burst cost is being added
Result:
It's now possible to compare some of Netty executors with each others
and with the j.u.c. executors
Motivation:
CompositeByteBuf is a powerful and versatile abstraction, allowing for
manipulation of large data without copying bytes. There is still a
non-negligible cost to reading/writing however relative to "singular"
ByteBufs, and this can be mostly eliminated with some rework of the
internals.
My use case is message modification/transformation while zero-copy
proxying. For example replacing a string within a large message with one
of a different length
Modifications:
- No longer slice added buffers and unwrap added slices
- Components store target buf offset relative to position in
composite buf
- Less allocations, object footprint, pointer indirection, offset
arithmetic
- Use Component[] rather than ArrayList<Component>
- Avoid pointer indirection and duplicate bounds check, more
efficient backing array growth
- Facilitates optimization when doing bulk-inserts - inserting n
ByteBufs behind m is now O(m + n) instead of O(mn)
- Avoid unnecessary casting and method call indirection via superclass
- Eliminate some duplicate range/ref checks via non-checking versions of
toComponentIndex and findComponent
- Add simple fast-path for toComponentIndex(0); add racy cache of
last-accessed Component to findComponent(int)
- Override forEachByte0(...) and forEachByteDesc0(...) methods
- Make use of RecyclableArrayList in nioBuffers(int, int) (in line with
FasterCompositeByteBuf impl)
- Modify addComponents0(boolean,int,Iterable) to use the Iterable
directly rather than copy to an array first (and possibly to an
ArrayList before that)
- Optimize addComponents0(boolean,int,ByteBuf[],int) to not perform
repeated array insertions and avoid second loop for offset updates
- Simplify other logic in various places, in particular the general
pattern used where a sub-range is iterated over
- Add benchmarks to demonstrate some improvements
While refactoring I also came across a couple of clear bugs. They are
fixed in these changes but I will open another PR with unit tests and
fixes to the current version.
Result:
Much faster creation, manipulation, and access; many fewer allocations
and smaller footprint. Benchmark results to follow.
* Optimize AbstractByteBuf.getCharSequence() in US_ASCII case
Motivation:
Inspired by https://github.com/netty/netty/pull/8388, I noticed this
simple optimization to avoid char[] allocation (also suggested in a TODO
here).
Modifications:
Return an AsciiString from AbstractByteBuf.getCharSequence() if
requested charset is US_ASCII or ISO_8859_1 (latter thanks to
@Scottmitch's suggestion). Also tweak unit tests not to require Strings
and include a new benchmark to demonstrate the speedup.
Result:
Speed-up of AbstractByteBuf.getCharSequence() in ascii and iso 8859/1
cases
Motiviation:
At the moment whenever ensureAccessible() is called in our ByteBuf implementations (which is basically on each operation) we will do a volatile read. That per-se is not such a bad thing but the problem here is that it will also reduce the the optimizations that the compiler / jit can do. For example as these are volatile it can not eliminate multiple loads of it when inline the methods of ByteBuf which happens quite frequently because most of them a quite small and very hot. That is especially true for all the methods that act on primitives.
It gets even worse as people often call a lot of these after each other in the same method or even use method chaining here.
The idea of the change is basically just ue a non-volatile read for the ensureAccessible() check as its a best-effort implementation to detect acting on already released buffers anyway as even with a volatile read it could happen that the user will release it in another thread before we actual access the buffer after the reference check.
Modifications:
- Try to do a non-volatile read using sun.misc.Unsafe if we can use it.
- Add a benchmark
Result:
Big performance win when multiple ByteBuf methods are called from a method.
With the change:
UnsafeByteBufBenchmark.setGetLongUnsafeByteBuf thrpt 20 281395842,128 ± 5050792,296 ops/s
Before the change:
UnsafeByteBufBenchmark.setGetLongUnsafeByteBuf thrpt 20 217419832,801 ± 5080579,030 ops/s
Motivation:
Some of the flags we used are not supported anymore on more recent JDK versions. We should just remove all of them and only keep what we really need. This may also reflect better what people use in production.
Modifications:
Remove some flags when running the benchmarks.
Result:
Benchmarks also run with JDK11.
Motivation:
Optimizing the Epoll channel needs an objective measure of how fast
it is.
Modification:
Add a simple, closed loop, ping-pong benchmark.
Result:
Benchmark can be used to measure #7816
Initial numbers:
```
Result "io.netty.microbench.channel.epoll.EpollSocketChannelBenchmark.pingPong":
22614.403 ±(99.9%) 797.263 ops/s [Average]
(min, avg, max) = (21093.160, 22614.403, 24977.387), stdev = 918.130
CI (99.9%): [21817.140, 23411.666] (assumes normal distribution)
Benchmark Mode Cnt Score Error Units
EpollSocketChannelBenchmark.pingPong thrpt 20 22614.403 ± 797.263 ops/s
```
Motivation:
The JVM isn't always able to hoist out/reduce bounds checking (due to ref counting operations etc etc) hence making it configurable could improve performances for most CPU intensive use cases.
Modifications:
Each AbstractByteBuf bounds check has been tested against a new static final configuration property similar to checkAccessible ie io.netty.buffer.bytebuf.checkBounds.
Result:
Any user could disable ByteBuf bounds checking in order to get extra performances.
Motivation:
The usage of Invocation level for JMH fixture methods (setup/teardown) inccurs in a significant overhead
in the benchmark time (see org.openjdk.jmh.annotations.Level documentation).
In the case of CodecInputListBenchmark, benchmarks are far too small (less than 50ns) and the Invocation
level setup offsets the measurement considerably.
On such cases, the recommended fix patch is to include the setup/teardown code in the benchmark method.
Modifications:
Include the setup/teardown code in the relevant benchmark methods.
Remove the setup/teardown methods from the benchmark class.
Result:
We run the entire benchmark 10 times with default parameters we observed:
- ArrayList benchmark affected directly by JMH overhead is now from 15-80% faster.
- CodecList benchmark is now 50% faster than original (even with the setup code being measured).
- Recyclable ArrayList is ~30% slower.
- All benchmarks have significant different means (ANOVA) and medians (Moore)
Mode: Throughput (Higher the better)
Method Full params Factor Modified (Median) Original (Median)
recyclableArrayList (elements = 1) 0.615520967 21719082.75 35285691.2
recyclableArrayList (elements = 4) 0.699553431 17149442.76 24514843.31
arrayList (elements = 4) 1.152666631 27120407.18 23528404.88
codecOutList (elements = 1) 1.527275908 67251089.04 44033359.47
codecOutList (elements = 4) 1.596917095 59174088.78 37055204.03
arrayList (elements = 1) 1.878616889 62188238.24 33103204.06
Environment:
Tests run on a Computational server with CPU: E5-1660-3.3GHZ (6 cores + HT), 64 GB RAM.
Motivation:
The usage of Invocation level for JMH fixture methods (setup/teardown) inccurs in a significant impact in
in the benchmark time (see org.openjdk.jmh.annotations.Level documentation).
When the benchmark and the setup/teardown is too small (less than a milisecond) the Invocation level might saturate the system with
timestamp requests and iteration synchronizations which introduce artificial latency, throughput, and scalability bottlenecks.
In the HeadersBenchmark, all benchmarks take less than 100ns and the Invocation level setup offsets the measurement considerably.
As fixture methods is defined for the entire class, this overhead also impacts every single benchmark in this class, not only
the ones that use the emptyHttpHeaders object (cleaned in the setup).
The recommended fix patch here is to include the setup/teardown code in the benchmark where the object is used.
Modifications:
Include the setup/teardown code in the relevant benchmark methods.
Remove the setup/teardown method of Invocation level from the benchmark class.
Result:
We run all benchmarks from HeadersBenchmark 10 times with default parameter, we observe:
- Benchmarks that were not directly affected by the fix patch, improved execution time.
For instance, http2Remove with (exampleHeader = THREE) had its median reported as 2x faster than the original version.
- Benchmarks that had the setup code inserted (eg. http2AddAllFastest) did not suffer a significant punch in the execution time,
as the benchmarks are not dominated by the clear().
Environment:
Tests run on a Computational server with CPU: E5-1660-3.3GHZ (6 cores + HT), 64 GB RAM.
Motivation:
NetUtilBenchmark is using out of date data, throws an exception in the benchmark, and allocates a Set on each run.
Modifications:
- Update the benchmark and reduce each run's overhead
Result:
NetUtilBenchmark is updated.
Motivation:
Read-only heap ByteBuffer doesn't expose array: the existent method to perform copies to direct ByteBuf involves the creation of a (maybe pooled) additional heap ByteBuf instance and copy
Modifications:
To avoid stressing the allocator with additional (and stealth) heap ByteBuf allocations is provided a method to perform copies using the (pooled) internal NIO buffer
Result:
Copies from read-only heap ByteBuffer to direct ByteBuf won't create any intermediate ByteBuf
Motivation:
According to the spec:
All pseudo-header fields MUST appear in the header block before regular
header fields. Any request or response that contains a pseudo-header
field that appears in a header block after
a regular header field MUST be treated as malformed (Section 8.1.2.6).
Pseudo-header fields are only valid in the context in which they are defined.
Pseudo-header fields defined for requests MUST NOT appear in responses;
pseudo-header fields defined for responses MUST NOT appear in requests.
Pseudo-header fields MUST NOT appear in trailers.
Endpoints MUST treat a request or response that contains undefined or
invalid pseudo-header fields as malformed (Section 8.1.2.6).
Clients MUST NOT accept a malformed response. Note that these requirements
are intended to protect against several types of common attacks against HTTP;
they are deliberately strict because being permissive can expose
implementations to these vulnerabilities.
Modifications:
- Introduce validation in HPackDecoder
Result:
- Requests with unknown pseudo-field headers are rejected
- Requests with containing response specific pseudo-headers are rejected
- Requests where pseudo-header appear after regular header are rejected
- h2spec 8.1.2.1 pass
Motivation:
We used Recycler for the CodecOutputList which is not optimized for the use-case of access only from the same Thread all the time.
Modifications:
- Use FastThreadLocal for CodecOutputList
- Add benchmark
Result:
Less overhead in our codecs.
Motivation:
The JMH doc suggests to use BlackHoles to avoid dead code elimination hence would be better to follow this best practice.
Modifications:
Each benchmark method is returning the ByteBuf/ByteBuffer to avoid the JVM to perform any dead code elimination.
Result:
The results are more reliable and comparable to the others provided by other ByteBuf benchmarks (eg HeapByteBufBenchmark)
Motivation:
HttpMethod#valueOf shows up on profiler results in the top set of
results. Since it is a relatively simple operation it can be improved in
isolation.
Modifications:
- Introduce a special case map which assigns each HttpMethod to a unique
index in an array and provides constant time lookup from a hash code
algorithm. When the bucket is matched we can then directly do equality
comparison instead of potentially following a linked structure when
HashMap has hash collisions.
Result:
~10% improvement in benchmark results for HttpMethod#valueOf
Benchmark Mode Cnt Score Error Units
HttpMethodMapBenchmark.newMapKnownMethods thrpt 16 31.831 ± 0.928 ops/us
HttpMethodMapBenchmark.newMapMixMethods thrpt 16 25.568 ± 0.400 ops/us
HttpMethodMapBenchmark.newMapUnknownMethods thrpt 16 51.413 ± 1.824 ops/us
HttpMethodMapBenchmark.oldMapKnownMethods thrpt 16 29.226 ± 0.330 ops/us
HttpMethodMapBenchmark.oldMapMixMethods thrpt 16 21.073 ± 0.247 ops/us
HttpMethodMapBenchmark.oldMapUnknownMethods thrpt 16 49.081 ± 0.577 ops/us
Motivation:
DefaultHttp2FrameWriter#writeData allocates a DataFrameHeader for each write operation. DataFrameHeader maintains internal state and allocates multiple slices of a buffer which is a maximum of 30 bytes. This 30 byte buffer may not always be necessary and the additional slice operations can utilize retainedSlice to take advantage of pooled objects. We can also save computation and object allocations if there is no padding which is a common case in practice.
Modifications:
- Remove DataFrameHeader
- Add a fast path for padding == 0
Result:
Less object allocation in DefaultHttp2FrameWriter