Motivation:
CompositeByteBuf is a powerful and versatile abstraction, allowing for
manipulation of large data without copying bytes. There is still a
non-negligible cost to reading/writing however relative to "singular"
ByteBufs, and this can be mostly eliminated with some rework of the
internals.
My use case is message modification/transformation while zero-copy
proxying. For example replacing a string within a large message with one
of a different length
Modifications:
- No longer slice added buffers and unwrap added slices
- Components store target buf offset relative to position in
composite buf
- Less allocations, object footprint, pointer indirection, offset
arithmetic
- Use Component[] rather than ArrayList<Component>
- Avoid pointer indirection and duplicate bounds check, more
efficient backing array growth
- Facilitates optimization when doing bulk-inserts - inserting n
ByteBufs behind m is now O(m + n) instead of O(mn)
- Avoid unnecessary casting and method call indirection via superclass
- Eliminate some duplicate range/ref checks via non-checking versions of
toComponentIndex and findComponent
- Add simple fast-path for toComponentIndex(0); add racy cache of
last-accessed Component to findComponent(int)
- Override forEachByte0(...) and forEachByteDesc0(...) methods
- Make use of RecyclableArrayList in nioBuffers(int, int) (in line with
FasterCompositeByteBuf impl)
- Modify addComponents0(boolean,int,Iterable) to use the Iterable
directly rather than copy to an array first (and possibly to an
ArrayList before that)
- Optimize addComponents0(boolean,int,ByteBuf[],int) to not perform
repeated array insertions and avoid second loop for offset updates
- Simplify other logic in various places, in particular the general
pattern used where a sub-range is iterated over
- Add benchmarks to demonstrate some improvements
While refactoring I also came across a couple of clear bugs. They are
fixed in these changes but I will open another PR with unit tests and
fixes to the current version.
Result:
Much faster creation, manipulation, and access; many fewer allocations
and smaller footprint. Benchmark results to follow.
* Optimize AbstractByteBuf.getCharSequence() in US_ASCII case
Motivation:
Inspired by https://github.com/netty/netty/pull/8388, I noticed this
simple optimization to avoid char[] allocation (also suggested in a TODO
here).
Modifications:
Return an AsciiString from AbstractByteBuf.getCharSequence() if
requested charset is US_ASCII or ISO_8859_1 (latter thanks to
@Scottmitch's suggestion). Also tweak unit tests not to require Strings
and include a new benchmark to demonstrate the speedup.
Result:
Speed-up of AbstractByteBuf.getCharSequence() in ascii and iso 8859/1
cases
Motiviation:
At the moment whenever ensureAccessible() is called in our ByteBuf implementations (which is basically on each operation) we will do a volatile read. That per-se is not such a bad thing but the problem here is that it will also reduce the the optimizations that the compiler / jit can do. For example as these are volatile it can not eliminate multiple loads of it when inline the methods of ByteBuf which happens quite frequently because most of them a quite small and very hot. That is especially true for all the methods that act on primitives.
It gets even worse as people often call a lot of these after each other in the same method or even use method chaining here.
The idea of the change is basically just ue a non-volatile read for the ensureAccessible() check as its a best-effort implementation to detect acting on already released buffers anyway as even with a volatile read it could happen that the user will release it in another thread before we actual access the buffer after the reference check.
Modifications:
- Try to do a non-volatile read using sun.misc.Unsafe if we can use it.
- Add a benchmark
Result:
Big performance win when multiple ByteBuf methods are called from a method.
With the change:
UnsafeByteBufBenchmark.setGetLongUnsafeByteBuf thrpt 20 281395842,128 ± 5050792,296 ops/s
Before the change:
UnsafeByteBufBenchmark.setGetLongUnsafeByteBuf thrpt 20 217419832,801 ± 5080579,030 ops/s
Motivation:
We should use the latest jmh version which also supports -prof dtraceasm on MacOS.
Modifications:
Update to latest jmh version.
Result:
Better benchmark / profiling support on MacOS.
Motivation:
Some of the flags we used are not supported anymore on more recent JDK versions. We should just remove all of them and only keep what we really need. This may also reflect better what people use in production.
Modifications:
Remove some flags when running the benchmarks.
Result:
Benchmarks also run with JDK11.
Motivation:
It is sometimes useful to be able to run benchmarks easily from the commandline and passs different arguments / options here. We should support this.
Modifications:
Add the benchmark-jar profile which allows to generate such an "uber-jar" that can be used directly to run benchmarks as documented at http://openjdk.java.net/projects/code-tools/jmh/.
Result:
More flexible way to run benchmarks.
Motivation:
Optimizing the Epoll channel needs an objective measure of how fast
it is.
Modification:
Add a simple, closed loop, ping-pong benchmark.
Result:
Benchmark can be used to measure #7816
Initial numbers:
```
Result "io.netty.microbench.channel.epoll.EpollSocketChannelBenchmark.pingPong":
22614.403 ±(99.9%) 797.263 ops/s [Average]
(min, avg, max) = (21093.160, 22614.403, 24977.387), stdev = 918.130
CI (99.9%): [21817.140, 23411.666] (assumes normal distribution)
Benchmark Mode Cnt Score Error Units
EpollSocketChannelBenchmark.pingPong thrpt 20 22614.403 ± 797.263 ops/s
```
Motivation:
The JVM isn't always able to hoist out/reduce bounds checking (due to ref counting operations etc etc) hence making it configurable could improve performances for most CPU intensive use cases.
Modifications:
Each AbstractByteBuf bounds check has been tested against a new static final configuration property similar to checkAccessible ie io.netty.buffer.bytebuf.checkBounds.
Result:
Any user could disable ByteBuf bounds checking in order to get extra performances.
Motivation:
The usage of Invocation level for JMH fixture methods (setup/teardown) inccurs in a significant overhead
in the benchmark time (see org.openjdk.jmh.annotations.Level documentation).
In the case of CodecInputListBenchmark, benchmarks are far too small (less than 50ns) and the Invocation
level setup offsets the measurement considerably.
On such cases, the recommended fix patch is to include the setup/teardown code in the benchmark method.
Modifications:
Include the setup/teardown code in the relevant benchmark methods.
Remove the setup/teardown methods from the benchmark class.
Result:
We run the entire benchmark 10 times with default parameters we observed:
- ArrayList benchmark affected directly by JMH overhead is now from 15-80% faster.
- CodecList benchmark is now 50% faster than original (even with the setup code being measured).
- Recyclable ArrayList is ~30% slower.
- All benchmarks have significant different means (ANOVA) and medians (Moore)
Mode: Throughput (Higher the better)
Method Full params Factor Modified (Median) Original (Median)
recyclableArrayList (elements = 1) 0.615520967 21719082.75 35285691.2
recyclableArrayList (elements = 4) 0.699553431 17149442.76 24514843.31
arrayList (elements = 4) 1.152666631 27120407.18 23528404.88
codecOutList (elements = 1) 1.527275908 67251089.04 44033359.47
codecOutList (elements = 4) 1.596917095 59174088.78 37055204.03
arrayList (elements = 1) 1.878616889 62188238.24 33103204.06
Environment:
Tests run on a Computational server with CPU: E5-1660-3.3GHZ (6 cores + HT), 64 GB RAM.
Motivation:
The usage of Invocation level for JMH fixture methods (setup/teardown) inccurs in a significant impact in
in the benchmark time (see org.openjdk.jmh.annotations.Level documentation).
When the benchmark and the setup/teardown is too small (less than a milisecond) the Invocation level might saturate the system with
timestamp requests and iteration synchronizations which introduce artificial latency, throughput, and scalability bottlenecks.
In the HeadersBenchmark, all benchmarks take less than 100ns and the Invocation level setup offsets the measurement considerably.
As fixture methods is defined for the entire class, this overhead also impacts every single benchmark in this class, not only
the ones that use the emptyHttpHeaders object (cleaned in the setup).
The recommended fix patch here is to include the setup/teardown code in the benchmark where the object is used.
Modifications:
Include the setup/teardown code in the relevant benchmark methods.
Remove the setup/teardown method of Invocation level from the benchmark class.
Result:
We run all benchmarks from HeadersBenchmark 10 times with default parameter, we observe:
- Benchmarks that were not directly affected by the fix patch, improved execution time.
For instance, http2Remove with (exampleHeader = THREE) had its median reported as 2x faster than the original version.
- Benchmarks that had the setup code inserted (eg. http2AddAllFastest) did not suffer a significant punch in the execution time,
as the benchmarks are not dominated by the clear().
Environment:
Tests run on a Computational server with CPU: E5-1660-3.3GHZ (6 cores + HT), 64 GB RAM.
Motivation:
NetUtilBenchmark is using out of date data, throws an exception in the benchmark, and allocates a Set on each run.
Modifications:
- Update the benchmark and reduce each run's overhead
Result:
NetUtilBenchmark is updated.
Motivation:
Read-only heap ByteBuffer doesn't expose array: the existent method to perform copies to direct ByteBuf involves the creation of a (maybe pooled) additional heap ByteBuf instance and copy
Modifications:
To avoid stressing the allocator with additional (and stealth) heap ByteBuf allocations is provided a method to perform copies using the (pooled) internal NIO buffer
Result:
Copies from read-only heap ByteBuffer to direct ByteBuf won't create any intermediate ByteBuf
Motivation:
According to the spec:
All pseudo-header fields MUST appear in the header block before regular
header fields. Any request or response that contains a pseudo-header
field that appears in a header block after
a regular header field MUST be treated as malformed (Section 8.1.2.6).
Pseudo-header fields are only valid in the context in which they are defined.
Pseudo-header fields defined for requests MUST NOT appear in responses;
pseudo-header fields defined for responses MUST NOT appear in requests.
Pseudo-header fields MUST NOT appear in trailers.
Endpoints MUST treat a request or response that contains undefined or
invalid pseudo-header fields as malformed (Section 8.1.2.6).
Clients MUST NOT accept a malformed response. Note that these requirements
are intended to protect against several types of common attacks against HTTP;
they are deliberately strict because being permissive can expose
implementations to these vulnerabilities.
Modifications:
- Introduce validation in HPackDecoder
Result:
- Requests with unknown pseudo-field headers are rejected
- Requests with containing response specific pseudo-headers are rejected
- Requests where pseudo-header appear after regular header are rejected
- h2spec 8.1.2.1 pass
Motivation:
We used Recycler for the CodecOutputList which is not optimized for the use-case of access only from the same Thread all the time.
Modifications:
- Use FastThreadLocal for CodecOutputList
- Add benchmark
Result:
Less overhead in our codecs.
Motivation:
The JMH doc suggests to use BlackHoles to avoid dead code elimination hence would be better to follow this best practice.
Modifications:
Each benchmark method is returning the ByteBuf/ByteBuffer to avoid the JVM to perform any dead code elimination.
Result:
The results are more reliable and comparable to the others provided by other ByteBuf benchmarks (eg HeapByteBufBenchmark)
Motivation:
HttpMethod#valueOf shows up on profiler results in the top set of
results. Since it is a relatively simple operation it can be improved in
isolation.
Modifications:
- Introduce a special case map which assigns each HttpMethod to a unique
index in an array and provides constant time lookup from a hash code
algorithm. When the bucket is matched we can then directly do equality
comparison instead of potentially following a linked structure when
HashMap has hash collisions.
Result:
~10% improvement in benchmark results for HttpMethod#valueOf
Benchmark Mode Cnt Score Error Units
HttpMethodMapBenchmark.newMapKnownMethods thrpt 16 31.831 ± 0.928 ops/us
HttpMethodMapBenchmark.newMapMixMethods thrpt 16 25.568 ± 0.400 ops/us
HttpMethodMapBenchmark.newMapUnknownMethods thrpt 16 51.413 ± 1.824 ops/us
HttpMethodMapBenchmark.oldMapKnownMethods thrpt 16 29.226 ± 0.330 ops/us
HttpMethodMapBenchmark.oldMapMixMethods thrpt 16 21.073 ± 0.247 ops/us
HttpMethodMapBenchmark.oldMapUnknownMethods thrpt 16 49.081 ± 0.577 ops/us
Motivation:
DefaultHttp2FrameWriter#writeData allocates a DataFrameHeader for each write operation. DataFrameHeader maintains internal state and allocates multiple slices of a buffer which is a maximum of 30 bytes. This 30 byte buffer may not always be necessary and the additional slice operations can utilize retainedSlice to take advantage of pooled objects. We can also save computation and object allocations if there is no padding which is a common case in practice.
Modifications:
- Remove DataFrameHeader
- Add a fast path for padding == 0
Result:
Less object allocation in DefaultHttp2FrameWriter
Motivation:
`AbstractScheduledEventExecutor` uses a standard `java.util.PriorityQueue` to keep track of task deadlines. `ScheduledFuture.cancel` removes tasks from this `PriorityQueue`. Unfortunately, `PriorityQueue.remove` has `O(n)` performance since it must search for the item in the entire queue before removing it. This is fast when the future is at the front of the queue (e.g., already triggered) but not when it's randomly located in the queue.
Many servers will use `ScheduledFuture.cancel` on all requests, e.g., to manage a request timeout. As these cancellations will be happen in arbitrary order, when there are many scheduled futures, `PriorityQueue.remove` is a bottleneck and greatly hurts performance with many concurrent requests (>10K).
Modification:
Use netty's `DefaultPriorityQueue` for scheduling futures instead of the JDK. `DefaultPriorityQueue` is almost identical to the JDK version except it is able to remove futures without searching for them in the queue. This means `DefaultPriorityQueue.remove` has `O(log n)` performance.
Result:
Before - cancelling futures has varying performance, capped at `O(n)`
After - cancelling futures has stable performance, capped at `O(log n)`
Benchmark results
After - cancelling in order and in reverse order have similar performance within `O(log n)` bounds
```
Benchmark (num) Mode Cnt Score Error Units
ScheduledFutureTaskBenchmark.cancelInOrder 100 thrpt 20 137779.616 ± 7709.751 ops/s
ScheduledFutureTaskBenchmark.cancelInOrder 1000 thrpt 20 11049.448 ± 385.832 ops/s
ScheduledFutureTaskBenchmark.cancelInOrder 10000 thrpt 20 943.294 ± 12.391 ops/s
ScheduledFutureTaskBenchmark.cancelInOrder 100000 thrpt 20 64.210 ± 1.824 ops/s
ScheduledFutureTaskBenchmark.cancelInReverseOrder 100 thrpt 20 167531.096 ± 9187.865 ops/s
ScheduledFutureTaskBenchmark.cancelInReverseOrder 1000 thrpt 20 33019.786 ± 4737.770 ops/s
ScheduledFutureTaskBenchmark.cancelInReverseOrder 10000 thrpt 20 2976.955 ± 248.555 ops/s
ScheduledFutureTaskBenchmark.cancelInReverseOrder 100000 thrpt 20 362.654 ± 45.716 ops/s
```
Before - cancelling in order and in reverse order have significantly different performance at higher queue size, orders of magnitude worse than the new implementation.
```
Benchmark (num) Mode Cnt Score Error Units
ScheduledFutureTaskBenchmark.cancelInOrder 100 thrpt 20 139968.586 ± 12951.333 ops/s
ScheduledFutureTaskBenchmark.cancelInOrder 1000 thrpt 20 12274.420 ± 337.800 ops/s
ScheduledFutureTaskBenchmark.cancelInOrder 10000 thrpt 20 958.168 ± 15.350 ops/s
ScheduledFutureTaskBenchmark.cancelInOrder 100000 thrpt 20 53.381 ± 13.981 ops/s
ScheduledFutureTaskBenchmark.cancelInReverseOrder 100 thrpt 20 123918.829 ± 3642.517 ops/s
ScheduledFutureTaskBenchmark.cancelInReverseOrder 1000 thrpt 20 5099.810 ± 206.992 ops/s
ScheduledFutureTaskBenchmark.cancelInReverseOrder 10000 thrpt 20 72.335 ± 0.443 ops/s
ScheduledFutureTaskBenchmark.cancelInReverseOrder 100000 thrpt 20 0.743 ± 0.003 ops/s
```