Motivation:
PR #5355 modified interfaces to reduce GC related to the HPACK code. However this came with an anticipated performance regression related to HpackUtil.equals due to AsciiString's increase cost of charAt(..). We should mitigate this performance regression.
Modifications:
- Introduce an equals method in PlatformDependent which doesn't leak timing information and use this in HpcakUtil.equals
Result:
Fixes https://github.com/netty/netty/issues/5436
Motivation:
It is good to have used dependencies and plugins up-to-date to fix any undiscovered bug fixed by the authors.
Modification:
Scanned dependencies and plugins and carefully updated one by one.
Result:
Dependencies and plugins are up-to-date.
Motivations:
The HPACK code was not really optimized and written with Netty types in mind. Because of this a lot of garbage was created due heavy object creation.
This was first reported in [#3597] and https://github.com/grpc/grpc-java/issues/1872 .
Modifications:
- Directly use ByteBuf as input and output
- Make use of ByteProcessor where possible
- Use AsciiString as this is the only thing we need for our http2 usage
Result:
Less garbage and better usage of Netty apis.
Motivation:
99dfc9ea79 introduced some code that will more frequently try to forward messages out of the list of decoded messages to reduce latency and memory footprint. Unfortunally this has the side-effect that RecycleableArrayList.clear() will be called more often and so introduce some overhead as ArrayList will null out the array on each call.
Modifications:
- Introduce a CodecOutputList which allows to not null out the array until we recycle it and also allows to access internal array with extra range checks.
- Add benchmark that add elements to different List implementations and clear them
Result:
Less overhead when decode / encode messages.
Benchmark (elements) Mode Cnt Score Error Units
CodecOutputListBenchmark.arrayList 1 thrpt 20 24853764.609 ± 161582.376 ops/s
CodecOutputListBenchmark.arrayList 4 thrpt 20 17310636.508 ± 930517.403 ops/s
CodecOutputListBenchmark.codecOutList 1 thrpt 20 26670751.661 ± 587812.655 ops/s
CodecOutputListBenchmark.codecOutList 4 thrpt 20 25166421.089 ± 166945.599 ops/s
CodecOutputListBenchmark.recyclableArrayList 1 thrpt 20 24565992.626 ± 210017.290 ops/s
CodecOutputListBenchmark.recyclableArrayList 4 thrpt 20 18477881.775 ± 157003.777 ops/s
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 246.748 sec - in io.netty.handler.codec.CodecOutputListBenchmark
Motivation:
We tried to provide the ability for the user to change the semantics of the threading-model by delegate the invoking of the ChannelHandler to the ChannelHandlerInvoker. Unfortunually this not really worked out quite well and resulted in just more complexity and splitting of code that belongs together. We should remove the ChannelHandlerInvoker again and just do the same as in 4.0
Modifications:
Remove ChannelHandlerInvoker again and replace its usage in Http2MultiplexCodec
Result:
Easier code and less bad abstractions.
Motivation:
KObjectHashMap.probeNext(..) usually involves 2 conditional statements and 2 aritmatic operations. This can be improved to have 0 conditional statements.
Modifications:
- Use bit masking to avoid conditional statements
Result:
Improved performance for KObjecthashMap.probeNext(..)
Motivation:
The DefaultHttp2Conneciton.close method accounts for active streams being iterated and attempts to avoid reentrant modifications of the underlying stream map by using iterators to remove from the stream map. However there are a few issues:
- While iterating over the stream map we don't prevent iterations over the active stream collection
- Removing a single stream may actually remove > 1 streams due to closed non-leaf streams being preserved in the priority tree which may result in NPE
Preserving closed non-leaf streams in the priority tree is no longer necessary with our current allocation algorithms, and so this feature (and related complexity) can be removed.
Modifications:
- DefaultHttp2Connection.close should prevent others from iterating over the active streams and reentrant modification scenarios which may result from this
- DefaultHttp2Connection should not keep closed stream in the priority tree
- Remove all associated code in DefaultHttp2RemoteFlowController which accounts for this case including the ReducedState object
- This includes fixing writability changes which depended on ReducedState
- Update unit tests
Result:
Fixes https://github.com/netty/netty/issues/5198
Motivation:
Some codecs should be considered unstable as these are relative new. For this purpose we should introduce an annotation which these codecs should us to be marked as unstable in terms of API.
Modifications:
- Add UnstableApi annotation and use it on codecs that are not stable
- Move http2.hpack to http2.internal.hpack as it is internal.
Result:
Better document unstable APIs.
Motivation:
Before release 4.1.0.Final we should update all our dependencies.
Modifications:
Update dependencies.
Result:
Up-to-date dependencies used.
Motivation:
The current slow path of FastThreadLocal is much slower than JDK ThreadLocal. See #4418
Modifications:
- Add FastThreadLocalSlowPathBenchmark for the flow path of FastThreadLocal
- Add final to speed up the slow path of FastThreadLocal
Result:
The slow path of FastThreadLocal is improved.
Motivation:
See https://github.com/netty/netty-build/issues/5
Modifications:
Add xml-maven-plugin to check indentation and fix violations
Result:
pom.xml will be checked in the PR build
Motivation:
Currently the initial headers for every stream is queued in the flow controller. Since the initial header frame may create streams the peer must receive these frames in the order in which they were created, or else this will be a protocol error and the connection will be closed. Tolerating the initial headers being queued would increase the complexity of the WeightedFairQueueByteDistributor and there is benefit of doing so is not clear.
Modifications:
- The initial headers will no longer be queued in the flow controllers
Result:
Fixes https://github.com/netty/netty/issues/4758
Motivation:
Being able to access the invoker() is useful when adding additional
handlers that should be running in the same thread. Since an application
may be using a threading model unsupported by the default invoker, they
can specify their own. Because of that, in a handler that auto-adds
other handlers:
// This is a good pattern
ctx.pipeline().addBefore(ctx.invoker(), ctx.name(), null, newHandler);
// This will generally work, but prevents using custom invoker.
ctx.pipeline().addBefore(ctx.executor(), ctx.name(), null, newHandler);
That's why I believe in commit 110745b0, for the now-defunct 5.0 branch,
when ChannelHandlerAppender was added the invoker() method was also
necessary.
There is a side-benefit to exposing the invoker: in certain advanced
use-cases using the invoker for a particular handler is useful. Using
the invoker you are able to invoke a _particular_ handler, from possibly
a different thread yet still using standard exception processing.
ChannelHandlerContext does part of that, but is unwieldy when trying to
invoke a particular handler because it invokes the prev or next handler,
not the one the context is for. A workaround is to use the next or prev
context (respectively), but this breaks when the pipeline changes.
This came up during writing the Http2MultiplexCodec which uses a
separate child channel for each http/2 stream and wants to send messages
from the child channel directly to the Http2MultiplexCodec handler that
created it.
Modifications:
Add the invoker() method to ChannelHandlerContext. It was already being
implemented by AbstractChannelHandlerContext. The two other
implementations of ChannelHandlerContext needed minor tweaks.
Result:
Access to the invoker used for a particular handler, for either reusing
for other handlers or for advanced use-cases. Fixes#4738
Motivation:
Javadoc reports errors about invalid docs.
Modifications:
Fix some errors reported by javadoc.
Result:
A lot of javadoc errors are fixed by this patch.
Motivation:
PriorityStreamByteDistributor has been removed but NoPriorityByteDistributionBenchmark in microbench still need it and causes compile error
Modifications:
Remove PriorityStreamByteDistributor from NoPriorityByteDistributionBenchmark
Result:
The compile error has been fixed
Motivation:
The `NoPriorityByteDistibbutionBenchmark` was broken with a recent commit.
Modifications:
Fixed the benchmark to use the new HTTP2 handler builder.
Result:
It builds.
Motivation:
PriorityStreamByteDistributor uses a homegrown algorithm which distributes bytes to nodes in the priority tree. PriorityStreamByteDistributor has no concept of goodput which may result in poor utilization of network resources. PriorityStreamByteDistributor also has performance issues related to the tree traversal approach and number of nodes that must be visited. There also exists some more proven algorithms from the resource scheduling domain which PriorityStreamByteDistributor does not employ.
Modifications:
- Introduce a new ByteDistributor which uses elements from weighted fair queue schedulers
Result:
StreamByteDistributor which is sensitive to priority and uses a more familiar distribution concept.
Fixes https://github.com/netty/netty/issues/4462
Related: #4572
Motivation:
- A user might want to extend Http2ConnectionHandler and define his/her
own static inner Builder class that extends
Http2ConnectionHandler.BuilderBase. This introduces potential
confusion because there's already Http2ConnectionHandler.Builder. Your
IDE will warn about this name duplication as well.
- BuilderBase exposes all setters with public modifier. A user's Builder
might not want to expose them to enforce it to certain configuration.
There's no way to hide them because it's public already and they are
final.
- BuilderBase.build(Http2ConnectionDecoder, Http2ConnectionEncoder)
ignores most properties exposed by BuilderBase, such as
validateHeaders, frameLogger and encoderEnforceMaxConcurrentStreams.
If any build() method ignores the properties exposed by the builder,
there's something wrong.
- A user's Builder that extends BuilderBase might want to require more
parameters in build(). There's no way to do that cleanly because
build() is public and final already.
Modifications:
- Make BuilderBase and Builder top-level so that there's no duplicate
name issue anymore.
- Add AbstractHttp2ConnectionHandlerBuilder
- Add Http2ConnectionHandlerBuilder
- Add HttpToHttp2ConnectionHandlerBuilder
- Make all builder methods in AbstractHttp2ConnectionHandlerBuilder
protected so that a subclass can choose which methods to expose
- Provide only a single build() method
- Add connection() and codec() so that a user can still specify
Http2Connection or Http2Connection(En|De)coder explicitly
- Implement proper state validation mechanism so that it is prevented
to invoke conflicting setters
Result:
Less confusing yet flexible builder API
Motivation:
ChannelMetadata has a field minMaxMessagesPerRead which can be confusing. There are also some cases where static instances are used and the default value for channel type is not being applied.
Modifications:
- use a default value which is set unconditionally to simplify
- make sure static instances of MaxMessagesRecvByteBufAllocator are not used if the intention is that the default maxMessagesPerRead should be derived from the channel type.
Result:
Less confusing interfaces in ChannelMetadata and ChannelConfig. Default maxMessagesPerRead is correctly applied.
Motivation:
2a2059d976 was backported from master, and included an overriden method which does not exist in 4.1.
Modifications:
- Remove the invoker method from NoPriorityByteDistributionBenchmark
Result:
No more build error
Motivation:
The current priority algorithm can yield poor per-stream goodput when either the number of streams is high or the connection window is small. When all priorities are the same (i.e. priority is disabled), we should be able to do better.
Modifications:
Added a new UniformStreamByteDistributor that ignores priority entirely and manages a queue of streams. Each stream is allocated a minimum of 1KiB on each iteration.
Result:
Improved goodput when priority is not used.
Motivation:
The twitter hpack project does not have the support that it used to have. See discussion here: https://github.com/netty/netty/issues/4403.
Modifications:
Created a new module in Netty and copied the latest from twitter hpack master.
Result:
Netty no longer depends on twitter hpack.
Motivation:
The AsciiString.hashCode() method can be optimized. This method is frequently used while to build the DefaultHeaders data structure.
Modification:
- Add a PlatformDependent hashCode algorithm which utilizes UNSAFE if available
Result:
AsciiString hashCode is faster.
Motivation:
Headers and groups of headers are frequently copied and the current mechanism is slower than it needs to be.
Modifications:
Skip name validation and hash computation when they are not necessary.
Fix emergent bug in CombinedHttpHeaders identified with better testing
Fix memory leak in DefaultHttp2Headers when clearing
Added benchmarks
Result:
Faster header copying and some collateral bug fixes
Motivation:
For many HTTP/2 applications (such as gRPC) it is necessary to autorefill the connection window in order to prevent application-level deadlocking.
Consider an application with 2 streams, A and B. A receives a stream of messages and the application pops off one message at a time and makes a request on stream B. However, if receiving of data on A has caused the connection window to collapse, B will not be able to receive any data and the application will deadlock. The only way (currently) to get around this is 1) use multiple connections, or 2) manually refill the connection window. Both are undesirable and could needlessly complicate the application code.
Modifications:
Add a configuration option to DefaultHttp2LocalFlowController, allowing it to autorefill the connection window.
Result:
Applications can configure HTTP/2 to avoid inter-stream deadlocking.
Motivation:
We should allow our custom Executor to shutdown quickly.
Modifications:
Call super constructor which correct arguments.
Result:
Custom Executor can be shutdown quickly.
Motivation:
The HTTP/2 RFC (https://tools.ietf.org/html/rfc7540#section-8.1.2) indicates that header names consist of ASCII characters. We currently use ByteString to represent HTTP/2 header names. The HTTP/2 RFC (https://tools.ietf.org/html/rfc7540#section-10.3) also eludes to header values inheriting the same validity characteristics as HTTP/1.x. Using AsciiString for the value type of HTTP/2 headers would allow for re-use of predefined HTTP/1.x values, and make comparisons more intuitive. The Headers<T> interface could also be expanded to allow for easier use of header types which do not have the same Key and Value type.
Motivation:
- Change Headers<T> to Headers<K, V>
- Change Http2Headers<ByteString> to Http2Headers<CharSequence, CharSequence>
- Remove ByteString. Having AsciiString extend ByteString complicates equality comparisons when the hash code algorithm is no longer shared.
Result:
Http2Header types are more representative of the HTTP/2 RFC, and relationship between HTTP/2 header name/values more directly relates to HTTP/1.x header names/values.
Motivation:
As reported in #4402, the FastThreadLocalBenchmark shows that the JDK ThreadLocal
is actually faster than Netty's custom thread local implementation.
I was looking forward to doing some deep digging, but got disappointed :(.
Modifications:
The microbenchmark was not using FastThreadLocalThreads and would thus always hit the slow path.
I updated the JMH command line flags, so that FastThreadLocalThreads would be used.
Result:
FastThreadLocalBenchmark shows FastThreadLocal to be faster than JDK's ThreadLocal implementation,
by about 56% in this particular benchmark. Run on OSX El Capitan with OpenJDK 1.8u60.
Benchmark Mode Cnt Score Error Units
FastThreadLocalBenchmark.fastThreadLocal thrpt 20 55452.027 ± 725.713 ops/s
FastThreadLocalBenchmark.jdkThreadLocalGet thrpt 20 35481.888 ± 1471.647 ops/s
Motivation:
To prove one implementation is faster as the other we should have a benchmark.
Modifications:
Add benchmark which benchmarks the unsafe and non-unsafe implementation of HeapByteBuf.
Result:
Able to compare speed of implementations easily.
Motivation:
Modulo operations are slow, we can use bitwise operation to detect if resource leak detection must be done while sampling.
Modifications:
- Ensure the interval is a power of two
- Use bitwise operation for sampling
- Add benchmark.
Result:
Faster sampling.
Motivation:
The build fails on OSX, due to it trying to pull in an epoll specific OSX dependency. See #4409.
Modifications:
* move netty-transport-native-epoll to linux profile
* exclude Http2FrameWriterBenchmark from compiler
* include Http2FrameWriterBenchmark back only in linux profile (please check)
Result:
Build succeeds on OSX.
Motivation:
SlicedByteBuf can be used for any ByteBuf implementations and so can not do any optimizations that could be done
when AbstractByteBuf is sliced.
Modifications:
- Add SlicedAbstractByteBuf that can eliminate range and reference count checks for _get* and _set* methods.
Result:
Faster SlicedByteBuf implementations for AbstractByteBuf sub-classes.
Motivation:
Calling AbstractByteBuf.toString(..., Charset) is used quite frequently by users but produce a lot of GC.
Modification:
- Use a FastThreadLocal to store the CharBuffer that are needed for decoding.
- Use internalNioBuffer(...) when possible
Result:
Less object creation / Less GC
Motiviation:
Checking reference count on every access on a ByteBuf can have some big performance overhead depending on how the access pattern is. If the user is sure that there are no reference count errors on his side it should be possible to disable the check and so gain the max performance.
Modification:
- Add io.netty.buffer.bytebuf.checkAccessible system property which allows to disable the checks. Enabled by default.
- Add microbenchmark
Result:
Increased performance for operations on the ByteBuf.
Motivation:
When dealing with case insensitive headers it can be useful to have a case insensitive contains method for CharSequence.
Modifications:
- Add containsCaseInsensative to AsciiString
Result:
More expressive utility method for case insensitive CharSequence.
Motivation:
Using the builder pattern for Http2ConnectionHandler (and subclasses) would be advantageous for the following reasons:
1. Provides the consistent construction afforded by the builder pattern for 'optional' arguments. Users can specify these options 1 time in the builder and then re-use the builder after this.
2. Enforces that the Http2ConnectionHandler's internals (decoder Http2FrameListener) are initialized after construction.
Modifications:
- Add an extensible builder which can be used to build Http2ConnectionHandler objects
- Update classes which inherit from Http2ConnectionHandler
Result:
It is easier to specify options and construct Http2ConnectionHandler objects.
Motivation:
It is often the case that implementations of Http2FrameListener will want to send responses when data is read. The Http2FrameListener needs access to the Http2ConnectionHandler (or the encoder contained within) to be able to send responses. However the Http2ConnectionHandler requires a Http2FrameListener instance to be passed in during construction time. This creates a cyclic dependency which can make it difficult to cleanly accomplish this relationship.
Modifications:
- Add Http2ConnectionDecoder.frameListener(..) method to set the frame listener. This will allow the listener to be set after construction.
Result:
Classes which inherit from Http2ConnectionHandler can more cleanly set the Http2FrameListener.
Motivation:
For implementations that want to manage flow control down to the stream level it is useful to be notified when stream writability changes.
Modifications:
- Add writabilityChanged to Http2RemoteFlowController.Listener
- Add isWritable to Http2RemoteFlowController
Result:
The Http2RemoteFlowController provides notification when writability of a stream changes.
Motivation:
The latest netty-tcnative fixes a bug in determining the version of the runtime openssl lib. It also publishes an artificact with the classifier linux-<arch>-fedora for fedora-based systems.
Modifications:
Modified the build files to use the "-fedora" classifier when appropriate for tcnative. Care is taken, however, to not change the classifier for the native epoll transport.
Result:
Netty is updated the the new shiny netty-tcnative.
Motivation:
A degradation in performance has been observed from the 4.0 branch as documented in https://github.com/netty/netty/issues/3962.
Modifications:
- Simplify Headers class hierarchy.
- Restore the DefaultHeaders to be based upon DefaultHttpHeaders from 4.0.
- Make various other modifications that are causing hot spots.
Result:
Performance is now on par with 4.0.
Motivation:
We noticed that the headers implementation in Netty for HTTP/2 uses quite a lot of memory
and that also at least the performance of randomly accessing a header is quite poor. The main
concern however was memory usage, as profiling has shown that a DefaultHttp2Headers
not only use a lot of memory it also wastes a lot due to the underlying hashmaps having
to be resized potentially several times as new headers are being inserted.
This is tracked as issue #3600.
Modifications:
We redesigned the DefaultHeaders to simply take a Map object in its constructor and
reimplemented the class using only the Map primitives. That way the implementation
is very concise and hopefully easy to understand and it allows each concrete headers
implementation to provide its own map or to even use a different headers implementation
for processing requests and writing responses i.e. incoming headers need to provide
fast random access while outgoing headers need fast insertion and fast iteration. The
new implementation can support this with hardly any code changes. It also comes
with the advantage that if the Netty project decides to add a third party collections library
as a dependency, one can simply plug in one of those very fast and memory efficient map
implementations and get faster and smaller headers for free.
For now, we are using the JDK's TreeMap for HTTP and HTTP/2 default headers.
Result:
- Significantly fewer lines of code in the implementation. While the total commit is still
roughly 400 lines less, the actual implementation is a lot less. I just added some more
tests and microbenchmarks.
- Overall performance is up. The current implementation should be significantly faster
for insertion and retrieval. However, it is slower when it comes to iteration. There is simply
no way a TreeMap can have the same iteration performance as a linked list (as used in the
current headers implementation). That's totally fine though, because when looking at the
benchmark results @ejona86 pointed out that the performance of the headers is completely
dominated by insertion, that is insertion is so significantly faster in the new implementation
that it does make up for several times the iteration speed. You can't iterate what you haven't
inserted. I am demonstrating that in this spreadsheet [1]. (Actually, iteration performance is
only down for HTTP, it's significantly improved for HTTP/2).
- Memory is down. The implementation with TreeMap uses on avg ~30% less memory. It also does not
produce any garbage while being resized. In load tests for GRPC we have seen a memory reduction
of up to 1.2KB per RPC. I summarized the memory improvements in this spreadsheet [1]. The data
was generated by [2] using JOL.
- While it was my original intend to only improve the memory usage for HTTP/2, it should be similarly
improved for HTTP, SPDY and STOMP as they all share a common implementation.
[1] https://docs.google.com/spreadsheets/d/1ck3RQklyzEcCLlyJoqDXPCWRGVUuS-ArZf0etSXLVDQ/edit#gid=0
[2] https://gist.github.com/buchgr/4458a8bdb51dd58c82b4
Motivation:
The HttpObjectDecoder is on the hot code path for the http codec. There are a few hot methods which can be modified to improve performance.
Modifications:
- Modify AppendableCharSequence to provide unsafe methods which don't need to re-check bounds for every call.
- Update HttpObjectDecoder methods to take advantage of new AppendableCharSequence methods.
Result:
Peformance boost for decoding http objects.
Motivation:
See #3783
Modifications:
- The DefaultHttp2RemoteFlowController should use Channel.isWritable() before attempting to do any write operations.
- The Flow controller methods should no longer take ChannelHandlerContext. The concept of flow control is tied to a connection and we do not support 1 flow controller keeping track of multiple ChannelHandlerContext.
Result:
Writes are delayed until isWritable() is true. Flow controller interface methods are more clear as to ChannelHandlerContext restrictions.
Motivation:
Coalescing many small writes into a larger DATA frame reduces framing overheads on the wire and reduces the number of calls to Http2FrameListeners on the remote side.
Delaying the write of WINDOW_UPDATE until flush allows for more consumed bytes to be returned as the aggregate of consumed bytes is returned and not the amount consumed when the threshold was crossed.
Modifications:
- Remote flow controller no longer immediately writes bytes when a flow-controlled payload is enqueued. Sequential data payloads are now merged into a single CompositeByteBuf which are written when 'writePendingBytes' is called.
- Listener added to remote flow-controller which observes written bytes per stream.
- Local flow-controller no longer immediately writes WINDOW_UPDATE when the ratio threshold is crossed. Now an explicit call to 'writeWindowUpdates' triggers the WINDOW_UPDATE for all streams who's ratio is exceeded at that time. This results in
fewer window updates being sent and more bytes being returned.
- Http2ConnectionHandler.flush triggers 'writeWindowUpdates' on the local flow-controller followed by 'writePendingBytes' on the remote flow-controller so WINDOW_UPDATES preceed DATA frames on the wire.
Result:
- Better throughput for writing many small DATA chunks followed by a flush, saving 9-bytes per coalesced frame.
- Fewer WINDOW_UPDATES being written and more flow-control bytes returned to remote side more quickly, thereby improving throughput.
Motivation:
The ByteString class currently assumes the underlying array will be a complete representation of data. This is limiting as it does not allow a subsection of another array to be used. The forces copy operations to take place to compensate for the lack of API support.
Modifications:
- add arrayOffset method to ByteString
- modify all ByteString and AsciiString methods that loop over or index into the underlying array to use this offset
- update all code that uses ByteString.array to ensure it accounts for the offset
- add unit tests to test the implementation respects the offset
Result:
ByteString and AsciiString can represent a sub region of a byte[].
Motivation:
Streams currently maintain a hash map of user-defined properties, which has been shown to add significant memory overhead as well as being a performance bottleneck for lookup of frequently used properties.
Modifications:
Modifying the connection/stream to use an array as the storage of user-defined properties, indexed by the class that identifies the index into the array where the property is stored.
Result:
Stream processing performance should be improved.
Motivation:
Flow control is a required part of the HTTP/2 specification but it is currently structured more like an optional item. It must be accessed through the property map which is time consuming and does not represent its required nature. This access pattern does not give any insight into flow control outside of the codec (or flow controller implementation).
Modifications:
1. Create a read only public interface for LocalFlowState and RemoteFlowState.
2. Add a LocalFlowState localFlowState(); and RemoteFlowState remoteFlowState(); to Http2Stream.
Result:
Flow control is not part of the Http2Stream interface. This clarifies its responsibility and logical relationship to other interfaces. The flow controller no longer must be acquired though a map lookup.
Motivation:
There is no benchmark to measure the priority tree implementation performance.
Modifications:
Introduce a new benchmark which will populate the priority tree, and then shuffle parent/child links around.
Result:
A simple benchmark to get a baseline for the HTTP/2 codec's priority tree implementation.
Motivation:
Allows for running benchmarks from built jars which is useful in development environments that only take released artifacts.
Modifications:
Move benchmarks into 'main' from 'test'
Add @State annotations to benchmarks that are missing them
Fix timing issue grabbing context during channel initialization
Result:
Users can run benchmarks more easily.
Motivation:
The current implementation does byte by byte comparison, which we have seen
can be a performance bottleneck when the AsciiString is used as the key in
a Map.
Modifications:
Use sun.misc.Unsafe (on supporting platforms) to compare up to eight bytes at a time
and get closer to the performance of String.equals(Object).
Result:
Significant improvement (2x - 6x) in performance over the current implementation.
Benchmark (size) Mode Samples Score Score error Units
i.n.m.i.PlatformDependentBenchmark.arraysBytesEqual 10 thrpt 10 118843477.518 2347259.347 ops/s
i.n.m.i.PlatformDependentBenchmark.arraysBytesEqual 50 thrpt 10 43910319.773 198376.996 ops/s
i.n.m.i.PlatformDependentBenchmark.arraysBytesEqual 100 thrpt 10 26339969.001 159599.252 ops/s
i.n.m.i.PlatformDependentBenchmark.arraysBytesEqual 1000 thrpt 10 2873119.030 20779.056 ops/s
i.n.m.i.PlatformDependentBenchmark.arraysBytesEqual 10000 thrpt 10 306370.450 1933.303 ops/s
i.n.m.i.PlatformDependentBenchmark.arraysBytesEqual 100000 thrpt 10 25750.415 108.391 ops/s
i.n.m.i.PlatformDependentBenchmark.unsafeBytesEqual 10 thrpt 10 248077563.510 635320.093 ops/s
i.n.m.i.PlatformDependentBenchmark.unsafeBytesEqual 50 thrpt 10 128198943.138 614827.548 ops/s
i.n.m.i.PlatformDependentBenchmark.unsafeBytesEqual 100 thrpt 10 86195621.349 1063959.307 ops/s
i.n.m.i.PlatformDependentBenchmark.unsafeBytesEqual 1000 thrpt 10 16920264.598 61615.365 ops/s
i.n.m.i.PlatformDependentBenchmark.unsafeBytesEqual 10000 thrpt 10 1687454.747 6367.602 ops/s
i.n.m.i.PlatformDependentBenchmark.unsafeBytesEqual 100000 thrpt 10 153717.851 586.916 ops/s
Motivation:
The usage and code within AsciiString has exceeded the original design scope for this class. Its usage as a binary string is confusing and on the verge of violating interface assumptions in some spots.
Modifications:
- ByteString will be created as a base class to AsciiString. All of the generic byte handling processing will live in ByteString and all the special character encoding will live in AsciiString.
Results:
The AsciiString interface will be clarified. Users of AsciiString can now be clear of the limitations the class imposes while users of the ByteString class don't have to live with those limitations.
Motivation:
The Http2FrameWriterBenchmark JMH harness class name was not updated for the JVM arguments. The number of forks is 0 which means the JHM will share a JVM with the benchmarks. Sharing the JVM may lead to less reliable benchmarking results and as doesn't allow for the command line arguments to be applied for each benchmark.
Modifications:
- Update the JMH version from 0.9 to 1.7.1. Benchmarks wouldn't run on old version.
- Increase the number of forks from 0 to 1.
- Remove allocation of environment from static and cleanup AfterClass to using the Setup and Teardown methods. The forked JVM would not shut down correctly otherwise (and wait for 30+ seconds before timeing out).
Result:
Benchmarks that run as intended.
Motivation:
It currently takes a builder for the encoder and decoder, which makes it difficult to decorate them.
Modifications:
Removed the builders from the interfaces entirely. Left the builder for the decoder impl but removed it from the encoder since it's constructor only takes 2 parameters. Also added decorator base classes for the encoder and decoder and made the CompressorHttp2ConnectionEncoder extend the decorator.
Result:
Fixes#3530
Motivation:
The backport of a6c729bdf8 failed.
Modifications:
- Make sure the interfaces are correctly implemented when backporting.
Result:
Microbenchmark compiles and runs on 4.1 branch.
Motivation:
A microbenchmark will be useful to get a baseline for performance.
Modifications:
- Introduce a new microbenchmark which tests the Http2DefaultFrameWriter.
- Allow benchmarks to run without thread context switching between JMH and Netty.
Result:
Microbenchmark exists to test performance.
Motivation
----------
The performance tests for utf8 also used the getBytes on ASCII,
which is incorrect and also provides different performance numbers.
Modifications
-------------
Use CharsetUtil.UTF_8 instead of US_ASCII for the getBytes calls.
Result
------
Accurate and semantically correct benchmarking results on utf8
comparisons.
Motivation:
We expose no methods in ByteBuf to directly write a CharSequence into it. This leads to have the user either convert the CharSequence first to a byte array or use CharsetEncoder. Both cases have some overheads and we can do a lot better for well known Charsets like UTF-8 and ASCII.
Modifications:
Add ByteBufUtil.writeAscii(...) and ByteBufUtil.writeUtf8(...) which can do the task in an optimized way. This is especially true if the passed in ByteBuf extends AbstractByteBuf which is true for all of our implementations which not wrap another ByteBuf.
Result:
Writing an ASCII and UTF-8 CharSequence into a AbstractByteBuf is a lot faster then what the user could do by himself as we can make use of some package private methods and so eliminate reference and range checks. When the Charseq is not ASCII or UTF-8 we can still do a very good job and are on par in most of the cases with what the user would do.
The following benchmark shows the improvements:
Result: 2456866.966 ?(99.9%) 59066.370 ops/s [Average]
Statistics: (min, avg, max) = (2297025.189, 2456866.966, 2586003.225), stdev = 78851.914
Confidence interval (99.9%): [2397800.596, 2515933.336]
Benchmark Mode Samples Score Score error Units
i.n.m.b.ByteBufUtilBenchmark.writeAscii thrpt 50 9398165.238 131503.098 ops/s
i.n.m.b.ByteBufUtilBenchmark.writeAsciiString thrpt 50 9695177.968 176684.821 ops/s
i.n.m.b.ByteBufUtilBenchmark.writeAsciiStringViaArray thrpt 50 4788597.415 83181.549 ops/s
i.n.m.b.ByteBufUtilBenchmark.writeAsciiStringViaArrayWrapped thrpt 50 4722297.435 98984.491 ops/s
i.n.m.b.ByteBufUtilBenchmark.writeAsciiStringWrapped thrpt 50 4028689.762 66192.505 ops/s
i.n.m.b.ByteBufUtilBenchmark.writeAsciiViaArray thrpt 50 3234841.565 91308.009 ops/s
i.n.m.b.ByteBufUtilBenchmark.writeAsciiViaArrayWrapped thrpt 50 3311387.474 39018.933 ops/s
i.n.m.b.ByteBufUtilBenchmark.writeAsciiWrapped thrpt 50 3379764.250 66735.415 ops/s
i.n.m.b.ByteBufUtilBenchmark.writeUtf8 thrpt 50 5671116.821 101760.081 ops/s
i.n.m.b.ByteBufUtilBenchmark.writeUtf8String thrpt 50 5682733.440 111874.084 ops/s
i.n.m.b.ByteBufUtilBenchmark.writeUtf8StringViaArray thrpt 50 3564548.995 55709.512 ops/s
i.n.m.b.ByteBufUtilBenchmark.writeUtf8StringViaArrayWrapped thrpt 50 3621053.671 47632.820 ops/s
i.n.m.b.ByteBufUtilBenchmark.writeUtf8StringWrapped thrpt 50 2634029.071 52304.876 ops/s
i.n.m.b.ByteBufUtilBenchmark.writeUtf8ViaArray thrpt 50 3397049.332 57784.119 ops/s
i.n.m.b.ByteBufUtilBenchmark.writeUtf8ViaArrayWrapped thrpt 50 3318685.262 35869.562 ops/s
i.n.m.b.ByteBufUtilBenchmark.writeUtf8Wrapped thrpt 50 2473791.249 46423.114 ops/s
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1,387.417 sec - in io.netty.microbench.buffer.ByteBufUtilBenchmark
Results :
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0
Results :
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0
The *ViaArray* benchmarks are basically doing a toString().getBytes(Charset) which the others are using ByteBufUtil.write*(...).
Motivation:
The java implementations for Inet6Address.getHostName() do not follow the RFC 5952 (http://tools.ietf.org/html/rfc5952#section-4) for recommended string representation. This introduces inconsistencies when integrating with other technologies that do follow the RFC.
Modifications:
-NetUtil.java to have another public static method to convert InetAddress to string. Inet4Address will use the java InetAddress.getHostAddress() implementation and there will be new code to implement the RFC 5952 IPV6 string conversion.
-New unit tests to test the new method
Result:
Netty provides a RFC 5952 compliant string conversion method for IPV6 addresses
Motivation:
default*() tests are performing a test in a different way, and they must be same with other tests.
Modification:
Make sure default*() tests are same with the others
Result:
Easier to compare default and non-default allocators
Motivation:
When Netty runs in a managed environment such as web application server,
Netty needs to provide an explicit way to remove the thread-local
variables it created to prevent class loader leaks.
FastThreadLocal uses different execution paths for storing a
thread-local variable depending on the type of the current thread.
It increases the complexity of thread-local removal.
Modifications:
- Moved FastThreadLocal and FastThreadLocalThread out of the internal
package so that a user can use it.
- FastThreadLocal now keeps track of all thread local variables it has
initialized, and calling FastThreadLocal.removeAll() will remove all
thread-local variables of the caller thread.
- Added FastThreadLocal.size() for diagnostics and tests
- Introduce InternalThreadLocalMap which is a mixture of hard-wired
thread local variable fields and extensible indexed variables
- FastThreadLocal now uses InternalThreadLocalMap to implement a
thread-local variable.
- Added ThreadDeathWatcher.unwatch() so that PooledByteBufAllocator
tells it to stop watching when its thread-local cache has been freed
by FastThreadLocal.removeAll().
- Added FastThreadLocalTest to ensure that removeAll() works
- Added microbenchmark for FastThreadLocal and JDK ThreadLocal
- Upgraded to JMH 0.9
Result:
- A user can remove all thread-local variables Netty created, as long as
he or she did not exit from the current thread. (Note that there's no
way to remove a thread-local variable from outside of the thread.)
- FastThreadLocal exposes more useful operations such as isSet() because
we always implement a thread local variable via InternalThreadLocalMap
instead of falling back to JDK ThreadLocal.
- FastThreadLocalBenchmark shows that this change improves the
performance of FastThreadLocal even more.
Motivation:
Provide a faster ThreadLocal implementation
Modification:
Add a "FastThreadLocal" which uses an EnumMap and a predefined fixed set of possible thread locals (all of the static instances created by netty) that is around 10-20% faster than standard ThreadLocal in my benchmarks (and can be seen having an effect in the direct PooledByteBufAllocator benchmark that uses the DEFAULT ByteBufAllocator which uses this FastThreadLocal, as opposed to normal instantiations that do not, and in the new RecyclableArrayList benchmark);
Result:
Improved performance
Motivation:
Our Unsafe*ByteBuf implementation always invert bytes when the native ByteOrder is LITTLE_ENDIAN (this is true on intel), even when the user calls order(ByteOrder.LITTLE_ENDIAN). This is not optimal for performance reasons as the user should be able to set the ByteOrder to LITTLE_ENDIAN and so write bytes without the extra inverting.
Modification:
- Introduce a new special SwappedByteBuf (called UnsafeDirectSwappedByteBuf) that is used by all the Unsafe*ByteBuf implementation and allows to write without inverting the bytes.
- Add benchmark
- Upgrade jmh to 0.8
Result:
The user is be able to get the max performance even on servers that have ByteOrder.LITTLE_ENDIAN as their native ByteOrder.
Motivation:
Allocating a single buffer and releasing it repetitively for a benchmark will not involve the realistic execution path of the allocators.
Modifications:
Keep the last 8192 allocations and release them randomly.
Result:
We are now getting the result close to what we got with caliper.
The API changes made so far turned out to increase the memory footprint
and consumption while our intention was actually decreasing them.
Memory consumption issue:
When there are many connections which does not exchange data frequently,
the old Netty 4 API spent a lot more memory than 3 because it always
allocates per-handler buffer for each connection unless otherwise
explicitly stated by a user. In a usual real world load, a client
doesn't always send requests without pausing, so the idea of having a
buffer whose life cycle if bound to the life cycle of a connection
didn't work as expected.
Memory footprint issue:
The old Netty 4 API decreased overall memory footprint by a great deal
in many cases. It was mainly because the old Netty 4 API did not
allocate a new buffer and event object for each read. Instead, it
created a new buffer for each handler in a pipeline. This works pretty
well as long as the number of handlers in a pipeline is only a few.
However, for a highly modular application with many handlers which
handles connections which lasts for relatively short period, it actually
makes the memory footprint issue much worse.
Changes:
All in all, this is about retaining all the good changes we made in 4 so
far such as better thread model and going back to the way how we dealt
with message events in 3.
To fix the memory consumption/footprint issue mentioned above, we made a
hard decision to break the backward compatibility again with the
following changes:
- Remove MessageBuf
- Merge Buf into ByteBuf
- Merge ChannelInboundByte/MessageHandler and ChannelStateHandler into ChannelInboundHandler
- Similar changes were made to the adapter classes
- Merge ChannelOutboundByte/MessageHandler and ChannelOperationHandler into ChannelOutboundHandler
- Similar changes were made to the adapter classes
- Introduce MessageList which is similar to `MessageEvent` in Netty 3
- Replace inboundBufferUpdated(ctx) with messageReceived(ctx, MessageList)
- Replace flush(ctx, promise) with write(ctx, MessageList, promise)
- Remove ByteToByteEncoder/Decoder/Codec
- Replaced by MessageToByteEncoder<ByteBuf>, ByteToMessageDecoder<ByteBuf>, and ByteMessageCodec<ByteBuf>
- Merge EmbeddedByteChannel and EmbeddedMessageChannel into EmbeddedChannel
- Add SimpleChannelInboundHandler which is sometimes more useful than
ChannelInboundHandlerAdapter
- Bring back Channel.isWritable() from Netty 3
- Add ChannelInboundHandler.channelWritabilityChanges() event
- Add RecvByteBufAllocator configuration property
- Similar to ReceiveBufferSizePredictor in Netty 3
- Some existing configuration properties such as
DatagramChannelConfig.receivePacketSize is gone now.
- Remove suspend/resumeIntermediaryDeallocation() in ByteBuf
This change would have been impossible without @normanmaurer's help. He
fixed, ported, and improved many parts of the changes.
- Rename directbyDefault to preferDirect
- Add a system property 'io.netty.prederDirect' to allow a user from changing the preference on launch-time
- Merge UnpooledByteBufAllocator.DEFAULT_BY_* to DEFAULT
- Fixes#826
Unsafe.isFreed(), free(), suspend/resumeIntermediaryAllocations() are not that dangerous. internalNioBuffer() and internalNioBuffers() are dangerous but it seems like nobody is using it even inside Netty. Removing those two methods also removes the necessity to keep Unsafe interface at all.
- Add common optimization options when launching a new JVM to run a benchmark
- Fix a bug where a benchmark report is uploaded twice
- Simplify pom.xml and move the build instruction messages to DefaultBenchmark
- Print an empty line to prettify the output
This pull request introduces the new default ByteBufAllocator implementation based on jemalloc, with a some differences:
* Minimum possible buffer capacity is 16 (jemalloc: 2)
* Uses binary heap with random branching (jemalloc: red-black tree)
* No thread-local cache yet (jemalloc has thread-local cache)
* Default page size is 8 KiB (jemalloc: 4 KiB)
* Default chunk size is 16 MiB (jemalloc: 2 MiB)
* Cannot allocate a buffer bigger than the chunk size (jemalloc: possible) because we don't have control over memory layout in Java. A user can work around this issue by creating a composite buffer, but it's not always a feasible option. Although 16 MiB is a pretty big default, a user's handler might need to deal with the bounded buffers when the user wants to deal with a large message.
Also, to ensure the new allocator performs good enough, I wrote a microbenchmark for it and made it a dedicated Maven module. It uses Google's Caliper framework to run and publish the test result (example)
Miscellaneous changes:
* Made some ByteBuf implementations public so that those who implements a new allocator can make use of them.
* Added ByteBufAllocator.compositeBuffer() and its variants.
* ByteBufAllocator.ioBuffer() creates a buffer with 0 capacity.