netty5

Author	SHA1	Message	Date
Nick Hill	583d838f7c	Optimize AbstractByteBuf.getCharSequence() in US_ASCII case (#8392 ) * Optimize AbstractByteBuf.getCharSequence() in US_ASCII case Motivation: Inspired by https://github.com/netty/netty/pull/8388, I noticed this simple optimization to avoid char[] allocation (also suggested in a TODO here). Modifications: Return an AsciiString from AbstractByteBuf.getCharSequence() if requested charset is US_ASCII or ISO_8859_1 (latter thanks to @Scottmitch's suggestion). Also tweak unit tests not to require Strings and include a new benchmark to demonstrate the speedup. Result: Speed-up of AbstractByteBuf.getCharSequence() in ascii and iso 8859/1 cases	2018-10-26 15:32:38 -07:00
Norman Maurer	87ec2f882a	Reduce overhead by ByteBufUtil.decodeString(...) which is used by `AbstractByteBuf.toString(...)` and `AbstractByteBuf.getCharSequence(...)` (#8388 ) Motivation: Our current implementation that is used for toString(Charset) operations on AbstractByteBuf implementations is quite slow as it does a lot of unnecessary memory copies. We should just use new String(...) as it has a lot of optimizations to handle these cases. Modifications: Rewrite ByteBufUtil.decodeString(...) to use new String(...) Result: Less overhead for toString(Charset) operations. Benchmark (charsetName) (direct) (size) Mode Cnt Score Error Units ByteBufUtilDecodeStringBenchmark.decodeString US-ASCII false 8 thrpt 20 22401645.093 ± 4671452.479 ops/s ByteBufUtilDecodeStringBenchmark.decodeString US-ASCII false 64 thrpt 20 23678483.384 ± 3749164.446 ops/s ByteBufUtilDecodeStringBenchmark.decodeString US-ASCII true 8 thrpt 20 15731142.651 ± 3782931.591 ops/s ByteBufUtilDecodeStringBenchmark.decodeString US-ASCII true 64 thrpt 20 16244232.229 ± 1886259.658 ops/s ByteBufUtilDecodeStringBenchmark.decodeString UTF-8 false 8 thrpt 20 25983680.959 ± 5045782.289 ops/s ByteBufUtilDecodeStringBenchmark.decodeString UTF-8 false 64 thrpt 20 26235589.339 ± 2867004.950 ops/s ByteBufUtilDecodeStringBenchmark.decodeString UTF-8 true 8 thrpt 20 18499027.808 ± 4784684.268 ops/s ByteBufUtilDecodeStringBenchmark.decodeString UTF-8 true 64 thrpt 20 16825286.141 ± 1008712.342 ops/s ByteBufUtilDecodeStringBenchmark.decodeString UTF-16 false 8 thrpt 20 5789879.092 ± 1201786.359 ops/s ByteBufUtilDecodeStringBenchmark.decodeString UTF-16 false 64 thrpt 20 2173243.225 ± 417809.341 ops/s ByteBufUtilDecodeStringBenchmark.decodeString UTF-16 true 8 thrpt 20 5035583.011 ± 1001978.854 ops/s ByteBufUtilDecodeStringBenchmark.decodeString UTF-16 true 64 thrpt 20 2162345.301 ± 402410.408 ops/s ByteBufUtilDecodeStringBenchmark.decodeString ISO-8859-1 false 8 thrpt 20 30039052.376 ± 6539111.622 ops/s ByteBufUtilDecodeStringBenchmark.decodeString ISO-8859-1 false 64 thrpt 20 31414163.515 ± 2096710.526 ops/s ByteBufUtilDecodeStringBenchmark.decodeString ISO-8859-1 true 8 thrpt 20 19538587.855 ± 4639115.572 ops/s ByteBufUtilDecodeStringBenchmark.decodeString ISO-8859-1 true 64 thrpt 20 19467839.722 ± 1672687.213 ops/s ByteBufUtilDecodeStringBenchmark.decodeStringOld US-ASCII false 8 thrpt 20 10787326.745 ± 1034197.864 ops/s ByteBufUtilDecodeStringBenchmark.decodeStringOld US-ASCII false 64 thrpt 20 7129801.930 ± 1363019.209 ops/s ByteBufUtilDecodeStringBenchmark.decodeStringOld US-ASCII true 8 thrpt 20 9002529.605 ± 2017642.445 ops/s ByteBufUtilDecodeStringBenchmark.decodeStringOld US-ASCII true 64 thrpt 20 3860192.352 ± 826218.738 ops/s ByteBufUtilDecodeStringBenchmark.decodeStringOld UTF-8 false 8 thrpt 20 10532838.027 ± 2151743.968 ops/s ByteBufUtilDecodeStringBenchmark.decodeStringOld UTF-8 false 64 thrpt 20 7185554.597 ± 1387685.785 ops/s ByteBufUtilDecodeStringBenchmark.decodeStringOld UTF-8 true 8 thrpt 20 7352253.316 ± 1333823.850 ops/s ByteBufUtilDecodeStringBenchmark.decodeStringOld UTF-8 true 64 thrpt 20 2825578.707 ± 349701.156 ops/s ByteBufUtilDecodeStringBenchmark.decodeStringOld UTF-16 false 8 thrpt 20 7277446.665 ± 1447034.346 ops/s ByteBufUtilDecodeStringBenchmark.decodeStringOld UTF-16 false 64 thrpt 20 2445929.579 ± 562816.641 ops/s ByteBufUtilDecodeStringBenchmark.decodeStringOld UTF-16 true 8 thrpt 20 6201174.401 ± 1236137.786 ops/s ByteBufUtilDecodeStringBenchmark.decodeStringOld UTF-16 true 64 thrpt 20 2310674.973 ± 525587.959 ops/s ByteBufUtilDecodeStringBenchmark.decodeStringOld ISO-8859-1 false 8 thrpt 20 11142625.392 ± 1680556.468 ops/s ByteBufUtilDecodeStringBenchmark.decodeStringOld ISO-8859-1 false 64 thrpt 20 8127116.405 ± 1128513.860 ops/s ByteBufUtilDecodeStringBenchmark.decodeStringOld ISO-8859-1 true 8 thrpt 20 9405751.952 ± 2193324.806 ops/s ByteBufUtilDecodeStringBenchmark.decodeStringOld ISO-8859-1 true 64 thrpt 20 3943282.076 ± 737798.070 ops/s Benchmark result is saved to /home/norman/mainframer/netty/microbench/target/reports/performance/ByteBufUtilDecodeStringBenchmark.json Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1,030.173 sec - in io.netty.buffer.ByteBufUtilDecodeStringBenchmark [1030.460s][info ][gc,heap,exit ] Heap [1030.460s][info ][gc,heap,exit ] garbage-first heap total 516096K, used 257918K [0x0000000609a00000, 0x0000000800000000) [1030.460s][info ][gc,heap,exit ] region size 2048K, 127 young (260096K), 2 survivors (4096K) [1030.460s][info ][gc,heap,exit ] Metaspace used 17123K, capacity 17438K, committed 17792K, reserved 1064960K [1030.460s][info ][gc,heap,exit ] class space used 1709K, capacity 1827K, committed 1920K, reserved 1048576K	2018-10-19 14:00:13 +02:00
Norman Maurer	e542a2cf26	Use a non-volatile read for ensureAccessible() whenever possible to reduce overhead and allow better inlining. (#8266 ) Motivation: At the moment whenever ensureAccessible() is called in our ByteBuf implementations (which is basically on each operation) we will do a volatile read. That per se is not such a bad thing, but the problem here is that it will also reduce the optimizations that the compiler / JIT can do. For example, as these are volatile it cannot eliminate multiple loads of it when inlining the methods of ByteBuf, which happens quite frequently because most of them are quite small and very hot. That is especially true for all the methods that act on primitives. It gets even worse as people often call a lot of these after each other in the same method or even use method chaining here. The idea of the change is basically to just use a non-volatile read for the ensureAccessible() check, as it's a best-effort implementation to detect acting on already released buffers anyway: even with a volatile read it could happen that the user releases the buffer in another thread before we actually access it after the reference check. Modifications: - Try to do a non-volatile read using sun.misc.Unsafe if we can use it. - Add a benchmark Result: Big performance win when multiple ByteBuf methods are called from a method. With the change: UnsafeByteBufBenchmark.setGetLongUnsafeByteBuf thrpt 20 281395842,128 ± 5050792,296 ops/s Before the change: UnsafeByteBufBenchmark.setGetLongUnsafeByteBuf thrpt 20 217419832,801 ± 5080579,030 ops/s	2018-09-07 07:47:02 +02:00
Norman Maurer	02d559e6a4	Remove flags when running benchmarks. (#8262 ) Motivation: Some of the flags we used are not supported anymore on more recent JDK versions. We should just remove all of them and only keep what we really need. This may also reflect better what people use in production. Modifications: Remove some flags when running the benchmarks. Result: Benchmarks also run with JDK11.	2018-09-05 19:05:02 +02:00
Carl Mastrangelo	379a56ca49	Add an Epoll benchmark Motivation: Optimizing the Epoll channel needs an objective measure of how fast it is. Modification: Add a simple, closed loop, ping-pong benchmark. Result: Benchmark can be used to measure #7816 Initial numbers: ``` Result "io.netty.microbench.channel.epoll.EpollSocketChannelBenchmark.pingPong": 22614.403 ±(99.9%) 797.263 ops/s [Average] (min, avg, max) = (21093.160, 22614.403, 24977.387), stdev = 918.130 CI (99.9%): [21817.140, 23411.666] (assumes normal distribution) Benchmark Mode Cnt Score Error Units EpollSocketChannelBenchmark.pingPong thrpt 20 22614.403 ± 797.263 ops/s ```	2018-09-04 10:15:15 +02:00
Francesco Nigro	c78be33443	Added configurable ByteBuf bounds checking (#7521 ) Motivation: The JVM isn't always able to hoist out or reduce bounds checking (due to ref counting operations etc.), hence making it configurable could improve performance for most CPU-intensive use cases. Modifications: Each AbstractByteBuf bounds check is now guarded by a new static final configuration property similar to checkAccessible, i.e. io.netty.buffer.bytebuf.checkBounds. Result: Any user can disable ByteBuf bounds checking in order to get extra performance.	2018-09-03 20:33:47 +02:00
Norman Maurer	83710cb2e1	Replace toArray(new T[size]) with toArray(new T[0]) to eliminate zero-out and allow the VM to optimize. (#8075 ) Motivation: Using toArray(new T[0]) is usually the faster approach these days. We should use it. See also https://shipilev.net/blog/2016/arrays-wisdom-ancients/#_conclusion. Modifications: Replace toArray(new T[size]) with toArray(new T[0]). Result: Faster code.	2018-06-29 07:56:04 +02:00
unknown	4a8d3a274c	Including the setup code in the benchmark method to avoid JMH Invocation level hiccups. Motivation: The usage of Invocation level for JMH fixture methods (setup/teardown) incurs a significant overhead in the benchmark time (see org.openjdk.jmh.annotations.Level documentation). In the case of CodecInputListBenchmark, benchmarks are far too small (less than 50ns) and the Invocation level setup offsets the measurement considerably. In such cases, the recommended fix is to include the setup/teardown code in the benchmark method. Modifications: Include the setup/teardown code in the relevant benchmark methods. Remove the setup/teardown methods from the benchmark class. Result: Running the entire benchmark 10 times with default parameters, we observed: - ArrayList benchmark affected directly by JMH overhead is now 15-80% faster. - CodecList benchmark is now 50% faster than original (even with the setup code being measured). - Recyclable ArrayList is ~30% slower. - All benchmarks have significantly different means (ANOVA) and medians (Moore) Mode: Throughput (Higher the better) Method Full params Factor Modified (Median) Original (Median) recyclableArrayList (elements = 1) 0.615520967 21719082.75 35285691.2 recyclableArrayList (elements = 4) 0.699553431 17149442.76 24514843.31 arrayList (elements = 4) 1.152666631 27120407.18 23528404.88 codecOutList (elements = 1) 1.527275908 67251089.04 44033359.47 codecOutList (elements = 4) 1.596917095 59174088.78 37055204.03 arrayList (elements = 1) 1.878616889 62188238.24 33103204.06 Environment: Tests run on a Computational server with CPU: E5-1660-3.3GHZ (6 cores + HT), 64 GB RAM.	2018-06-21 12:22:13 +02:00
unknown	cb420a9ffc	Including the setup code in the benchmark method to avoid JMH Invocation level hiccups. Motivation: The usage of Invocation level for JMH fixture methods (setup/teardown) incurs a significant impact on the benchmark time (see org.openjdk.jmh.annotations.Level documentation). When the benchmark and the setup/teardown are too small (less than a millisecond) the Invocation level might saturate the system with timestamp requests and iteration synchronizations which introduce artificial latency, throughput, and scalability bottlenecks. In the HeadersBenchmark, all benchmarks take less than 100ns and the Invocation level setup offsets the measurement considerably. As fixture methods are defined for the entire class, this overhead also impacts every single benchmark in this class, not only the ones that use the emptyHttpHeaders object (cleaned in the setup). The recommended fix here is to include the setup/teardown code in the benchmark where the object is used. Modifications: Include the setup/teardown code in the relevant benchmark methods. Remove the setup/teardown method of Invocation level from the benchmark class. Result: Running all benchmarks from HeadersBenchmark 10 times with default parameters, we observe: - Benchmarks that were not directly affected by the fix improved their execution time. For instance, http2Remove with (exampleHeader = THREE) had its median reported as 2x faster than the original version. - Benchmarks that had the setup code inserted (e.g. http2AddAllFastest) did not suffer a significant hit in execution time, as the benchmarks are not dominated by the clear(). Environment: Tests run on a Computational server with CPU: E5-1660-3.3GHZ (6 cores + HT), 64 GB RAM.	2018-06-21 12:21:19 +02:00
Scott Mitchell	9d51a40df0	Update NetUtilBenchmark (#7826 ) Motivation: NetUtilBenchmark is using out of date data, throws an exception in the benchmark, and allocates a Set on each run. Modifications: - Update the benchmark and reduce each run's overhead Result: NetUtilBenchmark is updated.	2018-03-31 08:27:08 +02:00
Francesco Nigro	ed46c4ed00	Copies from read-only heap ByteBuffer to direct ByteBuf can avoid stealth ByteBuf allocation and additional copies Motivation: A read-only heap ByteBuffer doesn't expose its array: the existing method to perform copies to a direct ByteBuf involves the creation of an additional (maybe pooled) heap ByteBuf instance and a copy. Modifications: To avoid stressing the allocator with additional (and stealth) heap ByteBuf allocations, a method is provided to perform copies using the (pooled) internal NIO buffer. Result: Copies from read-only heap ByteBuffer to direct ByteBuf won't create any intermediate ByteBuf	2018-02-27 09:54:21 +09:00
Julien Hoarau	3e6b54bb59	Fix failing h2spec tests 8.1.2.1 related to pseudo-headers validation Motivation: According to the spec: All pseudo-header fields MUST appear in the header block before regular header fields. Any request or response that contains a pseudo-header field that appears in a header block after a regular header field MUST be treated as malformed (Section 8.1.2.6). Pseudo-header fields are only valid in the context in which they are defined. Pseudo-header fields defined for requests MUST NOT appear in responses; pseudo-header fields defined for responses MUST NOT appear in requests. Pseudo-header fields MUST NOT appear in trailers. Endpoints MUST treat a request or response that contains undefined or invalid pseudo-header fields as malformed (Section 8.1.2.6). Clients MUST NOT accept a malformed response. Note that these requirements are intended to protect against several types of common attacks against HTTP; they are deliberately strict because being permissive can expose implementations to these vulnerabilities. Modifications: - Introduce validation in HPackDecoder Result: - Requests with unknown pseudo-header fields are rejected - Requests containing response-specific pseudo-headers are rejected - Requests where pseudo-headers appear after regular headers are rejected - h2spec 8.1.2.1 passes	2018-01-29 19:42:56 -08:00
Norman Maurer	4c1e0f596a	Use FastThreadLocal for CodecOutputList Motivation: We used Recycler for the CodecOutputList which is not optimized for the use-case of access only from the same Thread all the time. Modifications: - Use FastThreadLocal for CodecOutputList - Add benchmark Result: Less overhead in our codecs.	2018-01-23 11:34:28 +01:00
Francesco Nigro	1cf2687244	Fixed JMH ByteBuf benchmark to avoid dead code elimination Motivation: The JMH doc suggests using Blackholes to avoid dead code elimination, so it would be better to follow this best practice. Modifications: Each benchmark method now returns the ByteBuf/ByteBuffer to prevent the JVM from performing any dead code elimination. Result: The results are more reliable and comparable to those provided by other ByteBuf benchmarks (e.g. HeapByteBufBenchmark)	2017-12-19 14:09:18 +01:00
Scott Mitchell	55ef09f191	Add HttpObjectEncoderBenchmark Motivation: Benchmark to measure HttpObjectEncoder performance. Modifications: - Create new benchmark HttpObjectEncoderBenchmark Result: JMH Microbenchmark for HttpObjectEncoder.	2017-12-16 13:47:34 +01:00
Scott Mitchell	5f0342ebe0	Add RedisEncoderBenchmark Motivation: Add a benchmark to measure RedisEncoder's performance Modifications: - Add RedisEncoderBenchmark Result: JMH benchmark exists to measure RedisEncoder's performance.	2017-12-16 13:42:50 +01:00
Scott Mitchell	93b144b7b4	HttpMethod#valueOf improvement Motivation: HttpMethod#valueOf shows up on profiler results in the top set of results. Since it is a relatively simple operation it can be improved in isolation. Modifications: - Introduce a special case map which assigns each HttpMethod to a unique index in an array and provides constant time lookup from a hash code algorithm. When the bucket is matched we can then directly do equality comparison instead of potentially following a linked structure when HashMap has hash collisions. Result: ~10% improvement in benchmark results for HttpMethod#valueOf Benchmark Mode Cnt Score Error Units HttpMethodMapBenchmark.newMapKnownMethods thrpt 16 31.831 ± 0.928 ops/us HttpMethodMapBenchmark.newMapMixMethods thrpt 16 25.568 ± 0.400 ops/us HttpMethodMapBenchmark.newMapUnknownMethods thrpt 16 51.413 ± 1.824 ops/us HttpMethodMapBenchmark.oldMapKnownMethods thrpt 16 29.226 ± 0.330 ops/us HttpMethodMapBenchmark.oldMapMixMethods thrpt 16 21.073 ± 0.247 ops/us HttpMethodMapBenchmark.oldMapUnknownMethods thrpt 16 49.081 ± 0.577 ops/us	2017-11-20 11:07:50 -08:00
Scott Mitchell	e6126215e0	DefaultHttp2FrameWriter reduce object allocation Motivation: DefaultHttp2FrameWriter#writeData allocates a DataFrameHeader for each write operation. DataFrameHeader maintains internal state and allocates multiple slices of a buffer which is a maximum of 30 bytes. This 30 byte buffer may not always be necessary and the additional slice operations can utilize retainedSlice to take advantage of pooled objects. We can also save computation and object allocations if there is no padding which is a common case in practice. Modifications: - Remove DataFrameHeader - Add a fast path for padding == 0 Result: Less object allocation in DefaultHttp2FrameWriter	2017-11-20 08:10:59 -08:00
Anuraag Agrawal	1f1a60ae7d	Use Netty's DefaultPriorityQueue instead of JDK's PriorityQueue for scheduled tasks Motivation: `AbstractScheduledEventExecutor` uses a standard `java.util.PriorityQueue` to keep track of task deadlines. `ScheduledFuture.cancel` removes tasks from this `PriorityQueue`. Unfortunately, `PriorityQueue.remove` has `O(n)` performance since it must search for the item in the entire queue before removing it. This is fast when the future is at the front of the queue (e.g., already triggered) but not when it's randomly located in the queue. Many servers will use `ScheduledFuture.cancel` on all requests, e.g., to manage a request timeout. As these cancellations will happen in arbitrary order, when there are many scheduled futures, `PriorityQueue.remove` is a bottleneck and greatly hurts performance with many concurrent requests (>10K). Modification: Use netty's `DefaultPriorityQueue` for scheduling futures instead of the JDK. `DefaultPriorityQueue` is almost identical to the JDK version except it is able to remove futures without searching for them in the queue. This means `DefaultPriorityQueue.remove` has `O(log n)` performance. 
Result: Before - cancelling futures has varying performance, capped at `O(n)` After - cancelling futures has stable performance, capped at `O(log n)` Benchmark results After - cancelling in order and in reverse order have similar performance within `O(log n)` bounds ``` Benchmark (num) Mode Cnt Score Error Units ScheduledFutureTaskBenchmark.cancelInOrder 100 thrpt 20 137779.616 ± 7709.751 ops/s ScheduledFutureTaskBenchmark.cancelInOrder 1000 thrpt 20 11049.448 ± 385.832 ops/s ScheduledFutureTaskBenchmark.cancelInOrder 10000 thrpt 20 943.294 ± 12.391 ops/s ScheduledFutureTaskBenchmark.cancelInOrder 100000 thrpt 20 64.210 ± 1.824 ops/s ScheduledFutureTaskBenchmark.cancelInReverseOrder 100 thrpt 20 167531.096 ± 9187.865 ops/s ScheduledFutureTaskBenchmark.cancelInReverseOrder 1000 thrpt 20 33019.786 ± 4737.770 ops/s ScheduledFutureTaskBenchmark.cancelInReverseOrder 10000 thrpt 20 2976.955 ± 248.555 ops/s ScheduledFutureTaskBenchmark.cancelInReverseOrder 100000 thrpt 20 362.654 ± 45.716 ops/s ``` Before - cancelling in order and in reverse order have significantly different performance at higher queue size, orders of magnitude worse than the new implementation. ``` Benchmark (num) Mode Cnt Score Error Units ScheduledFutureTaskBenchmark.cancelInOrder 100 thrpt 20 139968.586 ± 12951.333 ops/s ScheduledFutureTaskBenchmark.cancelInOrder 1000 thrpt 20 12274.420 ± 337.800 ops/s ScheduledFutureTaskBenchmark.cancelInOrder 10000 thrpt 20 958.168 ± 15.350 ops/s ScheduledFutureTaskBenchmark.cancelInOrder 100000 thrpt 20 53.381 ± 13.981 ops/s ScheduledFutureTaskBenchmark.cancelInReverseOrder 100 thrpt 20 123918.829 ± 3642.517 ops/s ScheduledFutureTaskBenchmark.cancelInReverseOrder 1000 thrpt 20 5099.810 ± 206.992 ops/s ScheduledFutureTaskBenchmark.cancelInReverseOrder 10000 thrpt 20 72.335 ± 0.443 ops/s ScheduledFutureTaskBenchmark.cancelInReverseOrder 100000 thrpt 20 0.743 ± 0.003 ops/s ```	2017-11-10 23:09:32 -08:00
Carl Mastrangelo	83a19d5650	Optimistically update ref counts Motivation: Highly retained and released objects have contention on their ref count. Currently, the ref count is updated using compareAndSet with care to make sure the count doesn't overflow, double free, or revive the object. Profiling has shown that a non-trivial share (~1%) of CPU time on gRPC latency benchmarks is from the ref count updating. Modification: Rather than pessimistically assuming the ref count will be invalid, optimistically update it assuming it will be valid. If the update was wrong, then use the slow path to revert the change and throw an exception. Most of the time, the ref counts are correct. This changes from using compareAndSet to getAndAdd, which emits a different CPU instruction on x86 (CMPXCHG to XADD). Because the CPU knows it will modify the memory, it can avoid contention. On a highly contended machine, this can be about 2x faster. There is a downside to the new approach. The ref counters can temporarily enter invalid states if over retained or over released. The code does handle these overflow and underflow scenarios, but it is possible that another concurrent access may push the failure to a different location. For example: Time 1 Thread 1: obj.retain(INT_MAX - 1) Time 2 Thread 1: obj.retain(2) Time 2 Thread 2: obj.retain(1) Previously Thread 2 would always succeed and Thread 1 would always fail on the second access. Now, thread 2 could fail while thread 1 is rolling back its change. ==== There are a few reasons why I think this is okay: 1. Buggy code is going to have bugs. An exception _is_ going to be thrown. This just causes the other threads to notice the state is messed up and stop early. 2. If high retention counts are a use case, then ref count should be a long rather than an int. 3. The critical section is greatly reduced compared to the previous version, so the likelihood of this happening is lower 4. 
On error, the code always rolls back the change atomically, so there is no possibility of corruption. Result: Faster refcounting ``` BEFORE: Benchmark (delay) Mode Cnt Score Error Units AbstractReferenceCountedByteBufBenchmark.retainRelease_contended 1 sample 2901361 804.579 ± 1.835 ns/op AbstractReferenceCountedByteBufBenchmark.retainRelease_contended 10 sample 3038729 785.376 ± 16.471 ns/op AbstractReferenceCountedByteBufBenchmark.retainRelease_contended 100 sample 2899401 817.392 ± 6.668 ns/op AbstractReferenceCountedByteBufBenchmark.retainRelease_contended 1000 sample 3650566 2077.700 ± 0.600 ns/op AbstractReferenceCountedByteBufBenchmark.retainRelease_contended 10000 sample 3005467 19949.334 ± 4.243 ns/op AbstractReferenceCountedByteBufBenchmark.retainRelease_uncontended 1 sample 456091 48.610 ± 1.162 ns/op AbstractReferenceCountedByteBufBenchmark.retainRelease_uncontended 10 sample 732051 62.599 ± 0.815 ns/op AbstractReferenceCountedByteBufBenchmark.retainRelease_uncontended 100 sample 778925 228.629 ± 1.205 ns/op AbstractReferenceCountedByteBufBenchmark.retainRelease_uncontended 1000 sample 633682 2002.987 ± 2.856 ns/op AbstractReferenceCountedByteBufBenchmark.retainRelease_uncontended 10000 sample 506442 19735.345 ± 12.312 ns/op AFTER: Benchmark (delay) Mode Cnt Score Error Units AbstractReferenceCountedByteBufBenchmark.retainRelease_contended 1 sample 3761980 383.436 ± 1.315 ns/op AbstractReferenceCountedByteBufBenchmark.retainRelease_contended 10 sample 3667304 474.429 ± 1.101 ns/op AbstractReferenceCountedByteBufBenchmark.retainRelease_contended 100 sample 3039374 479.267 ± 0.435 ns/op AbstractReferenceCountedByteBufBenchmark.retainRelease_contended 1000 sample 3709210 2044.603 ± 0.989 ns/op AbstractReferenceCountedByteBufBenchmark.retainRelease_contended 10000 sample 3011591 19904.227 ± 18.025 ns/op AbstractReferenceCountedByteBufBenchmark.retainRelease_uncontended 1 sample 494975 52.269 ± 8.345 ns/op 
AbstractReferenceCountedByteBufBenchmark.retainRelease_uncontended 10 sample 771094 62.290 ± 0.795 ns/op AbstractReferenceCountedByteBufBenchmark.retainRelease_uncontended 100 sample 763230 235.044 ± 1.552 ns/op AbstractReferenceCountedByteBufBenchmark.retainRelease_uncontended 1000 sample 634037 2006.578 ± 3.574 ns/op AbstractReferenceCountedByteBufBenchmark.retainRelease_uncontended 10000 sample 506284 19742.605 ± 13.729 ns/op ```	2017-10-04 08:42:33 +02:00
Norman Maurer	3c8c7fc7e9	Reduce performance overhead of ResourceLeakDetector Motivation: The ResourceLeakDetector helps to detect and troubleshoot resource leaks and is often used even in production environments with a low level. Because of this it's important that we try to keep the overhead as low as possible. Most of the time no leak is detected (as all is correctly handled), so we should keep the overhead for this case as low as possible. Modifications: - Only call getStackTrace() if a leak is reported as it is a very expensive native call. Also handle the filtering and creating of the String in a lazy fashion - Remove the need to maintain a Queue to store the last access records - Add benchmark Result: Huge decrease of performance overhead. Before the patch: Benchmark (recordTimes) Mode Cnt Score Error Units ResourceLeakDetectorRecordBenchmark.record 8 thrpt 20 4358.367 ± 116.419 ops/s ResourceLeakDetectorRecordBenchmark.record 16 thrpt 20 2306.027 ± 55.044 ops/s ResourceLeakDetectorRecordBenchmark.recordWithHint 8 thrpt 20 4220.979 ± 114.046 ops/s ResourceLeakDetectorRecordBenchmark.recordWithHint 16 thrpt 20 2250.734 ± 55.352 ops/s With this patch: Benchmark (recordTimes) Mode Cnt Score Error Units ResourceLeakDetectorRecordBenchmark.record 8 thrpt 20 71398.957 ± 2695.925 ops/s ResourceLeakDetectorRecordBenchmark.record 16 thrpt 20 38643.963 ± 1446.694 ops/s ResourceLeakDetectorRecordBenchmark.recordWithHint 8 thrpt 20 71677.882 ± 2923.622 ops/s ResourceLeakDetectorRecordBenchmark.recordWithHint 16 thrpt 20 38660.176 ± 1467.732 ops/s	2017-09-18 16:36:19 -07:00
Nikolay Fedorovskikh	df568c739e	Use ByteBuf#writeShort/writeMedium instead of writeBytes Motivation: 1. Some encoders used `ByteBuf#writeBytes` to write a short constant byte array (2-3 bytes). This can be replaced with the faster `ByteBuf#writeShort` or `ByteBuf#writeMedium`, which do not access the memory. 2. Two chained calls of `ByteBuf#setByte` with constants can be replaced with one `ByteBuf#setShort` to reduce index checks. 3. The signature of method `HttpHeadersEncoder#encoderHeader` has an unnecessary `throws`. Modifications: 1. Use `ByteBuf#writeShort` or `ByteBuf#writeMedium` instead of `ByteBuf#writeBytes` for the constants. 2. Use `ByteBuf#setShort` instead of chained calls of `ByteBuf#setByte` with constants. 3. Remove an unnecessary `throws` from `HttpHeadersEncoder#encoderHeader`. Result: Slightly faster writes of constants into buffers.	2017-07-10 14:37:41 +02:00
Dmitriy Dumanskiy	dd69a813d4	Performance improvement for HttpRequestEncoder. Insert char into the string optimized. Motivation: Right now HttpRequestEncoder inserts a slash before the question mark for URLs like http://localhost?pararm=1. This is done inefficiently. Modification: Code: new StringBuilder(len + 1) .append(uri, 0, index) .append(SLASH) .append(uri, index, len) .toString(); Replaced with: new StringBuilder(uri) .insert(index, SLASH) .toString(); Result: Faster HttpRequestEncoder. Additional small test. Attached benchmark in PR. Benchmark Mode Cnt Score Error Units HttpRequestEncoderInsertBenchmark.newEncoder thrpt 40 3704843.303 ± 98950.919 ops/s HttpRequestEncoderInsertBenchmark.oldEncoder thrpt 40 3284236.960 ± 134433.217 ops/s	2017-06-27 10:53:43 +02:00
Nikolay Fedorovskikh	aa38b6a769	Prevent unnecessary allocations in the `StringUtil#escapeCsv` Motivation: `StringUtil#escapeCsv` creates a new `StringBuilder` for each value even if the same string is returned in the end. Modifications: Create a new `StringBuilder` only if it is really needed. Otherwise, return the original string (or just the trimmed substring). Result: Less GC load. Up to 4x faster for unchanged strings.	2017-06-13 14:57:38 -07:00
Dmitriy Dumanskiy	acc07fac32	disabling leak detection micro benchmark Motivation: When I run Netty micro benchmarks I get many warnings like: WARNING: -Dio.netty.noResourceLeakDetection is deprecated. Use '-Dio.netty.leakDetection.level=simple' instead. Modification: -Dio.netty.noResourceLeakDetection replaced with -Dio.netty.leakDetection.level=disabled. Result: No warnings.	2017-06-09 18:03:54 +02:00
Nikolay Fedorovskikh	e4531918a3	Optimizations in NetUtil Motivation: IPv4/6 validation methods use allocations, which can be avoided. The IPv4 parse method uses StringTokenizer. Modifications: Rewrite IPv4/6 validation methods to avoid allocations. Rewrite the IPv4 parse method without using StringTokenizer. Result: IPv4/6 validation and IPv4 parsing are up to 2-10x faster.	2017-05-18 16:42:22 -07:00
Nikolay Fedorovskikh	0692bf1b6a	fix the typos	2017-04-20 04:56:09 +02:00
Norman Maurer	e482d933f7	Add 'io.netty.tryAllocateUninitializedArray' system property which allows to allocate byte[] without memset in Java9+ Motivation: Java9 added a new method to Unsafe which allows allocating a byte[] without zeroing it via memset. This can have a massive impact on allocation times when the byte[] is big. This change allows enabling this via the io.netty.tryAllocateUninitializedArray property when running Java9+. Please note that you will need to open up the jdk.internal.misc package via '--add-opens java.base/jdk.internal.misc=ALL-UNNAMED' as well. Modifications: Allow to allocate byte[] without memset on Java9+ Result: Better performance when allocating big heap buffers and using Java9.	2017-04-19 11:45:39 +02:00
Ade Setyawan Sajim	016629fe3b	Replace system.out.println with InternalLoggerFactory Motivation: There are two files that still use `system.out.println` to log their status Modification: Replace `system.out.println` with a `debug` function inside an instance of `InternalLoggerFactory` Result: Introduce an instance of `InternalLoggerFactory` in class `AbstractMicrobenchmark.java` and `AbstractSharedExecutorMicrobenchmark.java`	2017-03-28 14:51:59 +02:00
Scott Mitchell	743d2d374c	SslHandler benchmark and SslEngine multiple packets benchmark Motivation: We currently don't have a benchmark which includes SslHandler. The SslEngine benchmarks also always include a single TLS packet when encoding/decoding. In practice when reading data from the network there may be multiple TLS packets present and we should expand the benchmarks to understand this use case. Modifications: - SslEngine benchmarks should include wrapping/unwrapping of multiple TLS packets - Introduce SslHandler benchmarks which can also account for wrapping/unwrapping of multiple TLS packets Result: SslHandler and SslEngine benchmarks are more comprehensive.	2017-03-06 08:42:39 -08:00
Scott Mitchell	f9001b9fc0	HTTP/2 move internal HPACK classes to the http2 package Motivation: The internal.hpack classes are no longer exposed in our public APIs and can be made package private in the http2 package. Modifications: - Make the hpack classes package private in the http2 package Result: Less APIs exposed as public.	2017-03-02 07:42:41 -08:00
Norman Maurer	461f9a1212	Allow obtaining information about used direct and heap memory for ByteBufAllocator implementations Motivation: Often it's useful for the user to be able to get some stats about the memory allocated via an allocator. Modifications: - Allow obtaining the used heap and direct memory for an allocator - Add test case Result: Fixes [#6341]	2017-03-01 18:53:43 +01:00
Norman Maurer	90a61046c7	Add benchmarks for UnpooledUnsafeNoCleanerDirectByteBuf vs UnpooledUnsafeDirectByteBuf Motivation: Issue [#6349] brought up the idea to not use UnpooledUnsafeNoCleanerDirectByteBuf by default. To decide what to do a benchmark is needed. Modifications: Add benchmarks for UnpooledUnsafeNoCleanerDirectByteBuf vs UnpooledUnsafeDirectByteBuf Result: Better idea about impact of using UnpooledUnsafeNoCleanerDirectByteBuf.	2017-02-27 20:04:09 +01:00
Norman Maurer	d73477c7bd	Add benchmarks for SSLEngine implementations Motivation: As we provide our own SSLEngine implementation we should have benchmarks to compare it against JDK impl. Modifications: Add benchmarks for wrap / unwrap and handshake performance. Result: Benchmarks FTW.	2017-02-24 08:02:10 +01:00
Norman Maurer	a80d3411ee	Move all the microbenchmark code into one directory. Motivation: Almost all our benchmarks are in src/main/java but a few are in src/test/java. We should make it consistent. Modifications: Move everything to src/main/java Result: Consistent code base.	2017-02-23 19:59:09 +01:00
Nikolay Fedorovskikh	0623c6c533	Fix javadoc issues Motivation: Invalid javadoc in project Modifications: Fix it Result: More correct javadoc	2017-02-22 07:31:07 +01:00
Nikolay Fedorovskikh	634a8afa53	Fix some warnings at generics usage Motivation: Existing warnings from java compiler Modifications: Add/fix type parameters Result: Less warnings	2017-02-22 07:29:59 +01:00
Kiril Menshikov	66b9be3a46	Allow to align allocated Buffers Motivation: 64-byte alignment is recommended by the Intel performance guide (https://software.intel.com/en-us/articles/practical-intel-avx-optimization-on-2nd-generation-intel-core-processors) for data-structures over 64 bytes. Requiring padding to a multiple of 64 bytes allows for using SIMD instructions consistently in loops without additional conditional checks. This should allow for simpler and more efficient code. Modification: At the moment cache alignment must be set up manually, but it could probably be taken from the system. The original code was introduced by @normanmaurer https://github.com/netty/netty/pull/4726/files Result: Buffer alignment works better than a misaligned cache.	2017-02-06 07:58:29 +01:00
Scott Mitchell	3482651e0c	HTTP/2 Non Active Stream RFC Corrections Motivation: codec-http2 couples the dependency tree state with the remainder of the stream state (Http2Stream). This makes implementing constraints where stream state and dependency tree state diverge in the RFC challenging. For example the RFC recommends retaining dependency tree state after a stream transitions to closed [1]. Dependency tree state can be exchanged on streams in IDLE. In practice clients may use stream IDs for the purpose of establishing QoS classes and therefore retaining this dependency tree state can be important to client perceived performance. It is difficult to limit the total amount of state we retain when stream state and dependency tree state is combined. Modifications: - Remove dependency tree, priority, and weight related items from public facing Http2Connection and Http2Stream APIs. This information is optional to track and depends on the flow controller implementation. - Move all dependency tree, priority, and weight related code from DefaultHttp2Connection to WeightedFairQueueByteDistributor. This is currently the only place which cares about priority. We can pull out the dependency tree related code in the future if it is generally useful to expose for other implementations. - DefaultHttp2Connection should explicitly limit the number of reserved streams now that IDLE streams are no longer created. Result: More compliant with the HTTP/2 RFC. Fixes https://github.com/netty/netty/issues/6206. [1] https://tools.ietf.org/html/rfc7540#section-5.3.4	2017-02-01 10:34:27 -08:00
Scott Mitchell	e13da218e9	HTTP/2 revert Http2FrameWriter throws API change Motivation: `2fd42cfc6b` fixed a bug related to encoding headers but it also introduced a throws statement onto the Http2FrameWriter methods which write headers. This throws statement makes the API more verbose and is not necessary because we can communicate the failure in the ChannelFuture that is returned by these methods. Modifications: - Remove throws from all Http2FrameWriter methods. Result: Http2FrameWriter APIs do not propagate checked exceptions.	2017-01-26 23:26:17 -08:00
Tim Brooks	3344cd21ac	Wrap operations requiring SocketPermission with doPrivileged blocks Motivation: Currently Netty does not wrap socket connect, bind, or accept operations in doPrivileged blocks. Nor does it wrap cases where a dns lookup might happen. This prevents an application utilizing the SecurityManager from isolating SocketPermissions to Netty. Modifications: I have introduced a class (SocketUtils) that wraps operations requiring SocketPermissions in doPrivileged blocks. Result: A user of Netty can grant SocketPermissions explicitly to the Netty jar, without granting it to the rest of their application.	2017-01-19 21:12:52 +01:00
Scott Mitchell	2fd42cfc6b	HTTP/2 Max Header List Size Bug Motivation: If the HPACK Decoder detects that SETTINGS_MAX_HEADER_LIST_SIZE has been violated it aborts immediately and sends a RST_STREAM frame for whatever stream caused the issue. Because HPACK is stateful this means that the HPACK state may become out of sync between peers, and the issue won't be detected until the next headers frame. We should make a best effort to keep processing to keep the HPACK state in sync with our peer, or completely close the connection. If the HPACK Encoder is configured to verify SETTINGS_MAX_HEADER_LIST_SIZE it checks the limit and encodes at the same time. This may result in modifying the HPACK local state but not sending the headers to the peer if SETTINGS_MAX_HEADER_LIST_SIZE is violated. This will also lead to an inconsistency in HPACK state that will be flagged at some later time. Modifications: - HPACK Decoder now has 2 levels of limits related to SETTINGS_MAX_HEADER_LIST_SIZE. The first will attempt to keep processing data and send a RST_STREAM after all data is processed. The second will send a GO_AWAY and close the entire connection. - When the HPACK Encoder enforces SETTINGS_MAX_HEADER_LIST_SIZE it should not modify the HPACK state until the size has been checked. - https://tools.ietf.org/html/rfc7540#section-6.5.2 states that the initial value of SETTINGS_MAX_HEADER_LIST_SIZE is "unlimited". We currently use 8k as a limit. We should honor the specification's default value so we don't unintentionally close a connection before the remote peer is aware of the local settings. - Remove unnecessary object allocation in DefaultHttp2HeadersDecoder and DefaultHttp2HeadersEncoder. Result: Fixes https://github.com/netty/netty/issues/6209.	2017-01-19 10:42:43 -08:00
Scott Mitchell	b701da8d1c	HTTP/2 HPACK Integer Encoding Bugs Motivation: - Decoder#decodeULE128 has a bounds bug and cannot decode Integer.MAX_VALUE - Decoder#decodeULE128 doesn't support values greater than can be represented with Java's int data type. This is a problem because there are cases that require at least unsigned 32 bits (max header table size). - Decoder#decodeULE128 treats overflowing the data type and invalid input the same. This can be misleading when inspecting the error that is thrown. - Encoder#encodeInteger doesn't support values greater than can be represented with Java's int data type. This is a problem because there are cases that require at least unsigned 32 bits (max header table size). Modifications: - Correct the above issues and add unit tests. Result: Fixes https://github.com/netty/netty/issues/6210.	2017-01-18 18:36:47 -08:00
Scott Mitchell	06e7627b5f	Read Only Http2Headers Motivation: A read only implementation of Http2Headers can allow for a more efficient usage of memory and more performant combined construction and iteration during serialization. Modifications: - Add a new ReadOnlyHttp2Headers class Result: ReadOnlyHttp2Headers exists and can be used for performance reasons when appropriate. ``` Benchmark (headerCount) Mode Cnt Score Error Units ReadOnlyHttp2HeadersBenchmark.defaultClientHeaders 1 avgt 20 96.156 ± 1.902 ns/op ReadOnlyHttp2HeadersBenchmark.defaultClientHeaders 5 avgt 20 157.925 ± 3.847 ns/op ReadOnlyHttp2HeadersBenchmark.defaultClientHeaders 10 avgt 20 236.257 ± 2.663 ns/op ReadOnlyHttp2HeadersBenchmark.defaultClientHeaders 20 avgt 20 392.861 ± 3.932 ns/op ReadOnlyHttp2HeadersBenchmark.defaultServerHeaders 1 avgt 20 48.759 ± 0.466 ns/op ReadOnlyHttp2HeadersBenchmark.defaultServerHeaders 5 avgt 20 113.122 ± 0.948 ns/op ReadOnlyHttp2HeadersBenchmark.defaultServerHeaders 10 avgt 20 192.698 ± 1.936 ns/op ReadOnlyHttp2HeadersBenchmark.defaultServerHeaders 20 avgt 20 348.974 ± 3.111 ns/op ReadOnlyHttp2HeadersBenchmark.defaultTrailers 1 avgt 20 35.694 ± 0.271 ns/op ReadOnlyHttp2HeadersBenchmark.defaultTrailers 5 avgt 20 98.993 ± 2.933 ns/op ReadOnlyHttp2HeadersBenchmark.defaultTrailers 10 avgt 20 171.035 ± 5.068 ns/op ReadOnlyHttp2HeadersBenchmark.defaultTrailers 20 avgt 20 330.621 ± 3.381 ns/op ReadOnlyHttp2HeadersBenchmark.readOnlyClientHeaders 1 avgt 20 40.573 ± 0.474 ns/op ReadOnlyHttp2HeadersBenchmark.readOnlyClientHeaders 5 avgt 20 56.516 ± 0.660 ns/op ReadOnlyHttp2HeadersBenchmark.readOnlyClientHeaders 10 avgt 20 76.890 ± 0.776 ns/op ReadOnlyHttp2HeadersBenchmark.readOnlyClientHeaders 20 avgt 20 117.531 ± 1.393 ns/op ReadOnlyHttp2HeadersBenchmark.readOnlyServerHeaders 1 avgt 20 29.206 ± 0.264 ns/op ReadOnlyHttp2HeadersBenchmark.readOnlyServerHeaders 5 avgt 20 44.587 ± 0.312 ns/op ReadOnlyHttp2HeadersBenchmark.readOnlyServerHeaders 10 avgt 20 64.458 ± 1.169 ns/op 
ReadOnlyHttp2HeadersBenchmark.readOnlyServerHeaders 20 avgt 20 107.179 ± 0.881 ns/op ReadOnlyHttp2HeadersBenchmark.readOnlyTrailers 1 avgt 20 21.563 ± 0.202 ns/op ReadOnlyHttp2HeadersBenchmark.readOnlyTrailers 5 avgt 20 41.019 ± 0.440 ns/op ReadOnlyHttp2HeadersBenchmark.readOnlyTrailers 10 avgt 20 64.053 ± 0.785 ns/op ReadOnlyHttp2HeadersBenchmark.readOnlyTrailers 20 avgt 20 113.737 ± 4.433 ns/op ```	2016-12-18 09:32:24 -08:00
Stephane Landelle	f755e58463	Clean up following #6016 Motivation: * DefaultHeaders from netty-codec has some duplicated logic for header date parsing * Several classes keep on using deprecated HttpHeaderDateFormat Modifications: * Move HttpHeaderDateFormatter to netty-codec and rename it into HeaderDateFormatter * Make DefaultHeaders use HeaderDateFormatter * Replace HttpHeaderDateFormat usage with HeaderDateFormatter Result: Faster and more consistent code	2016-11-21 12:35:40 -08:00
Stephane Landelle	edc4842309	Fix cookie date parsing, close #6016 Motivation: * RFC6265 defines its own parser which is different from RFC1123 (it accepts RFC1123 format but also other ones). Basically, it's very lax on delimiters, ignores day of week and timezone. Currently, ClientCookieDecoder uses HttpHeaderDateFormat underneath, and can't parse valid cookies such as Github ones whose expires attribute looks like "Sun, 27 Nov 2016 19:37:15 -0000" * ServerSideCookieEncoder currently uses HttpHeaderDateFormat underneath for formatting expires field, and it's slow. Modifications: * Introduce HttpHeaderDateFormatter that correctly implements RFC6265 * Use HttpHeaderDateFormatter in ClientCookieDecoder and ServerCookieEncoder * Deprecate HttpHeaderDateFormat Result: * Proper RFC6265 dates support * Faster ServerCookieEncoder and ClientCookieDecoder * Faster tool for handling headers such as "Expires" and "Date"	2016-11-18 11:22:21 +00:00
buchgr	c9918de37b	http2: Make MAX_HEADER_LIST_SIZE exceeded a stream error when encoding. Motivation: The SETTINGS_MAX_HEADER_LIST_SIZE limit, as enforced by the HPACK Encoder, should be a stream error and not apply to the whole connection. Modifications: Made the necessary changes for the exception to be of type StreamException. Result: A HEADERS frame exceeding the limit, only affects a specific stream.	2016-10-17 09:24:06 -07:00
Scott Mitchell	540c26bb56	HTTP/2 Ensure default settings are correctly enforced and interfaces clarified Motivation: The responsibility for retaining the settings values and enforcing the settings constraints is spread out in different areas of the code and may be initialized with different values than the default specified in the RFC. This should not be allowed by default and interfaces which are responsible for maintaining/enforcing settings state should clearly indicate the restrictions that they should only be set by the codec upon receipt of a SETTINGS ACK frame. Modifications: - Encoder, Decoder, and the Headers Encoder/Decoder no longer expose public constructors that allow the default settings to be changed. - Http2HeadersDecoder#maxHeaderSize() exists to provide some bound when headers/continuation frames are being aggregated. However this is roughly the same as SETTINGS_MAX_HEADER_LIST_SIZE (besides the 32 octet overhead for each header field) and can be used instead of attempting to keep the two independent values in sync. - Encoding headers now enforces SETTINGS_MAX_HEADER_LIST_SIZE at the octet level. Previously the header encoder compared the number of header key/value pairs against SETTINGS_MAX_HEADER_LIST_SIZE instead of the number of octets (plus 32 bytes overhead). - DefaultHttp2ConnectionDecoder#onData calls shouldIgnoreHeadersOrDataFrame but may swallow exceptions from this method. This means a RST_STREAM frame may not be sent when it should for an unknown stream and thus violate the RFC. The exception is no longer swallowed. Result: Default settings state is enforced and interfaces related to settings state are clarified.	2016-10-07 13:00:45 -07:00
radai-rosenblatt	15ac6c4a1f	Clean-up unused imports Motivation: the build doesn't seem to enforce this, so they piled up Modifications: removed unused import lines Result: fewer unused imports Signed-off-by: radai-rosenblatt <radai.rosenblatt@gmail.com>	2016-09-30 09:08:50 +02:00
buchgr	67d3a78123	Reduce bytecode size of PlatformDependent0.equals. Motivation: PP0.equals has a bytecode size of 476. This is above the default inlining threshold of OpenJDK. Modifications: Slightly change the method to reduce the bytecode size by > 50% to 212 bytes. Result: The bytecode size is dramatically reduced, making the method a candidate for inlining. The relevant code in our application (gRPC) that relies heavily on equals comparisons, runs some ~10% faster. The Netty JMH benchmark shows no performance regression. Current 4.1: PlatformDependentBenchmark.unsafeBytesEqual 10 avgt 20 7.836 ± 0.113 ns/op PlatformDependentBenchmark.unsafeBytesEqual 50 avgt 20 16.889 ± 4.284 ns/op PlatformDependentBenchmark.unsafeBytesEqual 100 avgt 20 15.601 ± 0.296 ns/op PlatformDependentBenchmark.unsafeBytesEqual 1000 avgt 20 95.885 ± 1.992 ns/op PlatformDependentBenchmark.unsafeBytesEqual 10000 avgt 20 824.429 ± 12.792 ns/op PlatformDependentBenchmark.unsafeBytesEqual 100000 avgt 20 8907.035 ± 177.844 ns/op With this change: PlatformDependentBenchmark.unsafeBytesEqual 10 avgt 20 5.616 ± 0.102 ns/op PlatformDependentBenchmark.unsafeBytesEqual 50 avgt 20 17.896 ± 0.373 ns/op PlatformDependentBenchmark.unsafeBytesEqual 100 avgt 20 14.952 ± 0.210 ns/op PlatformDependentBenchmark.unsafeBytesEqual 1000 avgt 20 94.799 ± 1.604 ns/op PlatformDependentBenchmark.unsafeBytesEqual 10000 avgt 20 834.996 ± 17.484 ns/op PlatformDependentBenchmark.unsafeBytesEqual 100000 avgt 20 8757.421 ± 187.555 ns/op	2016-09-09 07:57:41 +02:00

97 Commits