netty5

Author	SHA1	Message	Date
root	7cf69022d4	[maven-release-plugin] prepare release netty-4.1.41.Final	2019-09-12 16:09:00 +00:00
root	aef47bec7f	[maven-release-plugin] prepare for next development iteration	2019-09-12 05:38:11 +00:00
root	267e5da481	[maven-release-plugin] prepare release netty-4.1.40.Final	2019-09-12 05:37:30 +00:00
root	d45a4ce01b	[maven-release-plugin] prepare for next development iteration	2019-08-13 17:16:42 +00:00
root	88c2a4cab5	[maven-release-plugin] prepare release netty-4.1.39.Final	2019-08-13 17:15:20 +00:00
root	718b7626e6	[maven-release-plugin] prepare for next development iteration	2019-07-24 09:05:57 +00:00
root	465c900c04	[maven-release-plugin] prepare release netty-4.1.38.Final	2019-07-24 09:05:23 +00:00
jingene	c0f9364870	Change the netty.io homepage scheme(http -> https) (#9344 ) Motivation: Netty homepage(netty.io) serves both "http" and "https". It's recommended to use https rather than http. Modification: I changed from "http://netty.io" to "https://netty.io" Result: No effects.	2019-07-09 21:09:42 +02:00
Norman Maurer	6da809dc11	Increase maxHeaderListSize for HpackDecoderBenchmark to be able to be… (#9321 ) Motivation: The previous used maxHeaderListSize was too low which resulted in exceptions during the benchmark run: ``` io.netty.handler.codec.http2.Http2Exception: Header size exceeded max allowed size (8192) at io.netty.handler.codec.http2.Http2Exception.connectionError(Http2Exception.java:103) at io.netty.handler.codec.http2.Http2Exception.headerListSizeError(Http2Exception.java:188) at io.netty.handler.codec.http2.Http2CodecUtil.headerListSizeExceeded(Http2CodecUtil.java:231) at io.netty.handler.codec.http2.HpackDecoder$Http2HeadersSink.finish(HpackDecoder.java:545) at io.netty.handler.codec.http2.HpackDecoder.decode(HpackDecoder.java:132) at io.netty.handler.codec.http2.HpackDecoderBenchmark.decode(HpackDecoderBenchmark.java:85) at io.netty.handler.codec.http2.generated.HpackDecoderBenchmark_decode_jmhTest.decode_thrpt_jmhStub(HpackDecoderBenchmark_decode_jmhTest.java:120) at io.netty.handler.codec.http2.generated.HpackDecoderBenchmark_decode_jmhTest.decode_Throughput(HpackDecoderBenchmark_decode_jmhTest.java:83) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.openjdk.jmh.runner.BenchmarkHandler$BenchmarkTask.call(BenchmarkHandler.java:453) at org.openjdk.jmh.runner.BenchmarkHandler$BenchmarkTask.call(BenchmarkHandler.java:437) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) at java.lang.Thread.run(Thread.java:748) ``` Also we should ensure we only use ascii for header names. Modifications: Just use Integer.MAX_VALUE as limit Result: Be able to run benchmark without exceptions	2019-07-04 11:24:13 +02:00
Carl Mastrangelo	ff0045e3e1	Use Table lookup for HPACK decoder (#9307 ) Motivation: Table based decoding is fast. Modification: Use table based decoding in HPACK decoder, inspired by https://github.com/python-hyper/hpack/blob/master/hpack/huffman_table.py This modifies the table to be based on integers, rather than 3-tuples of bytes. This is for two reasons: 1. It's faster 2. Using bytes makes the static initializer too big, and doesn't compile. Result: Faster Huffman decoding. This only seems to help the ascii case, the other decoding is about the same. Benchmarks: ``` Before: Benchmark (limitToAscii) (sensitive) (size) Mode Cnt Score Error Units HpackDecoderBenchmark.decode true true SMALL thrpt 20 426293.636 ± 1444.843 ops/s HpackDecoderBenchmark.decode true true MEDIUM thrpt 20 57843.738 ± 725.704 ops/s HpackDecoderBenchmark.decode true true LARGE thrpt 20 3002.412 ± 16.998 ops/s HpackDecoderBenchmark.decode true false SMALL thrpt 20 412339.400 ± 1128.394 ops/s HpackDecoderBenchmark.decode true false MEDIUM thrpt 20 58226.870 ± 199.591 ops/s HpackDecoderBenchmark.decode true false LARGE thrpt 20 3044.256 ± 10.675 ops/s HpackDecoderBenchmark.decode false true SMALL thrpt 20 2082615.030 ± 5929.726 ops/s HpackDecoderBenchmark.decode false true MEDIUM thrpt 10 571640.454 ± 26499.229 ops/s HpackDecoderBenchmark.decode false true LARGE thrpt 20 92714.555 ± 2292.222 ops/s HpackDecoderBenchmark.decode false false SMALL thrpt 20 1745872.421 ± 6788.840 ops/s HpackDecoderBenchmark.decode false false MEDIUM thrpt 20 490420.323 ± 2455.431 ops/s HpackDecoderBenchmark.decode false false LARGE thrpt 20 84536.200 ± 398.714 ops/s After(bytes): Benchmark (limitToAscii) (sensitive) (size) Mode Cnt Score Error Units HpackDecoderBenchmark.decode true true SMALL thrpt 20 472649.148 ± 7122.461 ops/s HpackDecoderBenchmark.decode true true MEDIUM thrpt 20 66739.638 ± 341.607 ops/s HpackDecoderBenchmark.decode true true LARGE thrpt 20 3139.773 ± 24.491 ops/s HpackDecoderBenchmark.decode 
true false SMALL thrpt 20 466933.833 ± 4514.971 ops/s HpackDecoderBenchmark.decode true false MEDIUM thrpt 20 66111.778 ± 568.326 ops/s HpackDecoderBenchmark.decode true false LARGE thrpt 20 3143.619 ± 3.332 ops/s HpackDecoderBenchmark.decode false true SMALL thrpt 20 2109995.177 ± 6203.143 ops/s HpackDecoderBenchmark.decode false true MEDIUM thrpt 20 586026.055 ± 1578.550 ops/s HpackDecoderBenchmark.decode false false SMALL thrpt 20 1775723.270 ± 4932.057 ops/s HpackDecoderBenchmark.decode false false MEDIUM thrpt 20 493316.467 ± 1453.037 ops/s HpackDecoderBenchmark.decode false false LARGE thrpt 10 85726.219 ± 402.573 ops/s After(ints): Benchmark (limitToAscii) (sensitive) (size) Mode Cnt Score Error Units HpackDecoderBenchmark.decode true true SMALL thrpt 20 615549.006 ± 5282.283 ops/s HpackDecoderBenchmark.decode true true MEDIUM thrpt 20 86714.630 ± 654.489 ops/s HpackDecoderBenchmark.decode true true LARGE thrpt 20 3984.439 ± 61.612 ops/s HpackDecoderBenchmark.decode true false SMALL thrpt 20 602489.337 ± 5397.024 ops/s HpackDecoderBenchmark.decode true false MEDIUM thrpt 20 88399.109 ± 241.115 ops/s HpackDecoderBenchmark.decode true false LARGE thrpt 20 3875.729 ± 103.057 ops/s HpackDecoderBenchmark.decode false true SMALL thrpt 20 2092165.454 ± 11918.859 ops/s HpackDecoderBenchmark.decode false true MEDIUM thrpt 20 583465.437 ± 5452.115 ops/s HpackDecoderBenchmark.decode false true LARGE thrpt 20 93290.061 ± 665.904 ops/s HpackDecoderBenchmark.decode false false SMALL thrpt 20 1758402.495 ± 14677.438 ops/s HpackDecoderBenchmark.decode false false MEDIUM thrpt 10 491598.099 ± 5029.698 ops/s HpackDecoderBenchmark.decode false false LARGE thrpt 20 85834.290 ± 554.915 ops/s ```	2019-07-02 20:09:44 +02:00
root	5b58b8e6b5	[maven-release-plugin] prepare for next development iteration	2019-06-28 05:57:21 +00:00
root	35e0843376	[maven-release-plugin] prepare release netty-4.1.37.Final	2019-06-28 05:56:28 +00:00
jimin	856f1185e1	All overriding methods must be annotated with @Override (#9285 ) Motivation: Some methods that either override others or implement an interface method were missing the `@Override` annotation. Modifications: Add missing `@Override`s. Result: Code cleanup.	2019-06-27 13:51:26 +02:00
Alex Blewitt	52169cba95	Replace accumulation with blackhole.consume (#9275 ) Motivation: SpotJMHBugs reports that accumulating a value as a way of eliding dead code elimination may be inadvisable, as discussed in `JMHSample_34_SafeLooping::measureWrong_2`. Change the test so that it consumes the response with `Blackhole::consume` instead. Modifications: - Replace addition of results with explicit `blackhole.consume()` call Result: Tests work as before, but with different benchmark numbers.	2019-06-25 21:47:07 +02:00
Francesco Nigro	672fa0c779	Documented non-usage of Blackhole::consume on ByteBufAccessBenchmark (#9279 ) Motivation: Some JMH benchmarks need additional explanations to motivate specific code choices. Modifications: Introduced a comment to explain why calling Blackhole::consume in a loop is not always the right choice for some benchmarks. Result: The relevant method shows a comment that warns against changing the code to introduce Blackhole::consume in the loop.	2019-06-25 14:52:21 +02:00
Alex Blewitt	430eeee2f6	Return the result of the list.recycle() call (#9264 ) Motivation: Resolve the issue highlighted by SpotJMHBugs that the creation of the RecyclableArrayList may be elided by the JIT since the result isn't consumed or returned. Modifications: Return the result of `list.recycle()` so that the list isn't elided. Result: The JMH benchmark shows a change in performance indicating that the prior results of this may be unsound.	2019-06-22 07:22:15 +02:00
Carl Mastrangelo	9abeaf16fd	Properly debounce wakeups (#9191 ) Motivation: The wakeup logic in EpollEventLoop is overly complex Modification: * Simplify the race to wakeup the loop * Don't let the event loop wake up itself (it's already awake!) * Make event loop check if there are any more tasks after preparing to sleep. There is a small window where the non-eventloop writers can issue eventfd writes here, but that is okay. Result: Cleaner wakeup logic. Benchmarks: ``` BEFORE Benchmark Mode Cnt Score Error Units EpollSocketChannelBenchmark.executeMulti thrpt 20 408381.411 ± 2857.498 ops/s EpollSocketChannelBenchmark.executeSingle thrpt 20 157022.360 ± 1240.573 ops/s EpollSocketChannelBenchmark.pingPong thrpt 20 60571.704 ± 331.125 ops/s AFTER Benchmark Mode Cnt Score Error Units EpollSocketChannelBenchmark.executeMulti thrpt 20 440546.953 ± 1652.823 ops/s EpollSocketChannelBenchmark.executeSingle thrpt 20 168114.751 ± 1176.609 ops/s EpollSocketChannelBenchmark.pingPong thrpt 20 61231.878 ± 520.108 ops/s ```	2019-06-04 05:17:23 -07:00
Nick Hill	2ca526fac6	Ensure "full" ownership of msgs passed to EmbeddedChannel.writeInbound() (#9058 ) Motivation Pipeline handlers are free to "take control" of input buffers if they have singular refcount - in particular to mutate their raw data if non-readonly via discarding of read bytes, etc. However there are various places (primarily unit tests) where a wrapped byte-array buffer is passed in and the wrapped array is assumed not to change (used after the wrapped buffer is passed to EmbeddedChannel.writeInbound()). This invalid assumption could result in unexpected errors, such as those exposed by #8931. Modifications Anywhere that the data passed to writeInbound() might be used again, ensure that either: - A copy is used rather than wrapping a shared byte array, or - The buffer is otherwise protected from modification by making it read-only For the tests, copying is preferred since it still allows the "mutating" optimizations to be exercised. Results Avoid possible errors when pipeline assumes it has full control of input buffer.	2019-05-22 12:08:49 +02:00
root	ba06eafa1c	[maven-release-plugin] prepare for next development iteration	2019-04-30 16:42:29 +00:00
root	49a451101c	[maven-release-plugin] prepare release netty-4.1.36.Final	2019-04-30 16:41:28 +00:00
root	baab215f66	[maven-release-plugin] prepare for next development iteration	2019-04-17 07:26:24 +00:00
root	dfe657e2d4	[maven-release-plugin] prepare release netty-4.1.35.Final	2019-04-17 07:25:40 +00:00
Francesco Nigro	fb50847e39	The benchmark is not taking into account nanoTime granularity (#9033 ) Motivation: Results are just wrong for small delays. Modifications: Switching to AverageTime avoids relying on OS nanoTime granularity. Result: Uncontended low delay results are now reliable	2019-04-15 15:14:36 +02:00
Norman Maurer	8f7ef1cabb	Skip execution of ChannelHandler method if annotated with @Skip and … (#8988 ) Motivation: Invoking ChannelHandlers is not free and can result in some overhead when the ChannelPipeline becomes very long. This is especially true if most handlers will just forward the call to the next handler in the pipeline. When the user extends ChannelHandlerAdapter we can easily detect if we can just skip the handler and invoke the next handler in the pipeline directly. This reduces the overhead of dispatch and also reduces the call-stack in many cases. This backports https://github.com/netty/netty/pull/8723 and https://github.com/netty/netty/pull/8987 to 4.1 Modifications: Detect if we can skip the handler when walking the pipeline. Result: Reduce overhead for long pipelines. Benchmark (extraHandlers) Mode Cnt Score Error Units DefaultChannelPipelineBenchmark.propagateEventOld 4 thrpt 10 267313.031 ± 9131.140 ops/s DefaultChannelPipelineBenchmark.propagateEvent 4 thrpt 10 824825.673 ± 12727.594 ops/s	2019-04-09 09:36:52 +02:00
root	92b19cfedd	[maven-release-plugin] prepare for next development iteration	2019-03-08 08:55:45 +00:00
root	ff7a9fa091	[maven-release-plugin] prepare release netty-4.1.34.Final	2019-03-08 08:51:34 +00:00
Norman Maurer	14ef469f31	Use maven plugin to prevent API/ABI breakage as part of build process (#8904 ) Motivation: Netty is very widely used which can lead to a lot of pain when we break API / ABI. We should make use of japicmp-maven-plugin during the build to verify we do not introduce breakage by mistake. Modifications: - Add japicmp-maven-plugin to the build process - Fix a method signature change in HttpProxyHandler that was flagged as a possible problem. Result: Ensure no API/ABI breakage occurs between releases.	2019-03-01 19:42:29 +01:00
Nick Hill	0811409ca3	Further reduce ensureAccessible() overhead (#8895 ) Motivation: This PR fixes some non-negligible overhead discovered in the ByteBuf accessibility (non-zero refcount) checking. The cause turned out to be mostly twofold: - Unnecessary operations used to calculate the refcount from the "raw" encoded int field value - Call stack depths exceeding the default limit for inlining, in some places (CompositeByteBuf in particular) It's a follow-on from #8882 which uses the maxCapacity field for a simpler non-negative check. The performance gap between these two variants appears to be _mostly_ closed, but there's one exception which may warrant further analysis. Modifications: - Replace ABB.internalRefCount() with ByteBuf.isAccessible(), the default still checks for non-zero refCnt() - Just test for parity of raw refCnt instead of converting to "real", with fast-path for specific small values - Make sure isAccessible() is delegated by derived/wrapper ByteBufs - Use existing freed flag in CompositeByteBuf for faster isAccessible() - Manually inline some calls in methods like CompositeByteBuf.setLong() and AbstractReferenceCountedByteBuf.isAccessible() to reduce stack depths (to ensure default inlining limit isn't hit) - Add ByteBufAccessBenchmark which is an extension of UnsafeByteBufBenchmark (maybe latter could now be removed) Results: Before: Benchmark (bufferType) (checkAccessible) (checkBounds) Mode Cnt Score Error Units readBatch UNSAFE true true thrpt 30 84524972.863 ± 518338.811 ops/s readBatch UNSAFE_SLICE true true thrpt 30 38608795.037 ± 298176.974 ops/s readBatch HEAP true true thrpt 30 80003697.649 ± 974674.119 ops/s readBatch COMPOSITE true true thrpt 30 18495554.788 ± 108075.023 ops/s setGetLong UNSAFE true true thrpt 30 247069881.578 ± 10839162.593 ops/s setGetLong UNSAFE_SLICE true true thrpt 30 196355905.206 ± 1802420.990 ops/s setGetLong HEAP true true thrpt 30 245686644.713 ± 11769311.527 ops/s setGetLong COMPOSITE true true thrpt 30 
83170940.687 ± 657524.123 ops/s setLong UNSAFE true true thrpt 30 278940253.918 ± 1807265.259 ops/s setLong UNSAFE_SLICE true true thrpt 30 202556738.764 ± 11887973.563 ops/s setLong HEAP true true thrpt 30 280045958.053 ± 2719583.400 ops/s setLong COMPOSITE true true thrpt 30 121299806.002 ± 2155084.707 ops/s After: Benchmark (bufferType) (checkAccessible) (checkBounds) Mode Cnt Score Error Units readBatch UNSAFE true true thrpt 30 101641801.035 ± 3950050.059 ops/s readBatch UNSAFE_SLICE true true thrpt 30 84395902.846 ± 4339579.057 ops/s readBatch HEAP true true thrpt 30 100179060.207 ± 3222487.287 ops/s readBatch COMPOSITE true true thrpt 30 42288494.472 ± 294919.633 ops/s setGetLong UNSAFE true true thrpt 30 304530755.027 ± 6574163.899 ops/s setGetLong UNSAFE_SLICE true true thrpt 30 212028547.645 ± 14277828.768 ops/s setGetLong HEAP true true thrpt 30 309335422.609 ± 2272150.415 ops/s setGetLong COMPOSITE true true thrpt 30 160383609.236 ± 966484.033 ops/s setLong UNSAFE true true thrpt 30 298055969.747 ± 7437449.627 ops/s setLong UNSAFE_SLICE true true thrpt 30 223784178.650 ± 9869750.095 ops/s setLong HEAP true true thrpt 30 302543263.328 ± 8140104.706 ops/s setLong COMPOSITE true true thrpt 30 157083673.285 ± 3528779.522 ops/s There's also a similar knock-on improvement to other benchmarks (e.g. HPACK encoding/decoding) as shown in #8882. For sanity I did a final comparison of the "fast path" tweak using one of the HPACK benchmarks: (rawCnt & 1) == 0: Benchmark (limitToAscii) (sensitive) (size) Mode Cnt Score Error Units HpackDecoderBenchmark.decode true true MEDIUM thrpt 30 50914.479 ± 940.114 ops/s rawCnt == 2 || rawCnt == 4 || rawCnt == 6 || rawCnt == 8 || (rawCnt & 1) == 0: Benchmark (limitToAscii) (sensitive) (size) Mode Cnt Score Error Units HpackDecoderBenchmark.decode true true MEDIUM thrpt 30 60036.425 ± 1478.196 ops/s	2019-02-28 20:40:41 +01:00
Dmitriy Dumanskiy	b72fea340b	Improve DateFormatter parsing performance (#8821 ) Motivation: While looking through the code I found that DateFormatter.tryParseMonth was not very efficient, so I decided to optimize it a bit. Modification: Changed the DateFormatter.tryParseMonth method. Instead of invoking regionMatch() for every month, compare chars one by one. Result: DateFormatter.parseHttpDate method performance improved by ~3% to ~15%. Benchmark (DATE_STRING) Mode Cnt Score Error Units DateFormatter2Benchmark.parseHttpHeaderDateFormatter Sun, 27 Jan 2016 19:18:46 GMT thrpt 6 4142781.221 ± 82155.002 ops/s DateFormatter2Benchmark.parseHttpHeaderDateFormatter Sun, 27 Dec 2016 19:18:46 GMT thrpt 6 3781810.558 ± 38679.061 ops/s DateFormatter2Benchmark.parseHttpHeaderDateFormatterNew Sun, 27 Jan 2016 19:18:46 GMT thrpt 6 4372569.705 ± 30257.537 ops/s DateFormatter2Benchmark.parseHttpHeaderDateFormatterNew Sun, 27 Dec 2016 19:18:46 GMT thrpt 6 4339785.100 ± 57542.660 ops/s	2019-02-04 10:04:20 +01:00
Norman Maurer	cd3254df88	Update to new checkstyle plugin (#8777 ) (#8780 ) Motivation: We need to update to a new checkstyle plugin to allow the usage of lambdas. Modifications: - Update to new plugin version. - Fix checkstyle problems. Result: Be able to use checkstyle plugin which supports new Java syntax.	2019-01-25 11:58:42 +01:00
root	cf03ed0478	[maven-release-plugin] prepare for next development iteration	2019-01-21 12:26:44 +00:00
root	37484635cb	[maven-release-plugin] prepare release netty-4.1.33.Final	2019-01-21 12:26:12 +00:00
Francesco Nigro	b8a3394f9b	Adding an execute burst cost benchmark for Netty executors (#8594 ) Motivation: Netty executors don't yet have any means of comparison with each other or with the j.u.c. executors Modifications: A new benchmark measuring execute burst cost is being added Result: It's now possible to compare some of the Netty executors with each other and with the j.u.c. executors	2018-12-04 15:46:25 +01:00
root	8eb313072e	[maven-release-plugin] prepare for next development iteration	2018-11-29 11:15:09 +00:00
root	afcb4a37d3	[maven-release-plugin] prepare release netty-4.1.32.Final	2018-11-29 11:14:20 +00:00
Nick Hill	10539f4dc7	Streamline CompositeByteBuf internals (#8437 ) Motivation: CompositeByteBuf is a powerful and versatile abstraction, allowing for manipulation of large data without copying bytes. There is still a non-negligible cost to reading/writing however relative to "singular" ByteBufs, and this can be mostly eliminated with some rework of the internals. My use case is message modification/transformation while zero-copy proxying. For example replacing a string within a large message with one of a different length Modifications: - No longer slice added buffers and unwrap added slices - Components store target buf offset relative to position in composite buf - Less allocations, object footprint, pointer indirection, offset arithmetic - Use Component[] rather than ArrayList<Component> - Avoid pointer indirection and duplicate bounds check, more efficient backing array growth - Facilitates optimization when doing bulk-inserts - inserting n ByteBufs behind m is now O(m + n) instead of O(mn) - Avoid unnecessary casting and method call indirection via superclass - Eliminate some duplicate range/ref checks via non-checking versions of toComponentIndex and findComponent - Add simple fast-path for toComponentIndex(0); add racy cache of last-accessed Component to findComponent(int) - Override forEachByte0(...) and forEachByteDesc0(...) methods - Make use of RecyclableArrayList in nioBuffers(int, int) (in line with FasterCompositeByteBuf impl) - Modify addComponents0(boolean,int,Iterable) to use the Iterable directly rather than copy to an array first (and possibly to an ArrayList before that) - Optimize addComponents0(boolean,int,ByteBuf[],int) to not perform repeated array insertions and avoid second loop for offset updates - Simplify other logic in various places, in particular the general pattern used where a sub-range is iterated over - Add benchmarks to demonstrate some improvements While refactoring I also came across a couple of clear bugs. 
They are fixed in these changes but I will open another PR with unit tests and fixes to the current version. Result: Much faster creation, manipulation, and access; many fewer allocations and smaller footprint. Benchmark results to follow.	2018-11-03 10:37:07 +01:00
root	3e7ddb36c7	[maven-release-plugin] prepare for next development iteration	2018-10-29 15:38:51 +00:00
root	9e50739601	[maven-release-plugin] prepare release netty-4.1.31.Final	2018-10-29 15:37:47 +00:00
Nick Hill	583d838f7c	Optimize AbstractByteBuf.getCharSequence() in US_ASCII case (#8392 ) Motivation: Inspired by https://github.com/netty/netty/pull/8388, I noticed this simple optimization to avoid char[] allocation (also suggested in a TODO here). Modifications: Return an AsciiString from AbstractByteBuf.getCharSequence() if requested charset is US_ASCII or ISO_8859_1 (latter thanks to @Scottmitch's suggestion). Also tweak unit tests not to require Strings and include a new benchmark to demonstrate the speedup. Result: Speed-up of AbstractByteBuf.getCharSequence() in ascii and iso 8859/1 cases	2018-10-26 15:32:38 -07:00
Norman Maurer	87ec2f882a	Reduce overhead by ByteBufUtil.decodeString(...) which is used by `AbstractByteBuf.toString(...)` and `AbstractByteBuf.getCharSequence(...)` (#8388 ) Motivation: Our current implementation that is used for toString(Charset) operations on AbstractByteBuf implementations is quite slow as it does a lot of unnecessary memory copies. We should just use new String(...) as it has a lot of optimizations to handle these cases. Modifications: Rewrite ByteBufUtil.decodeString(...) to use new String(...) Result: Less overhead for toString(Charset) operations. Benchmark (charsetName) (direct) (size) Mode Cnt Score Error Units ByteBufUtilDecodeStringBenchmark.decodeString US-ASCII false 8 thrpt 20 22401645.093 ± 4671452.479 ops/s ByteBufUtilDecodeStringBenchmark.decodeString US-ASCII false 64 thrpt 20 23678483.384 ± 3749164.446 ops/s ByteBufUtilDecodeStringBenchmark.decodeString US-ASCII true 8 thrpt 20 15731142.651 ± 3782931.591 ops/s ByteBufUtilDecodeStringBenchmark.decodeString US-ASCII true 64 thrpt 20 16244232.229 ± 1886259.658 ops/s ByteBufUtilDecodeStringBenchmark.decodeString UTF-8 false 8 thrpt 20 25983680.959 ± 5045782.289 ops/s ByteBufUtilDecodeStringBenchmark.decodeString UTF-8 false 64 thrpt 20 26235589.339 ± 2867004.950 ops/s ByteBufUtilDecodeStringBenchmark.decodeString UTF-8 true 8 thrpt 20 18499027.808 ± 4784684.268 ops/s ByteBufUtilDecodeStringBenchmark.decodeString UTF-8 true 64 thrpt 20 16825286.141 ± 1008712.342 ops/s ByteBufUtilDecodeStringBenchmark.decodeString UTF-16 false 8 thrpt 20 5789879.092 ± 1201786.359 ops/s ByteBufUtilDecodeStringBenchmark.decodeString UTF-16 false 64 thrpt 20 2173243.225 ± 417809.341 ops/s ByteBufUtilDecodeStringBenchmark.decodeString UTF-16 true 8 thrpt 20 5035583.011 ± 1001978.854 ops/s ByteBufUtilDecodeStringBenchmark.decodeString UTF-16 true 64 thrpt 20 2162345.301 ± 402410.408 ops/s ByteBufUtilDecodeStringBenchmark.decodeString ISO-8859-1 false 8 thrpt 20 30039052.376 ± 
6539111.622 ops/s ByteBufUtilDecodeStringBenchmark.decodeString ISO-8859-1 false 64 thrpt 20 31414163.515 ± 2096710.526 ops/s ByteBufUtilDecodeStringBenchmark.decodeString ISO-8859-1 true 8 thrpt 20 19538587.855 ± 4639115.572 ops/s ByteBufUtilDecodeStringBenchmark.decodeString ISO-8859-1 true 64 thrpt 20 19467839.722 ± 1672687.213 ops/s ByteBufUtilDecodeStringBenchmark.decodeStringOld US-ASCII false 8 thrpt 20 10787326.745 ± 1034197.864 ops/s ByteBufUtilDecodeStringBenchmark.decodeStringOld US-ASCII false 64 thrpt 20 7129801.930 ± 1363019.209 ops/s ByteBufUtilDecodeStringBenchmark.decodeStringOld US-ASCII true 8 thrpt 20 9002529.605 ± 2017642.445 ops/s ByteBufUtilDecodeStringBenchmark.decodeStringOld US-ASCII true 64 thrpt 20 3860192.352 ± 826218.738 ops/s ByteBufUtilDecodeStringBenchmark.decodeStringOld UTF-8 false 8 thrpt 20 10532838.027 ± 2151743.968 ops/s ByteBufUtilDecodeStringBenchmark.decodeStringOld UTF-8 false 64 thrpt 20 7185554.597 ± 1387685.785 ops/s ByteBufUtilDecodeStringBenchmark.decodeStringOld UTF-8 true 8 thrpt 20 7352253.316 ± 1333823.850 ops/s ByteBufUtilDecodeStringBenchmark.decodeStringOld UTF-8 true 64 thrpt 20 2825578.707 ± 349701.156 ops/s ByteBufUtilDecodeStringBenchmark.decodeStringOld UTF-16 false 8 thrpt 20 7277446.665 ± 1447034.346 ops/s ByteBufUtilDecodeStringBenchmark.decodeStringOld UTF-16 false 64 thrpt 20 2445929.579 ± 562816.641 ops/s ByteBufUtilDecodeStringBenchmark.decodeStringOld UTF-16 true 8 thrpt 20 6201174.401 ± 1236137.786 ops/s ByteBufUtilDecodeStringBenchmark.decodeStringOld UTF-16 true 64 thrpt 20 2310674.973 ± 525587.959 ops/s ByteBufUtilDecodeStringBenchmark.decodeStringOld ISO-8859-1 false 8 thrpt 20 11142625.392 ± 1680556.468 ops/s ByteBufUtilDecodeStringBenchmark.decodeStringOld ISO-8859-1 false 64 thrpt 20 8127116.405 ± 1128513.860 ops/s ByteBufUtilDecodeStringBenchmark.decodeStringOld ISO-8859-1 true 8 thrpt 20 9405751.952 ± 
2193324.806 ops/s ByteBufUtilDecodeStringBenchmark.decodeStringOld ISO-8859-1 true 64 thrpt 20 3943282.076 ± 737798.070 ops/s Benchmark result is saved to /home/norman/mainframer/netty/microbench/target/reports/performance/ByteBufUtilDecodeStringBenchmark.json Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1,030.173 sec - in io.netty.buffer.ByteBufUtilDecodeStringBenchmark [1030.460s][info ][gc,heap,exit ] Heap [1030.460s][info ][gc,heap,exit ] garbage-first heap total 516096K, used 257918K [0x0000000609a00000, 0x0000000800000000) [1030.460s][info ][gc,heap,exit ] region size 2048K, 127 young (260096K), 2 survivors (4096K) [1030.460s][info ][gc,heap,exit ] Metaspace used 17123K, capacity 17438K, committed 17792K, reserved 1064960K [1030.460s][info ][gc,heap,exit ] class space used 1709K, capacity 1827K, committed 1920K, reserved 1048576K	2018-10-19 14:00:13 +02:00
root	2d7cb47edd	[maven-release-plugin] prepare for next development iteration	2018-09-27 19:00:45 +00:00
root	3a9ac829d5	[maven-release-plugin] prepare release netty-4.1.30.Final	2018-09-27 18:56:12 +00:00
Norman Maurer	e542a2cf26	Use a non-volatile read for ensureAccessible() whenever possible to reduce overhead and allow better inlining. (#8266 ) Motivation: At the moment whenever ensureAccessible() is called in our ByteBuf implementations (which is basically on each operation) we will do a volatile read. That per se is not such a bad thing, but the problem here is that it will also reduce the optimizations that the compiler / JIT can do. For example, as these are volatile it cannot eliminate multiple loads of it when inlining the methods of ByteBuf, which happens quite frequently because most of them are quite small and very hot. That is especially true for all the methods that act on primitives. It gets even worse as people often call a lot of these after each other in the same method or even use method chaining here. The idea of the change is basically to just use a non-volatile read for the ensureAccessible() check, as it's a best-effort implementation to detect acting on already released buffers anyway: even with a volatile read it could happen that the user will release it in another thread before we actually access the buffer after the reference check. Modifications: - Try to do a non-volatile read using sun.misc.Unsafe if we can use it. - Add a benchmark Result: Big performance win when multiple ByteBuf methods are called from a method. With the change: UnsafeByteBufBenchmark.setGetLongUnsafeByteBuf thrpt 20 281395842,128 ± 5050792,296 ops/s Before the change: UnsafeByteBufBenchmark.setGetLongUnsafeByteBuf thrpt 20 217419832,801 ± 5080579,030 ops/s	2018-09-07 07:47:02 +02:00
Norman Maurer	052c2fbefe	Update to jmh 1.21 (#8270 ) Motivation: We should use the latest jmh version which also supports -prof dtraceasm on MacOS. Modifications: Update to latest jmh version. Result: Better benchmark / profiling support on MacOS.	2018-09-06 22:31:52 +02:00
Norman Maurer	02d559e6a4	Remove flags when running benchmarks. (#8262 ) Motivation: Some of the flags we used are not supported anymore on more recent JDK versions. We should just remove all of them and only keep what we really need. This may also reflect better what people use in production. Modifications: Remove some flags when running the benchmarks. Result: Benchmarks also run with JDK11.	2018-09-05 19:05:02 +02:00
Norman Maurer	8635d88d4d	Allow to generate a jmh uber jar to run benchmarks easily from cmdline with different arguments. (#8264 ) Motivation: It is sometimes useful to be able to run benchmarks easily from the commandline and pass different arguments / options here. We should support this. Modifications: Add the benchmark-jar profile which allows to generate such an "uber-jar" that can be used directly to run benchmarks as documented at http://openjdk.java.net/projects/code-tools/jmh/. Result: More flexible way to run benchmarks.	2018-09-05 18:28:35 +02:00
Carl Mastrangelo	379a56ca49	Add an Epoll benchmark Motivation: Optimizing the Epoll channel needs an objective measure of how fast it is. Modification: Add a simple, closed loop, ping-pong benchmark. Result: Benchmark can be used to measure #7816 Initial numbers: ``` Result "io.netty.microbench.channel.epoll.EpollSocketChannelBenchmark.pingPong": 22614.403 ±(99.9%) 797.263 ops/s [Average] (min, avg, max) = (21093.160, 22614.403, 24977.387), stdev = 918.130 CI (99.9%): [21817.140, 23411.666] (assumes normal distribution) Benchmark Mode Cnt Score Error Units EpollSocketChannelBenchmark.pingPong thrpt 20 22614.403 ± 797.263 ops/s ```	2018-09-04 10:15:15 +02:00
Francesco Nigro	c78be33443	Added configurable ByteBuf bounds checking (#7521 ) Motivation: The JVM isn't always able to hoist out/reduce bounds checking (due to ref counting operations etc.), hence making it configurable could improve performance for the most CPU-intensive use cases. Modifications: Each AbstractByteBuf bounds check is now guarded by a new static final configuration property similar to checkAccessible, i.e. io.netty.buffer.bytebuf.checkBounds. Result: Any user can disable ByteBuf bounds checking in order to gain extra performance.	2018-09-03 20:33:47 +02:00
root	a580dc7585	[maven-release-plugin] prepare for next development iteration	2018-08-24 06:36:33 +00:00
root	3fc789e83f	[maven-release-plugin] prepare release netty-4.1.29.Final	2018-08-24 06:36:06 +00:00
root	fcb19cb589	[maven-release-plugin] prepare for next development iteration	2018-07-27 04:59:28 +00:00
root	ff785fbe39	[maven-release-plugin] prepare release netty-4.1.28.Final	2018-07-27 04:59:06 +00:00
root	b4dbdc2036	[maven-release-plugin] prepare for next development iteration	2018-07-11 15:37:40 +00:00
root	1c16519ac8	[maven-release-plugin] prepare release netty-4.1.27.Final	2018-07-11 15:37:21 +00:00
root	7bb9e7eafe	[maven-release-plugin] prepare for next development iteration	2018-07-10 05:21:24 +00:00
root	8ca5421bd2	[maven-release-plugin] prepare release netty-4.1.26.Final	2018-07-10 05:18:13 +00:00
Norman Maurer	83710cb2e1	Replace toArray(new T[size]) with toArray(new T[0]) to eliminate zero-out and allow the VM to optimize. (#8075 ) Motivation: Using toArray(new T[0]) is usually the faster approach these days. We should use it. See also https://shipilev.net/blog/2016/arrays-wisdom-ancients/#_conclusion. Modifications: Replace toArray(new T[size]) with toArray(new T[0]). Result: Faster code.	2018-06-29 07:56:04 +02:00
unknown	4a8d3a274c	Including the setup code in the benchmark method to avoid JMH Invocation level hiccups. Motivation: The usage of Invocation level for JMH fixture methods (setup/teardown) incurs a significant overhead in the benchmark time (see org.openjdk.jmh.annotations.Level documentation). In the case of CodecInputListBenchmark, benchmarks are far too small (less than 50ns) and the Invocation level setup offsets the measurement considerably. In such cases, the recommended fix is to include the setup/teardown code in the benchmark method. Modifications: Include the setup/teardown code in the relevant benchmark methods. Remove the setup/teardown methods from the benchmark class. Result: Running the entire benchmark 10 times with default parameters, we observed: - ArrayList benchmark affected directly by JMH overhead is now 15-80% faster. - CodecList benchmark is now 50% faster than original (even with the setup code being measured). - Recyclable ArrayList is ~30% slower. - All benchmarks have significantly different means (ANOVA) and medians (Moore) Mode: Throughput (Higher the better) Method Full params Factor Modified (Median) Original (Median) recyclableArrayList (elements = 1) 0.615520967 21719082.75 35285691.2 recyclableArrayList (elements = 4) 0.699553431 17149442.76 24514843.31 arrayList (elements = 4) 1.152666631 27120407.18 23528404.88 codecOutList (elements = 1) 1.527275908 67251089.04 44033359.47 codecOutList (elements = 4) 1.596917095 59174088.78 37055204.03 arrayList (elements = 1) 1.878616889 62188238.24 33103204.06 Environment: Tests run on a Computational server with CPU: E5-1660-3.3GHZ (6 cores + HT), 64 GB RAM.	2018-06-21 12:22:13 +02:00
unknown	cb420a9ffc	Including the setup code in the benchmark method to avoid JMH Invocation level hiccups. Motivation: The usage of Invocation level for JMH fixture methods (setup/teardown) incurs a significant impact on the benchmark time (see org.openjdk.jmh.annotations.Level documentation). When the benchmark and the setup/teardown are too small (less than a millisecond), the Invocation level might saturate the system with timestamp requests and iteration synchronizations, which introduce artificial latency, throughput, and scalability bottlenecks. In the HeadersBenchmark, all benchmarks take less than 100ns and the Invocation level setup offsets the measurement considerably. As fixture methods are defined for the entire class, this overhead also impacts every single benchmark in this class, not only the ones that use the emptyHttpHeaders object (cleaned in the setup). The recommended fix here is to include the setup/teardown code in the benchmark where the object is used. Modifications: Include the setup/teardown code in the relevant benchmark methods. Remove the setup/teardown method of Invocation level from the benchmark class. Result: Running all benchmarks from HeadersBenchmark 10 times with default parameters, we observed: - Benchmarks that were not directly affected by the fix improved in execution time. For instance, http2Remove with (exampleHeader = THREE) had its median reported as 2x faster than the original version. - Benchmarks that had the setup code inserted (e.g. http2AddAllFastest) did not suffer a significant penalty in execution time, as the benchmarks are not dominated by the clear(). Environment: Tests run on a Computational server with CPU: E5-1660-3.3GHZ (6 cores + HT), 64 GB RAM.	2018-06-21 12:21:19 +02:00
Norman Maurer	64bb279f47	[maven-release-plugin] prepare for next development iteration	2018-05-14 11:11:45 +00:00
Norman Maurer	c67a3b0507	[maven-release-plugin] prepare release netty-4.1.25.Final	2018-05-14 11:11:24 +00:00
Norman Maurer	b75f44db9a	[maven-release-plugin] prepare for next development iteration	2018-04-19 11:56:07 +00:00
Norman Maurer	04fac00c8c	[maven-release-plugin] prepare release netty-4.1.24.Final	2018-04-19 11:55:47 +00:00
root	0a61f055f5	[maven-release-plugin] prepare for next development iteration	2018-04-04 10:44:46 +00:00
root	8c549bad38	[maven-release-plugin] prepare release netty-4.1.23.Final	2018-04-04 10:44:15 +00:00
Scott Mitchell	9d51a40df0	Update NetUtilBenchmark (#7826 ) Motivation: NetUtilBenchmark is using out of date data, throws an exception in the benchmark, and allocates a Set on each run. Modifications: - Update the benchmark and reduce each run's overhead Result: NetUtilBenchmark is updated.	2018-03-31 08:27:08 +02:00
Francesco Nigro	ed46c4ed00	Copies from read-only heap ByteBuffer to direct ByteBuf can avoid stealth ByteBuf allocation and additional copies Motivation: A read-only heap ByteBuffer doesn't expose its array: the existing method to perform copies to a direct ByteBuf involves the creation of an additional (maybe pooled) heap ByteBuf instance and a copy. Modifications: To avoid stressing the allocator with additional (and stealth) heap ByteBuf allocations, a method is provided to perform copies using the (pooled) internal NIO buffer. Result: Copies from read-only heap ByteBuffer to direct ByteBuf won't create any intermediate ByteBuf.	2018-02-27 09:54:21 +09:00
Norman Maurer	69582c0b6c	[maven-release-plugin] prepare for next development iteration	2018-02-21 12:52:33 +00:00
Norman Maurer	786f35c6c9	[maven-release-plugin] prepare release netty-4.1.22.Final	2018-02-21 12:52:19 +00:00
Norman Maurer	e71fa1e7b6	[maven-release-plugin] prepare for next development iteration	2018-02-05 12:02:35 +00:00
Norman Maurer	41ebb5fcca	[maven-release-plugin] prepare release netty-4.1.21.Final	2018-02-05 12:02:19 +00:00
Julien Hoarau	3e6b54bb59	Fix failing h2spec tests 8.1.2.1 related to pseudo-headers validation Motivation: According to the spec: All pseudo-header fields MUST appear in the header block before regular header fields. Any request or response that contains a pseudo-header field that appears in a header block after a regular header field MUST be treated as malformed (Section 8.1.2.6). Pseudo-header fields are only valid in the context in which they are defined. Pseudo-header fields defined for requests MUST NOT appear in responses; pseudo-header fields defined for responses MUST NOT appear in requests. Pseudo-header fields MUST NOT appear in trailers. Endpoints MUST treat a request or response that contains undefined or invalid pseudo-header fields as malformed (Section 8.1.2.6). Clients MUST NOT accept a malformed response. Note that these requirements are intended to protect against several types of common attacks against HTTP; they are deliberately strict because being permissive can expose implementations to these vulnerabilities. Modifications: - Introduce validation in HPackDecoder Result: - Requests with unknown pseudo-header fields are rejected - Requests containing response-specific pseudo-headers are rejected - Requests where pseudo-headers appear after regular headers are rejected - h2spec 8.1.2.1 passes	2018-01-29 19:42:56 -08:00
Norman Maurer	4c1e0f596a	Use FastThreadLocal for CodecOutputList Motivation: We used Recycler for the CodecOutputList which is not optimized for the use-case of access only from the same Thread all the time. Modifications: - Use FastThreadLocal for CodecOutputList - Add benchmark Result: Less overhead in our codecs.	2018-01-23 11:34:28 +01:00
Norman Maurer	ea58dc7ac7	[maven-release-plugin] prepare for next development iteration	2018-01-21 12:53:51 +00:00
Norman Maurer	96c7132dee	[maven-release-plugin] prepare release netty-4.1.20.Final	2018-01-21 12:53:34 +00:00
Francesco Nigro	1cf2687244	Fixed JMH ByteBuf benchmark to avoid dead code elimination Motivation: The JMH documentation suggests using Blackholes to avoid dead code elimination, so it would be better to follow this best practice. Modifications: Each benchmark method now returns the ByteBuf/ByteBuffer to prevent the JVM from performing any dead code elimination. Result: The results are more reliable and comparable to those provided by other ByteBuf benchmarks (e.g. HeapByteBufBenchmark)	2017-12-19 14:09:18 +01:00
Scott Mitchell	55ef09f191	Add HttpObjectEncoderBenchmark Motivation: Benchmark to measure HttpObjectEncoder performance. Modifications: - Create new benchmark HttpObjectEncoderBenchmark Result: JMH Microbenchmark for HttpObjectEncoder.	2017-12-16 13:47:34 +01:00
Scott Mitchell	5f0342ebe0	Add RedisEncoderBenchmark Motivation: Add a benchmark to measure RedisEncoder's performance Modifications: - Add RedisEncoderBenchmark Result: JMH benchmark exists to measure RedisEncoder's performance.	2017-12-16 13:42:50 +01:00
Norman Maurer	264a5daa41	[maven-release-plugin] prepare for next development iteration	2017-12-15 13:10:54 +00:00
Norman Maurer	0786c4c8d9	[maven-release-plugin] prepare release netty-4.1.19.Final	2017-12-15 13:09:30 +00:00
Norman Maurer	b2bc6407ab	[maven-release-plugin] prepare for next development iteration	2017-12-08 09:26:15 +00:00
Norman Maurer	96732f47d8	[maven-release-plugin] prepare release netty-4.1.18.Final	2017-12-08 09:25:56 +00:00
Scott Mitchell	93b144b7b4	HttpMethod#valueOf improvement Motivation: HttpMethod#valueOf shows up on profiler results in the top set of results. Since it is a relatively simple operation it can be improved in isolation. Modifications: - Introduce a special case map which assigns each HttpMethod to a unique index in an array and provides constant time lookup from a hash code algorithm. When the bucket is matched we can then directly do equality comparison instead of potentially following a linked structure when HashMap has hash collisions. Result: ~10% improvement in benchmark results for HttpMethod#valueOf Benchmark Mode Cnt Score Error Units HttpMethodMapBenchmark.newMapKnownMethods thrpt 16 31.831 ± 0.928 ops/us HttpMethodMapBenchmark.newMapMixMethods thrpt 16 25.568 ± 0.400 ops/us HttpMethodMapBenchmark.newMapUnknownMethods thrpt 16 51.413 ± 1.824 ops/us HttpMethodMapBenchmark.oldMapKnownMethods thrpt 16 29.226 ± 0.330 ops/us HttpMethodMapBenchmark.oldMapMixMethods thrpt 16 21.073 ± 0.247 ops/us HttpMethodMapBenchmark.oldMapUnknownMethods thrpt 16 49.081 ± 0.577 ops/us	2017-11-20 11:07:50 -08:00
Scott Mitchell	e6126215e0	DefaultHttp2FrameWriter reduce object allocation Motivation: DefaultHttp2FrameWriter#writeData allocates a DataFrameHeader for each write operation. DataFrameHeader maintains internal state and allocates multiple slices of a buffer which is a maximum of 30 bytes. This 30 byte buffer may not always be necessary and the additional slice operations can utilize retainedSlice to take advantage of pooled objects. We can also save computation and object allocations if there is no padding which is a common case in practice. Modifications: - Remove DataFrameHeader - Add a fast path for padding == 0 Result: Less object allocation in DefaultHttp2FrameWriter	2017-11-20 08:10:59 -08:00
Anuraag Agrawal	1f1a60ae7d	Use Netty's DefaultPriorityQueue instead of JDK's PriorityQueue for scheduled tasks Motivation: `AbstractScheduledEventExecutor` uses a standard `java.util.PriorityQueue` to keep track of task deadlines. `ScheduledFuture.cancel` removes tasks from this `PriorityQueue`. Unfortunately, `PriorityQueue.remove` has `O(n)` performance since it must search for the item in the entire queue before removing it. This is fast when the future is at the front of the queue (e.g., already triggered) but not when it's randomly located in the queue. Many servers will use `ScheduledFuture.cancel` on all requests, e.g., to manage a request timeout. As these cancellations will happen in arbitrary order, when there are many scheduled futures, `PriorityQueue.remove` is a bottleneck and greatly hurts performance with many concurrent requests (>10K). Modification: Use netty's `DefaultPriorityQueue` for scheduling futures instead of the JDK. `DefaultPriorityQueue` is almost identical to the JDK version except it is able to remove futures without searching for them in the queue. This means `DefaultPriorityQueue.remove` has `O(log n)` performance. 
Result: Before - cancelling futures has varying performance, capped at `O(n)` After - cancelling futures has stable performance, capped at `O(log n)` Benchmark results After - cancelling in order and in reverse order have similar performance within `O(log n)` bounds ``` Benchmark (num) Mode Cnt Score Error Units ScheduledFutureTaskBenchmark.cancelInOrder 100 thrpt 20 137779.616 ± 7709.751 ops/s ScheduledFutureTaskBenchmark.cancelInOrder 1000 thrpt 20 11049.448 ± 385.832 ops/s ScheduledFutureTaskBenchmark.cancelInOrder 10000 thrpt 20 943.294 ± 12.391 ops/s ScheduledFutureTaskBenchmark.cancelInOrder 100000 thrpt 20 64.210 ± 1.824 ops/s ScheduledFutureTaskBenchmark.cancelInReverseOrder 100 thrpt 20 167531.096 ± 9187.865 ops/s ScheduledFutureTaskBenchmark.cancelInReverseOrder 1000 thrpt 20 33019.786 ± 4737.770 ops/s ScheduledFutureTaskBenchmark.cancelInReverseOrder 10000 thrpt 20 2976.955 ± 248.555 ops/s ScheduledFutureTaskBenchmark.cancelInReverseOrder 100000 thrpt 20 362.654 ± 45.716 ops/s ``` Before - cancelling in order and in reverse order have significantly different performance at higher queue size, orders of magnitude worse than the new implementation. ``` Benchmark (num) Mode Cnt Score Error Units ScheduledFutureTaskBenchmark.cancelInOrder 100 thrpt 20 139968.586 ± 12951.333 ops/s ScheduledFutureTaskBenchmark.cancelInOrder 1000 thrpt 20 12274.420 ± 337.800 ops/s ScheduledFutureTaskBenchmark.cancelInOrder 10000 thrpt 20 958.168 ± 15.350 ops/s ScheduledFutureTaskBenchmark.cancelInOrder 100000 thrpt 20 53.381 ± 13.981 ops/s ScheduledFutureTaskBenchmark.cancelInReverseOrder 100 thrpt 20 123918.829 ± 3642.517 ops/s ScheduledFutureTaskBenchmark.cancelInReverseOrder 1000 thrpt 20 5099.810 ± 206.992 ops/s ScheduledFutureTaskBenchmark.cancelInReverseOrder 10000 thrpt 20 72.335 ± 0.443 ops/s ScheduledFutureTaskBenchmark.cancelInReverseOrder 100000 thrpt 20 0.743 ± 0.003 ops/s ```	2017-11-10 23:09:32 -08:00
Norman Maurer	188ea59c9d	[maven-release-plugin] prepare for next development iteration	2017-11-08 22:36:53 +00:00
Norman Maurer	812354cf1f	[maven-release-plugin] prepare release netty-4.1.17.Final	2017-11-08 22:36:33 +00:00
Carl Mastrangelo	83a19d5650	Optimistically update ref counts Motivation: Highly retained and released objects have contention on their ref count. Currently, the ref count is updated using compareAndSet with care to make sure the count doesn't overflow, double free, or revive the object. Profiling has shown that a non-trivial amount (~1%) of CPU time on gRPC latency benchmarks is from the ref count updating. Modification: Rather than pessimistically assuming the ref count will be invalid, optimistically update it assuming it will be valid. If the update was wrong, then use the slow path to revert the change and throw an exception. Most of the time, the ref counts are correct. This changes from using compareAndSet to getAndAdd, which emits a different CPU instruction on x86 (CMPXCHG to XADD). Because the CPU knows it will modify the memory, it can avoid contention. On a highly contended machine, this can be about 2x faster. There is a downside to the new approach. The ref counters can temporarily enter invalid states if over retained or over released. The code does handle these overflow and underflow scenarios, but it is possible that another concurrent access may push the failure to a different location. For example: Time 1 Thread 1: obj.retain(INT_MAX - 1) Time 2 Thread 1: obj.retain(2) Time 2 Thread 2: obj.retain(1) Previously Thread 2 would always succeed and Thread 1 would always fail on the second access. Now, thread 2 could fail while thread 1 is rolling back its change. ==== There are a few reasons why I think this is okay: 1. Buggy code is going to have bugs. An exception _is_ going to be thrown. This just causes the other threads to notice the state is messed up and stop early. 2. If high retention counts are a use case, then ref count should be a long rather than an int. 3. The critical section is greatly reduced compared to the previous version, so the likelihood of this happening is lower 4. 
On error, the code always rolls back the change atomically, so there is no possibility of corruption. Result: Faster refcounting ``` BEFORE: Benchmark (delay) Mode Cnt Score Error Units AbstractReferenceCountedByteBufBenchmark.retainRelease_contended 1 sample 2901361 804.579 ± 1.835 ns/op AbstractReferenceCountedByteBufBenchmark.retainRelease_contended 10 sample 3038729 785.376 ± 16.471 ns/op AbstractReferenceCountedByteBufBenchmark.retainRelease_contended 100 sample 2899401 817.392 ± 6.668 ns/op AbstractReferenceCountedByteBufBenchmark.retainRelease_contended 1000 sample 3650566 2077.700 ± 0.600 ns/op AbstractReferenceCountedByteBufBenchmark.retainRelease_contended 10000 sample 3005467 19949.334 ± 4.243 ns/op AbstractReferenceCountedByteBufBenchmark.retainRelease_uncontended 1 sample 456091 48.610 ± 1.162 ns/op AbstractReferenceCountedByteBufBenchmark.retainRelease_uncontended 10 sample 732051 62.599 ± 0.815 ns/op AbstractReferenceCountedByteBufBenchmark.retainRelease_uncontended 100 sample 778925 228.629 ± 1.205 ns/op AbstractReferenceCountedByteBufBenchmark.retainRelease_uncontended 1000 sample 633682 2002.987 ± 2.856 ns/op AbstractReferenceCountedByteBufBenchmark.retainRelease_uncontended 10000 sample 506442 19735.345 ± 12.312 ns/op AFTER: Benchmark (delay) Mode Cnt Score Error Units AbstractReferenceCountedByteBufBenchmark.retainRelease_contended 1 sample 3761980 383.436 ± 1.315 ns/op AbstractReferenceCountedByteBufBenchmark.retainRelease_contended 10 sample 3667304 474.429 ± 1.101 ns/op AbstractReferenceCountedByteBufBenchmark.retainRelease_contended 100 sample 3039374 479.267 ± 0.435 ns/op AbstractReferenceCountedByteBufBenchmark.retainRelease_contended 1000 sample 3709210 2044.603 ± 0.989 ns/op AbstractReferenceCountedByteBufBenchmark.retainRelease_contended 10000 sample 3011591 19904.227 ± 18.025 ns/op AbstractReferenceCountedByteBufBenchmark.retainRelease_uncontended 1 sample 494975 52.269 ± 8.345 ns/op 
AbstractReferenceCountedByteBufBenchmark.retainRelease_uncontended 10 sample 771094 62.290 ± 0.795 ns/op AbstractReferenceCountedByteBufBenchmark.retainRelease_uncontended 100 sample 763230 235.044 ± 1.552 ns/op AbstractReferenceCountedByteBufBenchmark.retainRelease_uncontended 1000 sample 634037 2006.578 ± 3.574 ns/op AbstractReferenceCountedByteBufBenchmark.retainRelease_uncontended 10000 sample 506284 19742.605 ± 13.729 ns/op ```	2017-10-04 08:42:33 +02:00
Norman Maurer	625a7426cd	[maven-release-plugin] prepare for next development iteration	2017-09-25 06:12:32 +02:00
Norman Maurer	f57d8f00e1	[maven-release-plugin] prepare release netty-4.1.16.Final	2017-09-25 06:12:16 +02:00
Norman Maurer	3c8c7fc7e9	Reduce performance overhead of ResourceLeakDetector Motivation: The ResourceLeakDetector helps to detect and troubleshoot resource leaks and is often used even in production environments with a low level. Because of this it's important that we try to keep the overhead as low as possible. Most of the time no leak is detected (as all is correctly handled) so we should keep the overhead for this case as low as possible. Modifications: - Only call getStackTrace() if a leak is reported as it is a very expensive native call. Also handle the filtering and creating of the String in a lazy fashion - Remove the need to maintain a Queue to store the last access records - Add benchmark Result: Huge decrease of performance overhead. Before the patch: Benchmark (recordTimes) Mode Cnt Score Error Units ResourceLeakDetectorRecordBenchmark.record 8 thrpt 20 4358.367 ± 116.419 ops/s ResourceLeakDetectorRecordBenchmark.record 16 thrpt 20 2306.027 ± 55.044 ops/s ResourceLeakDetectorRecordBenchmark.recordWithHint 8 thrpt 20 4220.979 ± 114.046 ops/s ResourceLeakDetectorRecordBenchmark.recordWithHint 16 thrpt 20 2250.734 ± 55.352 ops/s With this patch: Benchmark (recordTimes) Mode Cnt Score Error Units ResourceLeakDetectorRecordBenchmark.record 8 thrpt 20 71398.957 ± 2695.925 ops/s ResourceLeakDetectorRecordBenchmark.record 16 thrpt 20 38643.963 ± 1446.694 ops/s ResourceLeakDetectorRecordBenchmark.recordWithHint 8 thrpt 20 71677.882 ± 2923.622 ops/s ResourceLeakDetectorRecordBenchmark.recordWithHint 16 thrpt 20 38660.176 ± 1467.732 ops/s	2017-09-18 16:36:19 -07:00
Norman Maurer	b967805f32	[maven-release-plugin] prepare for next development iteration	2017-08-24 15:38:22 +02:00
Norman Maurer	da8e010a42	[maven-release-plugin] prepare release netty-4.1.15.Final	2017-08-24 15:37:59 +02:00
Norman Maurer	52f384b37f	[maven-release-plugin] prepare for next development iteration	2017-08-02 12:55:10 +00:00
Norman Maurer	8cc1071881	[maven-release-plugin] prepare release netty-4.1.14.Final	2017-08-02 12:54:51 +00:00
Nikolay Fedorovskikh	df568c739e	Use ByteBuf#writeShort/writeMedium instead of writeBytes Motivation: 1. Some encoders used `ByteBuf#writeBytes` to write a short constant byte array (2-3 bytes). This can be replaced with the faster `ByteBuf#writeShort` or `ByteBuf#writeMedium`, which do not access the memory. 2. Two chained calls of `ByteBuf#setByte` with constants can be replaced with one `ByteBuf#setShort` to reduce index checks. 3. The signature of method `HttpHeadersEncoder#encoderHeader` has an unnecessary `throws`. Modifications: 1. Use `ByteBuf#writeShort` or `ByteBuf#writeMedium` instead of `ByteBuf#writeBytes` for the constants. 2. Use `ByteBuf#setShort` instead of chained calls of `ByteBuf#setByte` with constants. 3. Remove an unnecessary `throws` from `HttpHeadersEncoder#encoderHeader`. Result: Slightly faster writes of constants into buffers.	2017-07-10 14:37:41 +02:00
Norman Maurer	2a376eeb1b	[maven-release-plugin] prepare for next development iteration	2017-07-06 13:24:06 +02:00
Norman Maurer	c7f8168324	[maven-release-plugin] prepare release netty-4.1.13.Final	2017-07-06 13:23:51 +02:00
Dmitriy Dumanskiy	dd69a813d4	Performance improvement for HttpRequestEncoder. Insert char into the string optimized. Motivation: Right now HttpRequestEncoder inserts a slash before the question mark for URLs like http://localhost?pararm=1. This is done inefficiently. Modification: Code: new StringBuilder(len + 1) .append(uri, 0, index) .append(SLASH) .append(uri, index, len) .toString(); Replaced with: new StringBuilder(uri) .insert(index, SLASH) .toString(); Result: Faster HttpRequestEncoder. Additional small test. Attached benchmark in PR. Benchmark Mode Cnt Score Error Units HttpRequestEncoderInsertBenchmark.newEncoder thrpt 40 3704843.303 ± 98950.919 ops/s HttpRequestEncoderInsertBenchmark.oldEncoder thrpt 40 3284236.960 ± 134433.217 ops/s	2017-06-27 10:53:43 +02:00
Nikolay Fedorovskikh	aa38b6a769	Prevent unnecessary allocations in the `StringUtil#escapeCsv` Motivation: `StringUtil#escapeCsv` creates a new `StringBuilder` for each value even if the same string is returned in the end. Modifications: Create a new `StringBuilder` only if it is really needed. Otherwise, return the original string (or just the trimmed substring). Result: Less GC load. Up to 4x faster for unchanged strings.	2017-06-13 14:57:38 -07:00


400 Commits