Commit Graph

126 Commits

Author SHA1 Message Date
康智冬
1c69448e2e Fix typos in javadocs (#9527)
Motivation:

We should have correct docs without typos

Modification:

Fix typos and spelling

Result:

More correct docs
2019-10-09 15:25:41 +02:00
Norman Maurer
c5a602b272 Increase maxHeaderListSize for HpackDecoderBenchmark to be able to be… (#9321)
Motivation:

The previous used maxHeaderListSize was too low which resulted in exceptions during the benchmark run:

```
io.netty.handler.codec.http2.Http2Exception: Header size exceeded max allowed size (8192)
	at io.netty.handler.codec.http2.Http2Exception.connectionError(Http2Exception.java:103)
	at io.netty.handler.codec.http2.Http2Exception.headerListSizeError(Http2Exception.java:188)
	at io.netty.handler.codec.http2.Http2CodecUtil.headerListSizeExceeded(Http2CodecUtil.java:231)
	at io.netty.handler.codec.http2.HpackDecoder$Http2HeadersSink.finish(HpackDecoder.java:545)
	at io.netty.handler.codec.http2.HpackDecoder.decode(HpackDecoder.java:132)
	at io.netty.handler.codec.http2.HpackDecoderBenchmark.decode(HpackDecoderBenchmark.java:85)
	at io.netty.handler.codec.http2.generated.HpackDecoderBenchmark_decode_jmhTest.decode_thrpt_jmhStub(HpackDecoderBenchmark_decode_jmhTest.java:120)
	at io.netty.handler.codec.http2.generated.HpackDecoderBenchmark_decode_jmhTest.decode_Throughput(HpackDecoderBenchmark_decode_jmhTest.java:83)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.openjdk.jmh.runner.BenchmarkHandler$BenchmarkTask.call(BenchmarkHandler.java:453)
	at org.openjdk.jmh.runner.BenchmarkHandler$BenchmarkTask.call(BenchmarkHandler.java:437)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:748)

```

Also we should ensure we only use ascii for header names.

Modifications:

Just use Integer.MAX_VALUE as limit

Result:

Be able to run benchmark without exceptions
2019-07-04 11:24:37 +02:00
Carl Mastrangelo
65d8ecc3a0 Use Table lookup for HPACK decoder (#9307)
Motivation:
Table based decoding is fast.

Modification:
Use table based decoding in HPACK decoder, inspired by
https://github.com/python-hyper/hpack/blob/master/hpack/huffman_table.py

This modifies the table to be based on integers, rather than 3-tuples of
bytes.  This is for two reasons:

1.  It's faster
2.  Using bytes makes the static intializer too big, and doesn't
compile.

Result:
Faster Huffman decoding.  This only seems to help the ascii case, the
other decoding is about the same.

Benchmarks:

```
Before:
Benchmark                     (limitToAscii)  (sensitive)  (size)   Mode  Cnt        Score       Error  Units
HpackDecoderBenchmark.decode            true         true   SMALL  thrpt   20   426293.636 ±  1444.843  ops/s
HpackDecoderBenchmark.decode            true         true  MEDIUM  thrpt   20    57843.738 ±   725.704  ops/s
HpackDecoderBenchmark.decode            true         true   LARGE  thrpt   20     3002.412 ±    16.998  ops/s
HpackDecoderBenchmark.decode            true        false   SMALL  thrpt   20   412339.400 ±  1128.394  ops/s
HpackDecoderBenchmark.decode            true        false  MEDIUM  thrpt   20    58226.870 ±   199.591  ops/s
HpackDecoderBenchmark.decode            true        false   LARGE  thrpt   20     3044.256 ±    10.675  ops/s
HpackDecoderBenchmark.decode           false         true   SMALL  thrpt   20  2082615.030 ±  5929.726  ops/s
HpackDecoderBenchmark.decode           false         true  MEDIUM  thrpt   10   571640.454 ± 26499.229  ops/s
HpackDecoderBenchmark.decode           false         true   LARGE  thrpt   20    92714.555 ±  2292.222  ops/s
HpackDecoderBenchmark.decode           false        false   SMALL  thrpt   20  1745872.421 ±  6788.840  ops/s
HpackDecoderBenchmark.decode           false        false  MEDIUM  thrpt   20   490420.323 ±  2455.431  ops/s
HpackDecoderBenchmark.decode           false        false   LARGE  thrpt   20    84536.200 ±   398.714  ops/s

After(bytes):
Benchmark                     (limitToAscii)  (sensitive)  (size)   Mode  Cnt        Score      Error  Units
HpackDecoderBenchmark.decode            true         true   SMALL  thrpt   20   472649.148 ± 7122.461  ops/s
HpackDecoderBenchmark.decode            true         true  MEDIUM  thrpt   20    66739.638 ±  341.607  ops/s
HpackDecoderBenchmark.decode            true         true   LARGE  thrpt   20     3139.773 ±   24.491  ops/s
HpackDecoderBenchmark.decode            true        false   SMALL  thrpt   20   466933.833 ± 4514.971  ops/s
HpackDecoderBenchmark.decode            true        false  MEDIUM  thrpt   20    66111.778 ±  568.326  ops/s
HpackDecoderBenchmark.decode            true        false   LARGE  thrpt   20     3143.619 ±    3.332  ops/s
HpackDecoderBenchmark.decode           false         true   SMALL  thrpt   20  2109995.177 ± 6203.143  ops/s
HpackDecoderBenchmark.decode           false         true  MEDIUM  thrpt   20   586026.055 ± 1578.550  ops/s
HpackDecoderBenchmark.decode           false        false   SMALL  thrpt   20  1775723.270 ± 4932.057  ops/s
HpackDecoderBenchmark.decode           false        false  MEDIUM  thrpt   20   493316.467 ± 1453.037  ops/s
HpackDecoderBenchmark.decode           false        false   LARGE  thrpt   10    85726.219 ±  402.573  ops/s

After(ints):
Benchmark                     (limitToAscii)  (sensitive)  (size)   Mode  Cnt        Score       Error  Units
HpackDecoderBenchmark.decode            true         true   SMALL  thrpt   20   615549.006 ±  5282.283  ops/s
HpackDecoderBenchmark.decode            true         true  MEDIUM  thrpt   20    86714.630 ±   654.489  ops/s
HpackDecoderBenchmark.decode            true         true   LARGE  thrpt   20     3984.439 ±    61.612  ops/s
HpackDecoderBenchmark.decode            true        false   SMALL  thrpt   20   602489.337 ±  5397.024  ops/s
HpackDecoderBenchmark.decode            true        false  MEDIUM  thrpt   20    88399.109 ±   241.115  ops/s
HpackDecoderBenchmark.decode            true        false   LARGE  thrpt   20     3875.729 ±   103.057  ops/s
HpackDecoderBenchmark.decode           false         true   SMALL  thrpt   20  2092165.454 ± 11918.859  ops/s
HpackDecoderBenchmark.decode           false         true  MEDIUM  thrpt   20   583465.437 ±  5452.115  ops/s
HpackDecoderBenchmark.decode           false         true   LARGE  thrpt   20    93290.061 ±   665.904  ops/s
HpackDecoderBenchmark.decode           false        false   SMALL  thrpt   20  1758402.495 ± 14677.438  ops/s
HpackDecoderBenchmark.decode           false        false  MEDIUM  thrpt   10   491598.099 ±  5029.698  ops/s
HpackDecoderBenchmark.decode           false        false   LARGE  thrpt   20    85834.290 ±   554.915  ops/s
```
2019-07-02 20:13:19 +02:00
jimin
78adeb5408 All override methods must be added @override (#9285)
Motivation:

Some methods that either override others or are implemented as part of implementation an interface did miss the `@Override` annotation

Modifications:

Add missing `@Override`s

Result:

Code cleanup
2019-06-27 13:52:06 +02:00
Alex Blewitt
e233407e01 Replace accumulation with blackhole.consume (#9275)
Motivation:

SpotJMHBugs reports that accumulating a value as a way of eliding dead code
elimination may be inadvisable, as discussed in
`JMHSample_34_SafeLooping::measureWrong_2`. Change the test so that it consumes
the response with `Blackhole::consume` instead.

Modifications:

- Replace addition of results with explicit `blackhole.consume()` call

Result:

Tests work as before, but with different benchmark numbers.
2019-06-25 21:46:26 +02:00
Francesco Nigro
c6114786ab Documented non-usage of BlackHole::consume on ByteBufAccessBenchmark (#9279)
Motivation:

Some JMH benchmarks need additional explanations to motivate
specific code choices.

Modifications:

Introduced comment to explai why calling BlackHole::consume
in a loop is not always the right choice for some benchmark.

Result:

The relevant method shows a comment that warn about changing
the code to introduce BlackHole::consume in the loop.
2019-06-25 14:53:12 +02:00
Alex Blewitt
99034a15b5 Return the result of the list.recycle() call (#9264)
Motivation:

Resolve the issue highlighted by SpotJMHBugs that the creation of the RecyclableArrayList may be elided by the JIT since the result isn't consumed or returned.

Modifications:

Return the result of `list.recycle()` so that the list isn't elided.

Result:

The JMH benchmark shows a change in performance indicating that the prior results of this may be unsound.
2019-06-22 07:21:51 +02:00
Carl Mastrangelo
f01278616a Properly debounce wakeups (#9191)
Motivation:
The wakeup logic in EpollEventLoop is overly complex

Modification:
* Simplify the race to wakeup the loop
* Dont let the event loop wake up itself (it's already awake!)
* Make event loop check if there are any more tasks after preparing to
sleep.  There is small window where the non-eventloop writers can issue
eventfd writes here, but that is okay.

Result:
Cleaner wakeup logic.

Benchmarks:

```
BEFORE
Benchmark                                   Mode  Cnt       Score      Error  Units
EpollSocketChannelBenchmark.executeMulti   thrpt   20  408381.411 ± 2857.498  ops/s
EpollSocketChannelBenchmark.executeSingle  thrpt   20  157022.360 ± 1240.573  ops/s
EpollSocketChannelBenchmark.pingPong       thrpt   20   60571.704 ±  331.125  ops/s

Benchmark                                   Mode  Cnt       Score      Error  Units
EpollSocketChannelBenchmark.executeMulti   thrpt   20  440546.953 ± 1652.823  ops/s
EpollSocketChannelBenchmark.executeSingle  thrpt   20  168114.751 ± 1176.609  ops/s
EpollSocketChannelBenchmark.pingPong       thrpt   20   61231.878 ±  520.108  ops/s
```
2019-06-04 05:27:15 -07:00
Nick Hill
23554e6997 Ensure "full" ownership of msgs passed to EmbeddedChannel.writeInbound() (#9058)
Motivation

Pipeline handlers are free to "take control" of input buffers if they have singular refcount - in particular to mutate their raw data if non-readonly via discarding of read bytes, etc.

However there are various places (primarily unit tests) where a wrapped byte-array buffer is passed in and the wrapped array is assumed not to change (used after the wrapped buffer is passed to EmbeddedChannel.writeInbound()). This invalid assumption could result in unexpected errors, such as those exposed by #8931.

Modifications

Anywhere that the data passed to writeInbound() might be used again, ensure that either:
- A copy is used rather than wrapping a shared byte array, or
- The buffer is otherwise protected from modification by making it read-only

For the tests, copying is preferred since it still allows the "mutating" optimizations to be exercised.

Results

Avoid possible errors when pipeline assumes it has full control of input buffer.
2019-05-22 12:35:03 +02:00
Francesco Nigro
635fc9eae0 The benchmark is not taking into account nanoTime granularity (#9033)
Motivation:

Results are just wrong for small delays.

Modifications:

Switching to AvarageTime avoid to rely on OS nanoTime granularity.

Result:

Uncontended low delay results are not reliable
2019-04-15 15:15:08 +02:00
Norman Maurer
0f34345347
Merge ChannelInboundHandler and ChannelOutboundHandler into ChannelHa… (#8957)
Motivation:

In 42742e233f we already added default methods to Channel*Handler and deprecated the Adapter classes to simplify the class hierarchy. With this change we go even further and merge everything into just ChannelHandler. This simplifies things even more in terms of class-hierarchy.

Modifications:

- Merge ChannelInboundHandler | ChannelOutboundHandler into ChannelHandler
- Adjust code to just use ChannelHandler
- Deprecate old interfaces.

Result:

Cleaner and simpler code in terms of class-hierarchy.
2019-03-28 09:28:27 +00:00
Norman Maurer
42742e233f
Deprecate ChannelInboundHandlerAdapter and ChannelOutboundHandlerAdapter (#8929)
Motivation:

As we now us java8 as minimum java version we can deprecate ChannelInboundHandlerAdapter / ChannelOutboundHandlerAdapter and just move the default implementations into the interfaces. This makes things a bit more flexible for the end-user and also simplifies the class-hierarchy.

Modifications:

- Mark ChannelInboundHandlerAdapter and ChannelOutboundHandlerAdapter as deprecated
- Add default implementations to ChannelInboundHandler / ChannelOutboundHandler
- Refactor our code to not use ChannelInboundHandlerAdapter / ChannelOutboundHandlerAdapter anymore

Result:

Cleanup class-hierarchy and make things a bit more flexible.
2019-03-13 09:46:10 +01:00
Nick Hill
35161ad174 Further reduce ensureAccessible() overhead (#8895)
Motivation:

This PR fixes some non-negligible overhead discovered in the ByteBuf
accessibility (non-zero refcount) checking. The cause turned out to be
mostly twofold:
- Unnecessary operations used to calculate the refcount from the "raw"
encoded int field value
- Call stack depths exceeding the default limit for inlining, in some
places (CompositeByteBuf in particular)

It's a follow-on from #8882 which uses the maxCapacity field for a
simpler non-negative check. The performance gap between these two
variants appears to be _mostly_ closed, but there's one exception which
may warrant further analysis.

Modifications:

- Replace ABB.internalRefCount() with ByteBuf.isAccessible(), the
default still checks for non-zero refCnt()
- Just test for parity of raw refCnt instead of converting to "real",
with fast-path for specific small values
- Make sure isAccessible() is delegated by derived/wrapper ByteBufs
- Use existing freed flag in CompositeByteBuf for faster isAccessible()
- Manually inline some calls in methods like CompositeByteBuf.setLong()
and AbstractReferenceCountedByteBuf.isAccessible() to reduce stack
depths (to ensure default inlining limit isn't hit)
- Add ByteBufAccessBenchmark which is an extension of
UnsafeByteBufBenchmark (maybe latter could now be removed)

Results:

Before:

Benchmark   (bufferType)  (checkAccessible)  (checkBounds)   Mode  Cnt
Score          Error  Units
readBatch         UNSAFE               true           true  thrpt   30
84524972.863 ±   518338.811  ops/s
readBatch   UNSAFE_SLICE               true           true  thrpt   30
38608795.037 ±   298176.974  ops/s
readBatch           HEAP               true           true  thrpt   30
80003697.649 ±   974674.119  ops/s
readBatch      COMPOSITE               true           true  thrpt   30
18495554.788 ±   108075.023  ops/s
setGetLong        UNSAFE               true           true  thrpt   30
247069881.578 ± 10839162.593  ops/s
setGetLong  UNSAFE_SLICE               true           true  thrpt   30
196355905.206 ±  1802420.990  ops/s
setGetLong          HEAP               true           true  thrpt   30
245686644.713 ± 11769311.527  ops/s
setGetLong     COMPOSITE               true           true  thrpt   30
83170940.687 ±   657524.123  ops/s
setLong           UNSAFE               true           true  thrpt   30
278940253.918 ±  1807265.259  ops/s
setLong     UNSAFE_SLICE               true           true  thrpt   30
202556738.764 ± 11887973.563  ops/s
setLong             HEAP               true           true  thrpt   30
280045958.053 ±  2719583.400  ops/s
setLong        COMPOSITE               true           true  thrpt   30
121299806.002 ±  2155084.707  ops/s


After:

Benchmark   (bufferType)  (checkAccessible)  (checkBounds)   Mode  Cnt
Score          Error  Units
readBatch         UNSAFE               true           true  thrpt   30
101641801.035 ±  3950050.059  ops/s
readBatch   UNSAFE_SLICE               true           true  thrpt   30
84395902.846 ±  4339579.057  ops/s
readBatch           HEAP               true           true  thrpt   30
100179060.207 ±  3222487.287  ops/s
readBatch      COMPOSITE               true           true  thrpt   30
42288494.472 ±   294919.633  ops/s
setGetLong        UNSAFE               true           true  thrpt   30
304530755.027 ±  6574163.899  ops/s
setGetLong  UNSAFE_SLICE               true           true  thrpt   30
212028547.645 ± 14277828.768  ops/s
setGetLong          HEAP               true           true  thrpt   30
309335422.609 ±  2272150.415  ops/s
setGetLong     COMPOSITE               true           true  thrpt   30
160383609.236 ±   966484.033  ops/s
setLong           UNSAFE               true           true  thrpt   30
298055969.747 ±  7437449.627  ops/s
setLong     UNSAFE_SLICE               true           true  thrpt   30
223784178.650 ±  9869750.095  ops/s
setLong             HEAP               true           true  thrpt   30
302543263.328 ±  8140104.706  ops/s
setLong        COMPOSITE               true           true  thrpt   30
157083673.285 ±  3528779.522  ops/s

There's also a similar knock-on improvement to other benchmarks (e.g.
HPACK encoding/decoding) as shown in #8882.

For sanity I did a final comparison of the "fast path" tweak using one
of the HPACK benchmarks:

(rawCnt & 1) == 0:

Benchmark                     (limitToAscii)  (sensitive)  (size)   Mode
Cnt      Score     Error  Units
HpackDecoderBenchmark.decode            true         true  MEDIUM  thrpt
30  50914.479 ± 940.114  ops/s


rawCnt == 2 || rawCnt == 4 || rawCnt == 6 || rawCnt == 8 ||  (rawCnt &
1) == 0:

Benchmark                     (limitToAscii)  (sensitive)  (size)   Mode
Cnt      Score      Error  Units
HpackDecoderBenchmark.decode            true         true  MEDIUM  thrpt
30  60036.425 ± 1478.196  ops/s
2019-02-28 20:41:16 +01:00
Dmitriy Dumanskiy
116f72db8d Legacy properties removed (#8839)
Motivation:

We can remove some properties for which we introduced replacements.

Modifications:

io.netty.buffer.bytebuf.checkAccessible, io.netty.leakDetectionLevel, org.jboss.netty.tryUnsafe properties removed

Result:

Code cleanup
2019-02-04 13:56:15 +01:00
田欧
e8efcd82a8 migrate java8: use requireNonNull (#8840)
Motivation:

We can just use Objects.requireNonNull(...) as a replacement for ObjectUtil.checkNotNull(....)

Modifications:

- Use Objects.requireNonNull(...)

Result:

Less code to maintain.
2019-02-04 10:32:25 +01:00
Dmitriy Dumanskiy
2e433889b2 Improve DateFormatter parsing performance (#8821)
Motivation:

Just was looking through code and found 1 interesting place DateFormatter.tryParseMonth that was not very effective, so I decided to optimize it a bit.

Modification:

Changed DateFormatter.tryParseMonth method. Instead of invocation regionMatch() for every month - compare chars one by one.

Result:

DateFormatter.parseHttpDate method performance improved from ~3% to ~15%.

Benchmark                                                                (DATE_STRING)   Mode  Cnt        Score       Error  Units
DateFormatter2Benchmark.parseHttpHeaderDateFormatter     Sun, 27 Jan 2016 19:18:46 GMT  thrpt    6  4142781.221 ± 82155.002  ops/s
DateFormatter2Benchmark.parseHttpHeaderDateFormatter     Sun, 27 Dec 2016 19:18:46 GMT  thrpt    6  3781810.558 ± 38679.061  ops/s
DateFormatter2Benchmark.parseHttpHeaderDateFormatterNew  Sun, 27 Jan 2016 19:18:46 GMT  thrpt    6  4372569.705 ± 30257.537  ops/s
DateFormatter2Benchmark.parseHttpHeaderDateFormatterNew  Sun, 27 Dec 2016 19:18:46 GMT  thrpt    6  4339785.100 ± 57542.660  ops/s
2019-02-04 10:04:35 +01:00
Norman Maurer
e3846c54f6
Remove ChannelHandler.exceptionCaught(...) as it should only exist in… (#8822)
Motivation:

ChannelHandler.exceptionCaught(...) was marked as @deprecated as it should only exist in inbound handlers.

Modifications:

Remove ChannelHandler.exceptionCaught(...) and adjust code / tests.

Result:

Fixes https://github.com/netty/netty/issues/8527
2019-01-31 20:29:17 +01:00
Dmitriy Dumanskiy
67b23ab056 Remove HttpHeaderDateFormat class (#8807)
Motivation:

HttpHeaderDateFormat was replaced with DateFormatter many days ago and now can be easily removed.

Modification:

Remove deprecated class and related test / benchmark

Result:

Less code to maintain
2019-01-31 07:22:20 +01:00
田欧
6222101924 migrate java8: use lambda and method reference (#8781)
Motivation:

We can use lambdas now as we use Java8.

Modification:

use lambda function for all package, #8751 only migrate transport package.

Result:

Code cleanup.
2019-01-29 14:06:05 +01:00
Norman Maurer
310f31b392
Update to new checkstyle plugin (#8777)
Motivation:

We need to update to a new checkstyle plugin to allow the usage of lambdas.

Modifications:

- Update to new plugin version.
- Fix checkstyle problems.

Result:

Be able to use checkstyle plugin which supports new Java syntax.
2019-01-24 16:24:19 +01:00
Norman Maurer
3d6e6136a9
Decouple EventLoop details from the IO handling for each transport to… (#8680)
* Decouble EventLoop details from the IO handling for each transport to allow easy re-use of code and customization

Motiviation:

As today extending EventLoop implementations to add custom logic / metrics / instrumentations is only possible in a very limited way if at all. This is due the fact that most implementations are final or even package-private. That said even if these would be public there are the ability to do something useful with these is very limited as the IO processing and task processing are very tightly coupled. All of the mentioned things are a big pain point in netty 4.x and need improvement.

Modifications:

This changeset decoubled the IO processing logic from the task processing logic for the main transport (NIO, Epoll, KQueue) by introducing the concept of an IoHandler. The IoHandler itself is responsible to wait for IO readiness and process these IO events. The execution of the IoHandler itself is done by the SingleThreadEventLoop as part of its EventLoop processing. This allows to use the same EventLoopGroup (MultiThreadEventLoupGroup) for all the mentioned transports by just specify a different IoHandlerFactory during construction.

Beside this core API change this changeset also allows to easily extend SingleThreadEventExecutor / SingleThreadEventLoop to add custom logic to it which then can be reused by all the transports. The ideas are very similar to what is provided by ScheduledThreadPoolExecutor (that is part of the JDK). This allows for example things like:

  * Adding instrumentation / metrics:
    * how many Channels are registered on an SingleThreadEventLoop
    * how many Channels were handled during the IO processing in an EventLoop run
    * how many task were handled during the last EventLoop / EventExecutor run
    * how many outstanding tasks we have
    ...
    ...
  * Implementing custom strategies for choosing the next EventExecutor / EventLoop to use based on these metrics.
  * Use different Promise / Future / ScheduledFuture implementations
  * decorate Runnable / Callables when submitted to the EventExecutor / EventLoop

As a lot of functionalities are folded into the MultiThreadEventLoopGroup and SingleThreadEventLoopGroup this changeset also removes:

  * AbstractEventLoop
  * AbstractEventLoopGroup
  * EventExecutorChooser
  * EventExecutorChooserFactory
  * DefaultEventLoopGroup
  * DefaultEventExecutor
  * DefaultEventExecutorGroup

Result:

Fixes https://github.com/netty/netty/issues/8514 .
2019-01-23 08:32:05 +01:00
Dmitriy Dumanskiy
7b92ff2500 Java 8 migration. Remove ThreadLocalProvider and inline java.util.concurrent.ThreadLocalRandom.current() where necessary. (#8762)
Motivation:

Custom Netty ThreadLocalRandom and ThreadLocalRandomProvider classes are no longer needed and can be removed.

Modification:

Remove own ThreadLocalRandom

Result:

Less code to maintain
2019-01-22 20:14:28 +01:00
田欧
9d62deeb6f Java 8 migration: Use diamond operator (#8749)
Motivation:

We can use the diamond operator these days.

Modification:

Use diamond operator whenever possible.

Result:

More modern code and less boiler-plate.
2019-01-22 16:07:26 +01:00
Norman Maurer
8fdf373557
Skip execution of Channel*Handler method if annotated with @Skip and just use the next handler in the pipeline. (#8723)
Motivation:

Invoking ChannelHandlers is not free and can result in some overhead when the ChannelPipeline becomes very long. This is especially true if most handlers will just forward the call to the next handler in the pipeline. When the user extends Channel*HandlerAdapter we can easily detect if can just skip the handler and invoke the next handler in the pipeline directly. This reduce the overhead of dispatch but also reduce the call-stack in many cases.

Modifications:

Detect if we can skip the handler when walking the pipeline.

Result:

Reduce overhead for long pipelines.

Benchmark                                       (extraHandlers)   Mode  Cnt       Score      Error  Units
DefaultChannelPipelineBenchmark.propagateEventOld             4  thrpt   10  267313.031 ± 9131.140  ops/s
DefaultChannelPipelineBenchmark.propagateEvent                4  thrpt   10  824825.673 ± 12727.594  ops/s
2019-01-22 08:58:58 +01:00
Norman Maurer
1fe931b6e2
Make it possible to use a wrapped EventLoop with a Channel (#8677)
Motiviation:

Because of how we implemented the registration / deregistration of an EventLoop it was not possible to wrap an EventLoop implementation and use it with a Channel.

Modification:

- Introduce EventLoop.Unsafe which is responsible for the actual registration.
- Move validation of EventLoop / Channel combo to the EventLoop
- Add unit test that verifies that wrapping works

Result:

Be able to wrap an EventLoop and so add some extra functionality.
2019-01-17 09:17:51 +01:00
Norman Maurer
c10ccc5dec
Tighten contract between Channel and EventLoop by require the EventLoop on Channel construction. (#8587)
Motivation:

At the moment it’s possible to have a Channel in Netty that is not registered / assigned to an EventLoop until register(...) is called. This is suboptimal as if the Channel is not registered it is also not possible to do anything useful with a ChannelFuture that belongs to the Channel. We should think about if we should have the EventLoop as a constructor argument of a Channel and have the register / deregister method only have the effect of add a Channel to KQueue/Epoll/... It is also currently possible to deregister a Channel from one EventLoop and register it with another EventLoop. This operation defeats the threading model assumptions that are wide spread in Netty, and requires careful user level coordination to pull off without any concurrency issues. It is not a commonly used feature in practice, may be better handled by other means (e.g. client side load balancing), and therefore we propose removing this feature.

Modifications:

- Change all Channel implementations to require an EventLoop for construction ( + an EventLoopGroup for all ServerChannel implementations)
- Remove all register(...) methods from EventLoopGroup
- Add ChannelOutboundInvoker.register(...) which now basically means we want to register on the EventLoop for IO.
- Change ChannelUnsafe.register(...) to not take an EventLoop as parameter (as the EventLoop is supplied on custruction).
- Change ChannelFactory to take an EventLoop to create new Channels and introduce ServerChannelFactory which takes an EventLoop and one EventLoopGroup to create new ServerChannel instances.
- Add ServerChannel.childEventLoopGroup()
- Ensure all operations on the accepted Channel is done in the EventLoop of the Channel in ServerBootstrap
- Change unit tests for new behaviour

Result:

A Channel always has an EventLoop assigned which will never change during its life-time. This ensures we are always be able to call any operation on the Channel once constructed (unit the EventLoop is shutdown). This also simplifies the logic in DefaultChannelPipeline a lot as we can always call handlerAdded / handlerRemoved directly without the need to wait for register() to happen.

Also note that its still possible to deregister a Channel and register it again. It's just not possible anymore to move from one EventLoop to another (which was not really safe anyway).

Fixes https://github.com/netty/netty/issues/8513.
2019-01-14 20:11:13 +01:00
Norman Maurer
d9a6cf341c
Remove support for marking reader and writerIndex in ByteBuf to reduce overhead and complexity. (#8636)
Motivation:

ByteBuf supports “marker indexes”. The intended use case for these is if a speculative operation (e.g. decode) is in process the user can “mark” and interface and refer to it later if the operation isn’t successful (e.g. not enough data). However this is rarely used in practice,
requires extra memory to maintain, and introduces complexity in the state management for derived/pooled buffer initialization, resizing, and other operations which may modify reader/writer indexes.

Modifications:

Remove support for marking and adjust testcases / code.

Result:

Fixes https://github.com/netty/netty/issues/8535.
2018-12-11 14:00:49 +01:00
Francesco Nigro
4c2b11633a Adding an execute burst cost benchmark for Netty executors (#8594)
Motivation:

Netty executors doesn't have yet any means to compare with each others
nor to compare with the j.u.c. executors

Modifications:

A new benchmark measuring execute burst cost is being added

Result:

It's now possible to compare some of Netty executors with each others
and with the j.u.c. executors
2018-12-04 15:46:48 +01:00
Nick Hill
10539f4dc7 Streamline CompositeByteBuf internals (#8437)
Motivation:

CompositeByteBuf is a powerful and versatile abstraction, allowing for
manipulation of large data without copying bytes. There is still a
non-negligible cost to reading/writing however relative to "singular"
ByteBufs, and this can be mostly eliminated with some rework of the
internals.

My use case is message modification/transformation while zero-copy
proxying. For example replacing a string within a large message with one
of a different length

Modifications:

- No longer slice added buffers and unwrap added slices
   - Components store target buf offset relative to position in
composite buf
   - Less allocations, object footprint, pointer indirection, offset
arithmetic
- Use Component[] rather than ArrayList<Component>
   - Avoid pointer indirection and duplicate bounds check, more
efficient backing array growth
   - Facilitates optimization when doing bulk-inserts - inserting n
ByteBufs behind m is now O(m + n) instead of O(mn)
- Avoid unnecessary casting and method call indirection via superclass
- Eliminate some duplicate range/ref checks via non-checking versions of
toComponentIndex and findComponent
- Add simple fast-path for toComponentIndex(0); add racy cache of
last-accessed Component to findComponent(int)
- Override forEachByte0(...) and forEachByteDesc0(...) methods
- Make use of RecyclableArrayList in nioBuffers(int, int) (in line with
FasterCompositeByteBuf impl)
- Modify addComponents0(boolean,int,Iterable) to use the Iterable
directly rather than copy to an array first (and possibly to an
ArrayList before that)
- Optimize addComponents0(boolean,int,ByteBuf[],int) to not perform
repeated array insertions and avoid second loop for offset updates
- Simplify other logic in various places, in particular the general
pattern used where a sub-range is iterated over
- Add benchmarks to demonstrate some improvements

While refactoring I also came across a couple of clear bugs. They are
fixed in these changes but I will open another PR with unit tests and
fixes to the current version.

Result:

Much faster creation, manipulation, and access; many fewer allocations
and smaller footprint. Benchmark results to follow.
2018-11-03 10:37:07 +01:00
Nick Hill
583d838f7c Optimize AbstractByteBuf.getCharSequence() in US_ASCII case (#8392)
* Optimize AbstractByteBuf.getCharSequence() in US_ASCII case

Motivation:

Inspired by https://github.com/netty/netty/pull/8388, I noticed this
simple optimization to avoid char[] allocation (also suggested in a TODO
here).

Modifications:

Return an AsciiString from AbstractByteBuf.getCharSequence() if
requested charset is US_ASCII or ISO_8859_1 (latter thanks to
@Scottmitch's suggestion). Also tweak unit tests not to require Strings
and include a new benchmark to demonstrate the speedup.

Result:

Speed-up of AbstractByteBuf.getCharSequence() in ascii and iso 8859/1
cases
2018-10-26 15:32:38 -07:00
Norman Maurer
87ec2f882a
Reduce overhead by ByteBufUtil.decodeString(...) which is used by AbstractByteBuf.toString(...) and AbstractByteBuf.getCharSequence(...) (#8388)
Motivation:

Our current implementation that is used for toString(Charset) operations on AbstractByteBuf implementation is quite slow as it does a lot of uncessary memory copies. We should just use new String(...) as it has a lot of optimizations to handle these cases.

Modifications:

Rewrite ByteBufUtil.decodeString(...) to use new String(...)

Result:

Less overhead for toString(Charset) operations.

Benchmark                                         (charsetName)  (direct)  (size)   Mode  Cnt         Score         Error  Units
ByteBufUtilDecodeStringBenchmark.decodeString          US-ASCII     false       8  thrpt   20  22401645.093 ? 4671452.479  ops/s
ByteBufUtilDecodeStringBenchmark.decodeString          US-ASCII     false      64  thrpt   20  23678483.384 ? 3749164.446  ops/s
ByteBufUtilDecodeStringBenchmark.decodeString          US-ASCII      true       8  thrpt   20  15731142.651 ? 3782931.591  ops/s
ByteBufUtilDecodeStringBenchmark.decodeString          US-ASCII      true      64  thrpt   20  16244232.229 ? 1886259.658  ops/s
ByteBufUtilDecodeStringBenchmark.decodeString             UTF-8     false       8  thrpt   20  25983680.959 ? 5045782.289  ops/s
ByteBufUtilDecodeStringBenchmark.decodeString             UTF-8     false      64  thrpt   20  26235589.339 ? 2867004.950  ops/s
ByteBufUtilDecodeStringBenchmark.decodeString             UTF-8      true       8  thrpt   20  18499027.808 ? 4784684.268  ops/s
ByteBufUtilDecodeStringBenchmark.decodeString             UTF-8      true      64  thrpt   20  16825286.141 ? 1008712.342  ops/s
ByteBufUtilDecodeStringBenchmark.decodeString            UTF-16     false       8  thrpt   20   5789879.092 ? 1201786.359  ops/s
ByteBufUtilDecodeStringBenchmark.decodeString            UTF-16     false      64  thrpt   20   2173243.225 ?  417809.341  ops/s
ByteBufUtilDecodeStringBenchmark.decodeString            UTF-16      true       8  thrpt   20   5035583.011 ? 1001978.854  ops/s
ByteBufUtilDecodeStringBenchmark.decodeString            UTF-16      true      64  thrpt   20   2162345.301 ?  402410.408  ops/s
ByteBufUtilDecodeStringBenchmark.decodeString        ISO-8859-1     false       8  thrpt   20  30039052.376 ? 6539111.622  ops/s
ByteBufUtilDecodeStringBenchmark.decodeString        ISO-8859-1     false      64  thrpt   20  31414163.515 ? 2096710.526  ops/s
ByteBufUtilDecodeStringBenchmark.decodeString        ISO-8859-1      true       8  thrpt   20  19538587.855 ? 4639115.572  ops/s
ByteBufUtilDecodeStringBenchmark.decodeString        ISO-8859-1      true      64  thrpt   20  19467839.722 ? 1672687.213  ops/s
ByteBufUtilDecodeStringBenchmark.decodeStringOld       US-ASCII     false       8  thrpt   20  10787326.745 ? 1034197.864  ops/s
ByteBufUtilDecodeStringBenchmark.decodeStringOld       US-ASCII     false      64  thrpt   20   7129801.930 ? 1363019.209  ops/s
ByteBufUtilDecodeStringBenchmark.decodeStringOld       US-ASCII      true       8  thrpt   20   9002529.605 ? 2017642.445  ops/s
ByteBufUtilDecodeStringBenchmark.decodeStringOld       US-ASCII      true      64  thrpt   20   3860192.352 ?  826218.738  ops/s
ByteBufUtilDecodeStringBenchmark.decodeStringOld          UTF-8     false       8  thrpt   20  10532838.027 ? 2151743.968  ops/s
ByteBufUtilDecodeStringBenchmark.decodeStringOld          UTF-8     false      64  thrpt   20   7185554.597 ? 1387685.785  ops/s
ByteBufUtilDecodeStringBenchmark.decodeStringOld          UTF-8      true       8  thrpt   20   7352253.316 ? 1333823.850  ops/s
ByteBufUtilDecodeStringBenchmark.decodeStringOld          UTF-8      true      64  thrpt   20   2825578.707 ?  349701.156  ops/s
ByteBufUtilDecodeStringBenchmark.decodeStringOld         UTF-16     false       8  thrpt   20   7277446.665 ? 1447034.346  ops/s
ByteBufUtilDecodeStringBenchmark.decodeStringOld         UTF-16     false      64  thrpt   20   2445929.579 ?  562816.641  ops/s
ByteBufUtilDecodeStringBenchmark.decodeStringOld         UTF-16      true       8  thrpt   20   6201174.401 ? 1236137.786  ops/s
ByteBufUtilDecodeStringBenchmark.decodeStringOld         UTF-16      true      64  thrpt   20   2310674.973 ?  525587.959  ops/s
ByteBufUtilDecodeStringBenchmark.decodeStringOld     ISO-8859-1     false       8  thrpt   20  11142625.392 ? 1680556.468  ops/s
ByteBufUtilDecodeStringBenchmark.decodeStringOld     ISO-8859-1     false      64  thrpt   20   8127116.405 ? 1128513.860  ops/s
ByteBufUtilDecodeStringBenchmark.decodeStringOld     ISO-8859-1      true       8  thrpt   20   9405751.952 ? 2193324.806  ops/s
ByteBufUtilDecodeStringBenchmark.decodeStringOld     ISO-8859-1      true      64  thrpt   20   3943282.076 ?  737798.070  ops/s

Benchmark result is saved to /home/norman/mainframer/netty/microbench/target/reports/performance/ByteBufUtilDecodeStringBenchmark.json
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1,030.173 sec - in io.netty.buffer.ByteBufUtilDecodeStringBenchmark
[1030.460s][info   ][gc,heap,exit ] Heap
[1030.460s][info   ][gc,heap,exit ]  garbage-first heap   total 516096K, used 257918K [0x0000000609a00000, 0x0000000800000000)
[1030.460s][info   ][gc,heap,exit ]   region size 2048K, 127 young (260096K), 2 survivors (4096K)
[1030.460s][info   ][gc,heap,exit ]  Metaspace       used 17123K, capacity 17438K, committed 17792K, reserved 1064960K
[1030.460s][info   ][gc,heap,exit ]   class space    used 1709K, capacity 1827K, committed 1920K, reserved 1048576K
2018-10-19 14:00:13 +02:00
Norman Maurer
e542a2cf26
Use a non-volatile read for ensureAccessible() whenever possible to reduce overhead and allow better inlining. (#8266)
Motiviation:

At the moment whenever ensureAccessible() is called in our ByteBuf implementations (which is basically on each operation) we will do a volatile read. That per-se is not such a bad thing but the problem here is that it will also reduce the the optimizations that the compiler / jit can do. For example as these are volatile it can not eliminate multiple loads of it when inline the methods of ByteBuf which happens quite frequently because most of them a quite small and very hot. That is especially true for all the methods that act on primitives.

It gets even worse as people often call a lot of these after each other in the same method or even use method chaining here.

The idea of the change is basically just ue a non-volatile read for the ensureAccessible() check as its a best-effort implementation to detect acting on already released buffers anyway as even with a volatile read it could happen that the user will release it in another thread before we actual access the buffer after the reference check.

Modifications:

- Try to do a non-volatile read using sun.misc.Unsafe if we can use it.
- Add a benchmark

Result:

Big performance win when multiple ByteBuf methods are called from a method.

With the change:
UnsafeByteBufBenchmark.setGetLongUnsafeByteBuf  thrpt   20  281395842,128 ± 5050792,296  ops/s

Before the change:
UnsafeByteBufBenchmark.setGetLongUnsafeByteBuf  thrpt   20  217419832,801 ± 5080579,030  ops/s
2018-09-07 07:47:02 +02:00
Norman Maurer
02d559e6a4
Remove flags when running benchmarks. (#8262)
Motivation:

Some of the flags we used are not supported anymore on more recent JDK versions. We should just remove all of them and only keep what we really need. This may also reflect better what people use in production.

Modifications:

Remove some flags when running the benchmarks.

Result:

Benchmarks also run with JDK11.
2018-09-05 19:05:02 +02:00
Carl Mastrangelo
379a56ca49 Add an Epoll benchmark
Motivation:
Optimizing the Epoll channel needs an objective measure of how fast
it is.

Modification:
Add a simple, closed loop,  ping-pong benchmark.

Result:
Benchmark can be used to measure #7816

Initial numbers:

```
Result "io.netty.microbench.channel.epoll.EpollSocketChannelBenchmark.pingPong":
  22614.403 ±(99.9%) 797.263 ops/s [Average]
  (min, avg, max) = (21093.160, 22614.403, 24977.387), stdev = 918.130
  CI (99.9%): [21817.140, 23411.666] (assumes normal distribution)

Benchmark                              Mode  Cnt      Score     Error  Units
EpollSocketChannelBenchmark.pingPong  thrpt   20  22614.403 ± 797.263  ops/s
```
2018-09-04 10:15:15 +02:00
Francesco Nigro
c78be33443 Added configurable ByteBuf bounds checking (#7521)
Motivation:

The JVM isn't always able to hoist out/reduce bounds checking (due to ref counting operations etc etc) hence making it configurable could improve performances for most CPU intensive use cases.

Modifications:

Each AbstractByteBuf bounds check has been tested against a new static final configuration property similar to checkAccessible ie io.netty.buffer.bytebuf.checkBounds.

Result:

Any user could disable ByteBuf bounds checking in order to get extra performances.
2018-09-03 20:33:47 +02:00
Norman Maurer
83710cb2e1
Replace toArray(new T[size]) with toArray(new T[0]) to eliminate zero-out and allow the VM to optimize. (#8075)
Motivation:

Using toArray(new T[0]) is usually the faster aproach these days. We should use it.

See also https://shipilev.net/blog/2016/arrays-wisdom-ancients/#_conclusion.

Modifications:

Replace toArray(new T[size]) with toArray(new T[0]).

Result:

Faster code.
2018-06-29 07:56:04 +02:00
unknown
4a8d3a274c Including the setup code in the benchmark method to avoid JMH Invocation level hiccups.
Motivation:

The usage of Invocation level for JMH fixture methods (setup/teardown) inccurs in a significant overhead
in the benchmark time (see org.openjdk.jmh.annotations.Level documentation).

In the case of CodecInputListBenchmark, benchmarks are far too small (less than 50ns) and the Invocation
level setup offsets the measurement considerably.
On such cases, the recommended fix patch is to include the setup/teardown code in the benchmark method.

Modifications:

Include the setup/teardown code in the relevant benchmark methods.
Remove the setup/teardown methods from the benchmark class.

Result:

We run the entire benchmark 10 times with default parameters we observed:
- ArrayList benchmark affected directly by JMH overhead is now from 15-80% faster.
- CodecList benchmark is now 50% faster than original (even with the setup code being measured).
- Recyclable ArrayList is ~30% slower.
- All benchmarks have significant different means (ANOVA) and medians (Moore)

Mode: Throughput (Higher the better)

Method	              Full params		Factor	    Modified (Median)	Original (Median)
recyclableArrayList	 (elements = 1)		0.615520967	21719082.75	        35285691.2
recyclableArrayList	 (elements = 4)		0.699553431	17149442.76	        24514843.31
arrayList	         (elements = 4)		1.152666631	27120407.18	        23528404.88
codecOutList	     (elements = 1)		1.527275908	67251089.04	        44033359.47
codecOutList	     (elements = 4)		1.596917095	59174088.78	        37055204.03
arrayList	         (elements = 1)		1.878616889	62188238.24	        33103204.06

Environment:
Tests run on a Computational server with CPU: E5-1660-3.3GHZ  (6 cores + HT), 64 GB RAM.
2018-06-21 12:22:13 +02:00
unknown
cb420a9ffc Including the setup code in the benchmark method to avoid JMH Invocation level hiccups.
Motivation:

The usage of Invocation level for JMH fixture methods (setup/teardown) inccurs in a significant impact in
in the benchmark time (see org.openjdk.jmh.annotations.Level documentation).

When the benchmark and the setup/teardown is too small (less than a milisecond) the Invocation level might saturate the system with
timestamp requests and iteration synchronizations which introduce artificial latency, throughput, and scalability bottlenecks.

In the HeadersBenchmark, all benchmarks take less than 100ns and the Invocation level setup offsets the measurement considerably.
As fixture methods is defined for the entire class, this overhead also impacts every single benchmark in this class, not only
the ones that use the emptyHttpHeaders object (cleaned in the setup).

The recommended fix patch here is to include the setup/teardown code in the benchmark where the object is used.

Modifications:

Include the setup/teardown code in the relevant benchmark methods.
Remove the setup/teardown method of Invocation level from the benchmark class.

Result:

We run all benchmarks from HeadersBenchmark 10 times with default parameter, we observe:
- Benchmarks that were not directly affected by the fix patch, improved execution time.
    For instance, http2Remove with (exampleHeader = THREE) had its median reported as 2x faster than the original version.
- Benchmarks that had the setup code inserted (eg. http2AddAllFastest) did not suffer a significant punch in the execution time,
as the benchmarks are not dominated by the clear().

Environment:
Tests run on a Computational server with CPU: E5-1660-3.3GHZ  (6 cores + HT), 64 GB RAM.
2018-06-21 12:21:19 +02:00
Scott Mitchell
9d51a40df0 Update NetUtilBenchmark (#7826)
Motivation:
NetUtilBenchmark is using out of date data, throws an exception in the benchmark, and allocates a Set on each run.

Modifications:
- Update the benchmark and reduce each run's overhead

Result:
NetUtilBenchmark is updated.
2018-03-31 08:27:08 +02:00
Francesco Nigro
ed46c4ed00 Copies from read-only heap ByteBuffer to direct ByteBuf can avoid stealth ByteBuf allocation and additional copies
Motivation:

Read-only heap ByteBuffer doesn't expose array: the existent method to perform copies to direct ByteBuf involves the creation of a (maybe pooled) additional heap ByteBuf instance and copy

Modifications:

To avoid stressing the allocator with additional (and stealth) heap ByteBuf allocations is provided a method to perform copies using the (pooled) internal NIO buffer

Result:

Copies from read-only heap ByteBuffer to direct ByteBuf won't create any intermediate ByteBuf
2018-02-27 09:54:21 +09:00
Julien Hoarau
3e6b54bb59 Fix failing h2spec tests 8.1.2.1 related to pseudo-headers validation
Motivation:

According to the spec:
All pseudo-header fields MUST appear in the header block before regular
header fields. Any request or response that contains a pseudo-header
field that appears in a header block after
a regular header field MUST be treated as malformed (Section 8.1.2.6).

Pseudo-header fields are only valid in the context in which they are defined.
Pseudo-header fields defined for requests MUST NOT appear in responses;
pseudo-header fields defined for responses MUST NOT appear in requests.
Pseudo-header fields MUST NOT appear in trailers.
Endpoints MUST treat a request or response that contains undefined or
invalid pseudo-header fields as malformed (Section 8.1.2.6).

Clients MUST NOT accept a malformed response. Note that these requirements
are intended to protect against several types of common attacks against HTTP;
they are deliberately strict because being permissive can expose
implementations to these vulnerabilities.

Modifications:

- Introduce validation in HPackDecoder

Result:

- Requests with unknown pseudo-field headers are rejected
- Requests with containing response specific pseudo-headers are rejected
- Requests where pseudo-header appear after regular header are rejected
- h2spec 8.1.2.1 pass
2018-01-29 19:42:56 -08:00
Norman Maurer
4c1e0f596a Use FastThreadLocal for CodecOutputList
Motivation:

We used Recycler for the CodecOutputList which is not optimized for the use-case of access only from the same Thread all the time.

Modifications:

- Use FastThreadLocal for CodecOutputList
- Add benchmark

Result:

Less overhead in our codecs.
2018-01-23 11:34:28 +01:00
Francesco Nigro
1cf2687244 Fixed JMH ByteBuf benchmark to avoid dead code elimination
Motivation:

The JMH doc suggests to use BlackHoles to avoid dead code elimination hence would be better to follow this best practice.

Modifications:

Each benchmark method is returning the ByteBuf/ByteBuffer to avoid the JVM to perform any dead code elimination.

Result:

The results are more reliable and comparable to the others provided by other ByteBuf benchmarks (eg HeapByteBufBenchmark)
2017-12-19 14:09:18 +01:00
Scott Mitchell
55ef09f191 Add HttpObjectEncoderBenchmark
Motivation:
Benchmark to measure HttpObjectEncoder performance.

Modifications:
- Create new benchmark HttpObjectEncoderBenchmark

Result:
JMH Microbenchmark for HttpObjectEncoder.
2017-12-16 13:47:34 +01:00
Scott Mitchell
5f0342ebe0 Add RedisEncoderBenchmark
Motivation:
Add a benchmark to measure RedisEncoder's performance

Modifications:
- Add RedisEncoderBenchmark

Result:
JMH benchmark exists to measure RedisEncoder's performance.
2017-12-16 13:42:50 +01:00
Scott Mitchell
93b144b7b4 HttpMethod#valueOf improvement
Motivation:
HttpMethod#valueOf shows up on profiler results in the top set of
results. Since it is a relatively simple operation it can be improved in
isolation.

Modifications:
- Introduce a special case map which assigns each HttpMethod to a unique
index in an array and provides constant time lookup from a hash code
algorithm. When the bucket is matched we can then directly do equality
comparison instead of potentially following a linked structure when
HashMap has hash collisions.

Result:
~10% improvement in benchmark results for HttpMethod#valueOf

Benchmark                                     Mode  Cnt   Score   Error   Units
HttpMethodMapBenchmark.newMapKnownMethods    thrpt   16  31.831 ± 0.928  ops/us
HttpMethodMapBenchmark.newMapMixMethods      thrpt   16  25.568 ± 0.400  ops/us
HttpMethodMapBenchmark.newMapUnknownMethods  thrpt   16  51.413 ± 1.824  ops/us
HttpMethodMapBenchmark.oldMapKnownMethods    thrpt   16  29.226 ± 0.330  ops/us
HttpMethodMapBenchmark.oldMapMixMethods      thrpt   16  21.073 ± 0.247  ops/us
HttpMethodMapBenchmark.oldMapUnknownMethods  thrpt   16  49.081 ± 0.577  ops/us
2017-11-20 11:07:50 -08:00
Scott Mitchell
e6126215e0 DefaultHttp2FrameWriter reduce object allocation
Motivation:
DefaultHttp2FrameWriter#writeData allocates a DataFrameHeader for each write operation. DataFrameHeader maintains internal state and allocates multiple slices of a buffer which is a maximum of 30 bytes. This 30 byte buffer may not always be necessary and the additional slice operations can utilize retainedSlice to take advantage of pooled objects. We can also save computation and object allocations if there is no padding which is a common case in practice.

Modifications:
- Remove DataFrameHeader
- Add a fast path for padding == 0

Result:
Less object allocation in DefaultHttp2FrameWriter
2017-11-20 08:10:59 -08:00
Anuraag Agrawal
1f1a60ae7d Use Netty's DefaultPriorityQueue instead of JDK's PriorityQueue for scheduled tasks
Motivation:

`AbstractScheduledEventExecutor` uses a standard `java.util.PriorityQueue` to keep track of task deadlines. `ScheduledFuture.cancel` removes tasks from this `PriorityQueue`. Unfortunately, `PriorityQueue.remove` has `O(n)` performance since it must search for the item in the entire queue before removing it. This is fast when the future is at the front of the queue (e.g., already triggered) but not when it's randomly located in the queue.

Many servers will use `ScheduledFuture.cancel` on all requests, e.g., to manage a request timeout. As these cancellations will be happen in arbitrary order, when there are many scheduled futures, `PriorityQueue.remove` is a bottleneck and greatly hurts performance with many concurrent requests (>10K).

Modification:

Use netty's `DefaultPriorityQueue` for scheduling futures instead of the JDK. `DefaultPriorityQueue` is almost identical to the JDK version except it is able to remove futures without searching for them in the queue. This means `DefaultPriorityQueue.remove` has `O(log n)` performance.

Result:

Before - cancelling futures has varying performance, capped at `O(n)`
After - cancelling futures has stable performance, capped at `O(log n)`

Benchmark results

After - cancelling in order and in reverse order have similar performance within `O(log n)` bounds
```
Benchmark                                           (num)   Mode  Cnt       Score      Error  Units
ScheduledFutureTaskBenchmark.cancelInOrder            100  thrpt   20  137779.616 ± 7709.751  ops/s
ScheduledFutureTaskBenchmark.cancelInOrder           1000  thrpt   20   11049.448 ±  385.832  ops/s
ScheduledFutureTaskBenchmark.cancelInOrder          10000  thrpt   20     943.294 ±   12.391  ops/s
ScheduledFutureTaskBenchmark.cancelInOrder         100000  thrpt   20      64.210 ±    1.824  ops/s
ScheduledFutureTaskBenchmark.cancelInReverseOrder     100  thrpt   20  167531.096 ± 9187.865  ops/s
ScheduledFutureTaskBenchmark.cancelInReverseOrder    1000  thrpt   20   33019.786 ± 4737.770  ops/s
ScheduledFutureTaskBenchmark.cancelInReverseOrder   10000  thrpt   20    2976.955 ±  248.555  ops/s
ScheduledFutureTaskBenchmark.cancelInReverseOrder  100000  thrpt   20     362.654 ±   45.716  ops/s
```

Before - cancelling in order and in reverse order have significantly different performance at higher queue size, orders of magnitude worse than the new implementation.
```
Benchmark                                           (num)   Mode  Cnt       Score       Error  Units
ScheduledFutureTaskBenchmark.cancelInOrder            100  thrpt   20  139968.586 ± 12951.333  ops/s
ScheduledFutureTaskBenchmark.cancelInOrder           1000  thrpt   20   12274.420 ±   337.800  ops/s
ScheduledFutureTaskBenchmark.cancelInOrder          10000  thrpt   20     958.168 ±    15.350  ops/s
ScheduledFutureTaskBenchmark.cancelInOrder         100000  thrpt   20      53.381 ±    13.981  ops/s
ScheduledFutureTaskBenchmark.cancelInReverseOrder     100  thrpt   20  123918.829 ±  3642.517  ops/s
ScheduledFutureTaskBenchmark.cancelInReverseOrder    1000  thrpt   20    5099.810 ±   206.992  ops/s
ScheduledFutureTaskBenchmark.cancelInReverseOrder   10000  thrpt   20      72.335 ±     0.443  ops/s
ScheduledFutureTaskBenchmark.cancelInReverseOrder  100000  thrpt   20       0.743 ±     0.003  ops/s
```
2017-11-10 23:09:32 -08:00
Carl Mastrangelo
83a19d5650 Optimistically update ref counts
Motivation:
Highly retained and released objects have contention on their ref
count.  Currently, the ref count is updated using compareAndSet
with care to make sure the count doesn't overflow, double free, or
revive the object.

Profiling has shown that a non trivial (~1%) of CPU time on gRPC
latency benchmarks is from the ref count updating.

Modification:
Rather than pessimistically assuming the ref count will be invalid,
optimistically update it assuming it will be.  If the update was
wrong, then use the slow path to revert the change and throw an
execption.  Most of the time, the ref counts are correct.

This changes from using compareAndSet to getAndAdd, which emits a
different CPU instruction on x86 (CMPXCHG to XADD).  Because the
CPU knows it will modifiy the memory, it can avoid contention.

On a highly contended machine, this can be about 2x faster.

There is a downside to the new approach.  The ref counters can
temporarily enter invalid states if over retained or over released.
The code does handle these overflow and underflow scenarios, but it
is possible that another concurrent access may push the failure to
a different location.  For example:

Time 1 Thread 1: obj.retain(INT_MAX - 1)
Time 2 Thread 1: obj.retain(2)
Time 2 Thread 2: obj.retain(1)

Previously Thread 2 would always succeed and Thread 1 would always
fail on the second access.  Now, thread 2 could fail while thread 1
is rolling back its change.

====

There are a few reasons why I think this is okay:

1. Buggy code is going to have bugs.  An exception _is_ going to be
   thrown.  This just causes the other threads to notice the state
   is messed up and stop early.
2. If high retention counts are a use case, then ref count should
   be a long rather than an int.
3. The critical section is greatly reduced compared to the previous
   version, so the likelihood of this happening is lower
4. On error, the code always rollsback the change atomically, so
   there is no possibility of corruption.

Result:
Faster refcounting

```
BEFORE:

Benchmark                                                                                             (delay)    Mode      Cnt         Score    Error  Units
AbstractReferenceCountedByteBufBenchmark.retainRelease_contended                                            1  sample  2901361       804.579 ±  1.835  ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_contended                                           10  sample  3038729       785.376 ± 16.471  ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_contended                                          100  sample  2899401       817.392 ±  6.668  ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_contended                                         1000  sample  3650566      2077.700 ±  0.600  ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_contended                                        10000  sample  3005467     19949.334 ±  4.243  ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_uncontended                                          1  sample   456091        48.610 ±  1.162  ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_uncontended                                         10  sample   732051        62.599 ±  0.815  ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_uncontended                                        100  sample   778925       228.629 ±  1.205  ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_uncontended                                       1000  sample   633682      2002.987 ±  2.856  ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_uncontended                                      10000  sample   506442     19735.345 ± 12.312  ns/op

AFTER:
Benchmark                                                                                             (delay)    Mode      Cnt         Score    Error  Units
AbstractReferenceCountedByteBufBenchmark.retainRelease_contended                                            1  sample  3761980       383.436 ±  1.315  ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_contended                                           10  sample  3667304       474.429 ±  1.101  ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_contended                                          100  sample  3039374       479.267 ±  0.435  ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_contended                                         1000  sample  3709210      2044.603 ±  0.989  ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_contended                                        10000  sample  3011591     19904.227 ± 18.025  ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_uncontended                                          1  sample   494975        52.269 ±  8.345  ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_uncontended                                         10  sample   771094        62.290 ±  0.795  ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_uncontended                                        100  sample   763230       235.044 ±  1.552  ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_uncontended                                       1000  sample   634037      2006.578 ±  3.574  ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_uncontended                                      10000  sample   506284     19742.605 ± 13.729  ns/op

```
2017-10-04 08:42:33 +02:00
Norman Maurer
3c8c7fc7e9 Reduce performance overhead of ResourceLeakDetector
Motiviation:

The ResourceLeakDetector helps to detect and troubleshoot resource leaks and is often used even in production enviroments with a low level. Because of this its import that we try to keep the overhead as low as overhead. Most of the times no leak is detected (as all is correctly handled) so we should keep the overhead for this case as low as possible.

Modifications:

- Only call getStackTrace() if a leak is reported as it is a very expensive native call. Also handle the filtering and creating of the String in a lazy fashion
- Remove the need to mantain a Queue to store the last access records
- Add benchmark

Result:

Huge decrease of performance overhead.

Before the patch:

Benchmark                                           (recordTimes)   Mode  Cnt     Score     Error  Units
ResourceLeakDetectorRecordBenchmark.record                      8  thrpt   20  4358.367 ± 116.419  ops/s
ResourceLeakDetectorRecordBenchmark.record                     16  thrpt   20  2306.027 ±  55.044  ops/s
ResourceLeakDetectorRecordBenchmark.recordWithHint              8  thrpt   20  4220.979 ± 114.046  ops/s
ResourceLeakDetectorRecordBenchmark.recordWithHint             16  thrpt   20  2250.734 ±  55.352  ops/s

With this patch:

Benchmark                                           (recordTimes)   Mode  Cnt      Score      Error  Units
ResourceLeakDetectorRecordBenchmark.record                      8  thrpt   20  71398.957 ± 2695.925  ops/s
ResourceLeakDetectorRecordBenchmark.record                     16  thrpt   20  38643.963 ± 1446.694  ops/s
ResourceLeakDetectorRecordBenchmark.recordWithHint              8  thrpt   20  71677.882 ± 2923.622  ops/s
ResourceLeakDetectorRecordBenchmark.recordWithHint             16  thrpt   20  38660.176 ± 1467.732  ops/s
2017-09-18 16:36:19 -07:00