Commit Graph

364 Commits

Author SHA1 Message Date
Norman Maurer
2dae6665f4 Fix caching for normal allocations (#10825)
Motivation:

https://github.com/netty/netty/pull/10267 introduced a change that reduced the fragmentation. Unfortunally it also introduced a regression when it comes to caching of normal allocations. This can have a negative performance impact depending on the allocation sizes.

Modifications:

- Fix algorithm to calculate the array size for normal allocation caches
- Correctly calculate indeox for normal caches
- Add unit test

Result:

Fixes https://github.com/netty/netty/issues/10805
2020-11-25 15:09:39 +01:00
Frédéric Brégier
3a58063fe7 Fix for performance regression on HttpPost RequestDecoder (#10623)
Fix issue #10508 where PARANOID mode slow down about 1000 times compared to ADVANCED.
Also fix a rare issue when internal buffer was growing over a limit, it was partially discarded
using `discardReadBytes()` which causes bad changes within previously discovered HttpData.

Reasons were:

Too many `readByte()` method calls while other ways exist (such as keep in memory the last scan position when trying to find a delimiter or using `bytesBefore(firstByte)` instead of looping externally).

Changes done:
- major change on way buffer are parsed: instead of read byte per byte until found delimiter, try to find the delimiter using `bytesBefore()` and keep the last unfound position to skeep already parsed parts (algorithms are the same but implementation of scan are different)
- Change the condition to discard read bytes when refCnt is at most 1.

Observations using Async-Profiler:
==================================

1) Without optimizations, most of the time (more than 95%) is through `readByte()` method within `loadDataMultipartStandard` method.
2) With using `bytesBefore(byte)` instead of `readByte()` to find various delimiter, the `loadDataMultipartStandard` method is going down to 19 to 33% depending on the test used. the `readByte()` method or equivalent `getByte(pos)` method are going down to 15% (from 95%).

Times are confirming those profiling:
- With optimizations, in SIMPLE mode about 82% better, in ADVANCED mode about 79% better and in PARANOID mode about 99% better (most of the duplicate read accesses are removed or make internally through `bytesBefore(byte)` method)

A benchmark is added to show the behavior of the various cases (one big item, such as File upload, and many items) and various level of detection (Disabled, Simple, Advanced, Paranoid). This benchmark is intend to alert if new implementations make too many differences (such as the previous version where about PARANOID gives about 1000 times slower than other levels, while it is now about at most 10 times).

Extract of Benchmark run:
=========================

Run complete. Total time: 00:13:27

Benchmark                                                                           Mode  Cnt  Score   Error   Units
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderBigAdvancedLevel   thrpt    6  2,248 ± 0,198 ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderBigDisabledLevel   thrpt    6  2,067 ± 1,219 ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderBigParanoidLevel   thrpt    6  1,109 ± 0,038 ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderBigSimpleLevel     thrpt    6  2,326 ± 0,314 ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderHighAdvancedLevel  thrpt    6  1,444 ± 0,226 ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderHighDisabledLevel  thrpt    6  1,462 ± 0,642 ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderHighParanoidLevel  thrpt    6  0,159 ± 0,003 ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderHighSimpleLevel    thrpt    6  1,522 ± 0,049 ops/ms
2020-11-19 08:01:05 +01:00
Norman Maurer
eeece4cfa5 Use http in xmlns URIs to make maven release plugin happy again (#10788)
Motivation:

https in xmlns URIs does not work and will let the maven release plugin fail:

```
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  1.779 s
[INFO] Finished at: 2020-11-10T07:45:21Z
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-release-plugin:2.5.3:prepare (default-cli) on project netty-parent: Execution default-cli of goal org.apache.maven.plugins:maven-release-plugin:2.5.3:prepare failed: The namespace xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" could not be added as a namespace to "project": The namespace prefix "xsi" collides with an additional namespace declared by the element -> [Help 1]
[ERROR]
```

See also https://issues.apache.org/jira/browse/HBASE-24014.

Modifications:

Use http for xmlns

Result:

Be able to use maven release plugin
2020-11-10 10:51:05 +01:00
Chris Vest
a6b749843f Use JUnit 5 for running all tests (#10764)
Motivation:
JUnit 5 is the new hotness. It's more expressive, extensible, and composable in many ways, and it's better able to run tests in parallel. But most importantly, it's able to directly run JUnit 4 tests.
This means we can update and start using JUnit 5 without touching any of our existing tests.
I'm also introducing a dependency on assertj-core, which is like hamcrest, but arguably has a nicer and more discoverable API.

Modification:
Add the JUnit 5 and assertj-core dependencies, without converting any tests at time time.

Result:
All our tests are now executed through the JUnit 5 Vintage Engine.
Also, the JUnit 5 test APIs are available, and any JUnit 5 tests that are added from now on will also be executed.
2020-11-04 10:21:03 +01:00
Artem Smotrakov
b8ae2a2af4 Enable nohttp check during the build (#10708)
Motivation:

HTTP is a plaintext protocol which means that someone may be able
to eavesdrop the data. To prevent this, HTTPS should be used whenever
possible. However, maintaining using https:// in all URLs may be
difficult. The nohttp tool can help here. The tool scans all the files
in a repository and reports where http:// is used.

Modifications:

- Added nohttp (via checkstyle) into the build process.
- Suppressed findings for the websites
  that don't support HTTPS or that are not reachable

Result:

- Prevent using HTTP in the future.
- Encourage users to use HTTPS when they follow the links they found in
  the code.
2020-10-23 15:26:25 +02:00
Francesco Nigro
4624b6309d Reduce DefaultAttributeMap lookup cost (#10530)
Motivation:

DefaultAttributeMap::attr has a blocking behaviour on lookup of an existing attribute:
it can be made non-blocking.

Modification:

Replace the existing fixed bucket table using a locked intrusive linked list
with an hand-rolled copy-on-write ordered single array

Result:
Non blocking behaviour for the lookup happy path
2020-10-02 21:19:03 +02:00
Chris Vest
1d7efbddd9 Fix compilation after forward port of #10368
Motivation: Code failed to compile because ByteBuf index marking has been removed.
Modification: Index marking wasn't really used anyway, so just set the relevant index to zero.
Result: Code compiles again.
2020-09-09 16:27:52 +02:00
Francesco Nigro
7f86f90646 Improve predictability of writeUtf8/writeAscii performance (#10368)
Motivation:

writeUtf8 can suffer from inlining issues and/or megamorphic call-sites on the hot path due to ByteBuf hierarchy

Modifications:

Duplicate and specialize the code paths to reduce the need of polymorphic calls

Result:

Performance are more stable in user code
2020-09-09 16:15:22 +02:00
Francesco Nigro
319a4bc3ba Reduce garbage on MQTT (#10509)
Reduce garbage on MQTT encoding

Motivation:

MQTT encoding and decoding is doing unnecessary object allocation in a number of places:
- MqttEncoder create many byte[] to encode Strings into UTF-8 bytes
- MqttProperties uses Integer keys instead of int
- Some enums valueOf create unnecessary arrays on the hot paths
- MqttDecoder was using unecessary Result<T>

Modification:

- ByteBufUtil::utf8Bytes and ByteBufUtil::reserveAndWriteUtf8 allows to perform the same operation GC-free
- MqttProperties uses a primitive key map
- Implemented GC free const table lookup/switch valueOf
- Use some bit-tricks to pack 2 ints into a single primitive long to store both result and numberOfBytesConsumed and use byte[].length to compute numberOfByteConsumed on fly. These changes allowed to save creating Result<T>.

Result:
Significantly less garbage produced in MQTT encoding/decoding
2020-09-04 18:31:53 +02:00
Francesco Nigro
0a8c9192e5 Improve MqttMessageType::valueOf cost (#10400)
Motivation:

MqttMessageType::valueOf has O(N) cost

Modifications:

MqttMessageType::valueOf uses a const lookup table

Result:

MqttMessageType::valueOf has O(1) cost
2020-08-31 10:32:48 +02:00
Linas Medžiūnas
abdcf102da Efficient BytBuf search algorithms (#9914) (#9955)
Motivation:

We have found out that ByteBufUtil.indexOf can be inefficient for substring search on
ByteBuf, both in terms of algorithm complexity (worst case O(needle.readableBytes *
haystack.readableBytes)), and in constant factor (esp. on Composite buffers).
With implementation of more performant search algorithms we have seen improvements on
the order of magnitude.

Modifications:

This change introduces three search algorithms:
1. Knuth Morris Pratt - classical textbook algorithm, a good default choice.
2. Bit mask based algorithm - stable performance on any input, but limited to maximum
search substring (the needle) length of 64 bytes.
3. Aho–Corasick - worse performance and higher memory consumption than [1] and [2], but
it supports multiple substring (the needles) search simultaneously, by inspecting every
byte of the haystack only once.

Each algorithm processes every byte of underlying buffer only once, they are implemented
as ByteProcessor.

Result:

Efficient search algorithms with linear time complexity available in Netty (I will share
benchmark results in a comment on a PR).
2020-04-15 10:26:53 +02:00
Dmitry Konstantinov
dc69c04434 Replace usage() with freeBytes() in thresholds within hot paths of PoolChunkList (#10141)
Motivation:
PoolChunk.usage() method has non-trivial computations. It is used currently in hot path methods invoked when an allocation and de-allocation are happened.
The idea is to replace usage() output comparison against percent thresholds by Chunk.freeBytes plain comparison against absolute thresholds. In such way the majority of computations from the threshold conditions are moved to init logic.

Modifications:
Replace PoolChunk.usage() conditions in PoolChunkList with equivalent conditions for PoolChunk.freeBytes()

Result:
Improve performance of allocation and de-allocation of ByteBuf from normal size cache pool
2020-03-31 22:11:42 +02:00
Norman Maurer
6a43807843
Use lambdas whenever possible (#9979)
Motivation:

We should update our code to use lamdas whenever possible

Modifications:

Use lambdas when possible

Result:

Cleanup code for Java8
2020-01-30 09:28:24 +01:00
Norman Maurer
9e29c39daa
Cleanup usage of Channel*Handler (#9959)
Motivation:

In next major version of netty users should use ChannelHandler everywhere. We should ensure we do the same

Modifications:

Replace usage of deprecated classes / interfaces with ChannelHandler

Result:

Use non-deprecated code
2020-01-20 17:47:17 -08:00
Francesco Nigro
1e4f0e6a09 Faster decodeHexNibble (#9896)
Motivation:

decodeHexNibble can be a lot faster using a lookup table

Modifications:

decodeHexNibble is made faster by using a lookup table

Result:

decodeHexNibble is faster
2019-12-23 21:16:44 +01:00
Anuraag Agrawal
ee206b6ba8 Separate out query string encoding for non-encoded strings. (#9887)
Motivation:

Currently, characters are appended to the encoded string char-by-char even when no encoding is needed. We can instead separate out codepath that appends the entire string in one go for better `StringBuilder` allocation performance.

Modification:

Only go into char-by-char loop when finding a character that requires encoding.

Result:

The results aren't so clear with noise on my hot laptop - the biggest impact is on long strings, both to reduce resizes of the buffer and also to reduce complexity of the loop. I don't think there's a significant downside though for the cases that hit the slow path.

After
```
Benchmark                                     Mode  Cnt   Score   Error   Units
QueryStringEncoderBenchmark.longAscii        thrpt    6   1.406 ± 0.069  ops/us
QueryStringEncoderBenchmark.longAsciiFirst   thrpt    6   0.046 ± 0.001  ops/us
QueryStringEncoderBenchmark.longUtf8         thrpt    6   0.046 ± 0.001  ops/us
QueryStringEncoderBenchmark.shortAscii       thrpt    6  15.781 ± 0.949  ops/us
QueryStringEncoderBenchmark.shortAsciiFirst  thrpt    6   3.171 ± 0.232  ops/us
QueryStringEncoderBenchmark.shortUtf8        thrpt    6   3.900 ± 0.667  ops/us
```

Before
```
Benchmark                                     Mode  Cnt   Score    Error   Units
QueryStringEncoderBenchmark.longAscii        thrpt    6   0.444 ±  0.072  ops/us
QueryStringEncoderBenchmark.longAsciiFirst   thrpt    6   0.043 ±  0.002  ops/us
QueryStringEncoderBenchmark.longUtf8         thrpt    6   0.047 ±  0.001  ops/us
QueryStringEncoderBenchmark.shortAscii       thrpt    6  16.503 ±  1.015  ops/us
QueryStringEncoderBenchmark.shortAsciiFirst  thrpt    6   3.316 ±  0.154  ops/us
QueryStringEncoderBenchmark.shortUtf8        thrpt    6   3.776 ±  0.956  ops/us
```
2019-12-20 08:51:26 +01:00
Anuraag Agrawal
0f42eb1ceb Use array to buffer decoded query instead of ByteBuffer. (#9886)
Motivation:

In Java, it is almost always at least slower to use `ByteBuffer` than `byte[]` without pooling or I/O. `QueryStringDecoder` can use `byte[]` with arguably simpler code.

Modification:

Replace `ByteBuffer` / `CharsetDecoder` with `byte[]` and `new String`

Result:

After
```
Benchmark                                   Mode  Cnt  Score   Error   Units
QueryStringDecoderBenchmark.noDecoding     thrpt    6  5.612 ± 2.639  ops/us
QueryStringDecoderBenchmark.onlyDecoding   thrpt    6  1.393 ± 0.067  ops/us
QueryStringDecoderBenchmark.mixedDecoding  thrpt    6  1.223 ± 0.048  ops/us
```

Before
```
Benchmark                                   Mode  Cnt  Score   Error   Units
QueryStringDecoderBenchmark.noDecoding     thrpt    6  6.123 ± 0.250  ops/us
QueryStringDecoderBenchmark.onlyDecoding   thrpt    6  0.922 ± 0.159  ops/us
QueryStringDecoderBenchmark.mixedDecoding  thrpt    6  1.032 ± 0.178  ops/us
```

I notice #6781 switched from an array to `ByteBuffer` but I can't find any motivation for that in the PR. Unit tests pass fine with an array and we get a reasonable speed bump.
2019-12-18 21:15:44 +01:00
Nick Hill
d370d48d4a Update to latest JMH version (#9787)
Motivation

JMH 1.22 was released recently, we might as well use the latest when
running benchmarks.

Summary of changes:
https://mail.openjdk.java.net/pipermail/jmh-dev/2019-November/002879.html

Modifications

Update jmh dependencies in microbench module from version 1.21 to 1.22.

Result

Benchmarks run using latest JMH
2019-11-19 11:28:36 +01:00
康智冬
1c69448e2e Fix typos in javadocs (#9527)
Motivation:

We should have correct docs without typos

Modification:

Fix typos and spelling

Result:

More correct docs
2019-10-09 15:25:41 +02:00
jingene
af614e4d6e Change the netty.io homepage scheme(http -> https) (#9344)
Motivation:

Netty homepage(netty.io) serves both "http" and "https".
It's recommended to use https than http.
Modification:

I changed from "http://netty.io" to "https://netty.io"
Result:

No effects.
2019-07-09 21:10:14 +02:00
Norman Maurer
c5a602b272 Increase maxHeaderListSize for HpackDecoderBenchmark to be able to be… (#9321)
Motivation:

The previous used maxHeaderListSize was too low which resulted in exceptions during the benchmark run:

```
io.netty.handler.codec.http2.Http2Exception: Header size exceeded max allowed size (8192)
	at io.netty.handler.codec.http2.Http2Exception.connectionError(Http2Exception.java:103)
	at io.netty.handler.codec.http2.Http2Exception.headerListSizeError(Http2Exception.java:188)
	at io.netty.handler.codec.http2.Http2CodecUtil.headerListSizeExceeded(Http2CodecUtil.java:231)
	at io.netty.handler.codec.http2.HpackDecoder$Http2HeadersSink.finish(HpackDecoder.java:545)
	at io.netty.handler.codec.http2.HpackDecoder.decode(HpackDecoder.java:132)
	at io.netty.handler.codec.http2.HpackDecoderBenchmark.decode(HpackDecoderBenchmark.java:85)
	at io.netty.handler.codec.http2.generated.HpackDecoderBenchmark_decode_jmhTest.decode_thrpt_jmhStub(HpackDecoderBenchmark_decode_jmhTest.java:120)
	at io.netty.handler.codec.http2.generated.HpackDecoderBenchmark_decode_jmhTest.decode_Throughput(HpackDecoderBenchmark_decode_jmhTest.java:83)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.openjdk.jmh.runner.BenchmarkHandler$BenchmarkTask.call(BenchmarkHandler.java:453)
	at org.openjdk.jmh.runner.BenchmarkHandler$BenchmarkTask.call(BenchmarkHandler.java:437)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:748)

```

Also we should ensure we only use ascii for header names.

Modifications:

Just use Integer.MAX_VALUE as limit

Result:

Be able to run benchmark without exceptions
2019-07-04 11:24:37 +02:00
Carl Mastrangelo
65d8ecc3a0 Use Table lookup for HPACK decoder (#9307)
Motivation:
Table based decoding is fast.

Modification:
Use table based decoding in HPACK decoder, inspired by
https://github.com/python-hyper/hpack/blob/master/hpack/huffman_table.py

This modifies the table to be based on integers, rather than 3-tuples of
bytes.  This is for two reasons:

1.  It's faster
2.  Using bytes makes the static intializer too big, and doesn't
compile.

Result:
Faster Huffman decoding.  This only seems to help the ascii case, the
other decoding is about the same.

Benchmarks:

```
Before:
Benchmark                     (limitToAscii)  (sensitive)  (size)   Mode  Cnt        Score       Error  Units
HpackDecoderBenchmark.decode            true         true   SMALL  thrpt   20   426293.636 ±  1444.843  ops/s
HpackDecoderBenchmark.decode            true         true  MEDIUM  thrpt   20    57843.738 ±   725.704  ops/s
HpackDecoderBenchmark.decode            true         true   LARGE  thrpt   20     3002.412 ±    16.998  ops/s
HpackDecoderBenchmark.decode            true        false   SMALL  thrpt   20   412339.400 ±  1128.394  ops/s
HpackDecoderBenchmark.decode            true        false  MEDIUM  thrpt   20    58226.870 ±   199.591  ops/s
HpackDecoderBenchmark.decode            true        false   LARGE  thrpt   20     3044.256 ±    10.675  ops/s
HpackDecoderBenchmark.decode           false         true   SMALL  thrpt   20  2082615.030 ±  5929.726  ops/s
HpackDecoderBenchmark.decode           false         true  MEDIUM  thrpt   10   571640.454 ± 26499.229  ops/s
HpackDecoderBenchmark.decode           false         true   LARGE  thrpt   20    92714.555 ±  2292.222  ops/s
HpackDecoderBenchmark.decode           false        false   SMALL  thrpt   20  1745872.421 ±  6788.840  ops/s
HpackDecoderBenchmark.decode           false        false  MEDIUM  thrpt   20   490420.323 ±  2455.431  ops/s
HpackDecoderBenchmark.decode           false        false   LARGE  thrpt   20    84536.200 ±   398.714  ops/s

After(bytes):
Benchmark                     (limitToAscii)  (sensitive)  (size)   Mode  Cnt        Score      Error  Units
HpackDecoderBenchmark.decode            true         true   SMALL  thrpt   20   472649.148 ± 7122.461  ops/s
HpackDecoderBenchmark.decode            true         true  MEDIUM  thrpt   20    66739.638 ±  341.607  ops/s
HpackDecoderBenchmark.decode            true         true   LARGE  thrpt   20     3139.773 ±   24.491  ops/s
HpackDecoderBenchmark.decode            true        false   SMALL  thrpt   20   466933.833 ± 4514.971  ops/s
HpackDecoderBenchmark.decode            true        false  MEDIUM  thrpt   20    66111.778 ±  568.326  ops/s
HpackDecoderBenchmark.decode            true        false   LARGE  thrpt   20     3143.619 ±    3.332  ops/s
HpackDecoderBenchmark.decode           false         true   SMALL  thrpt   20  2109995.177 ± 6203.143  ops/s
HpackDecoderBenchmark.decode           false         true  MEDIUM  thrpt   20   586026.055 ± 1578.550  ops/s
HpackDecoderBenchmark.decode           false        false   SMALL  thrpt   20  1775723.270 ± 4932.057  ops/s
HpackDecoderBenchmark.decode           false        false  MEDIUM  thrpt   20   493316.467 ± 1453.037  ops/s
HpackDecoderBenchmark.decode           false        false   LARGE  thrpt   10    85726.219 ±  402.573  ops/s

After(ints):
Benchmark                     (limitToAscii)  (sensitive)  (size)   Mode  Cnt        Score       Error  Units
HpackDecoderBenchmark.decode            true         true   SMALL  thrpt   20   615549.006 ±  5282.283  ops/s
HpackDecoderBenchmark.decode            true         true  MEDIUM  thrpt   20    86714.630 ±   654.489  ops/s
HpackDecoderBenchmark.decode            true         true   LARGE  thrpt   20     3984.439 ±    61.612  ops/s
HpackDecoderBenchmark.decode            true        false   SMALL  thrpt   20   602489.337 ±  5397.024  ops/s
HpackDecoderBenchmark.decode            true        false  MEDIUM  thrpt   20    88399.109 ±   241.115  ops/s
HpackDecoderBenchmark.decode            true        false   LARGE  thrpt   20     3875.729 ±   103.057  ops/s
HpackDecoderBenchmark.decode           false         true   SMALL  thrpt   20  2092165.454 ± 11918.859  ops/s
HpackDecoderBenchmark.decode           false         true  MEDIUM  thrpt   20   583465.437 ±  5452.115  ops/s
HpackDecoderBenchmark.decode           false         true   LARGE  thrpt   20    93290.061 ±   665.904  ops/s
HpackDecoderBenchmark.decode           false        false   SMALL  thrpt   20  1758402.495 ± 14677.438  ops/s
HpackDecoderBenchmark.decode           false        false  MEDIUM  thrpt   10   491598.099 ±  5029.698  ops/s
HpackDecoderBenchmark.decode           false        false   LARGE  thrpt   20    85834.290 ±   554.915  ops/s
```
2019-07-02 20:13:19 +02:00
jimin
78adeb5408 All override methods must be added @override (#9285)
Motivation:

Some methods that either override others or are implemented as part of implementation an interface did miss the `@Override` annotation

Modifications:

Add missing `@Override`s

Result:

Code cleanup
2019-06-27 13:52:06 +02:00
Alex Blewitt
e233407e01 Replace accumulation with blackhole.consume (#9275)
Motivation:

SpotJMHBugs reports that accumulating a value as a way of eliding dead code
elimination may be inadvisable, as discussed in
`JMHSample_34_SafeLooping::measureWrong_2`. Change the test so that it consumes
the response with `Blackhole::consume` instead.

Modifications:

- Replace addition of results with explicit `blackhole.consume()` call

Result:

Tests work as before, but with different benchmark numbers.
2019-06-25 21:46:26 +02:00
Francesco Nigro
c6114786ab Documented non-usage of BlackHole::consume on ByteBufAccessBenchmark (#9279)
Motivation:

Some JMH benchmarks need additional explanations to motivate
specific code choices.

Modifications:

Introduced comment to explai why calling BlackHole::consume
in a loop is not always the right choice for some benchmark.

Result:

The relevant method shows a comment that warn about changing
the code to introduce BlackHole::consume in the loop.
2019-06-25 14:53:12 +02:00
Alex Blewitt
99034a15b5 Return the result of the list.recycle() call (#9264)
Motivation:

Resolve the issue highlighted by SpotJMHBugs that the creation of the RecyclableArrayList may be elided by the JIT since the result isn't consumed or returned.

Modifications:

Return the result of `list.recycle()` so that the list isn't elided.

Result:

The JMH benchmark shows a change in performance indicating that the prior results of this may be unsound.
2019-06-22 07:21:51 +02:00
Carl Mastrangelo
f01278616a Properly debounce wakeups (#9191)
Motivation:
The wakeup logic in EpollEventLoop is overly complex

Modification:
* Simplify the race to wakeup the loop
* Dont let the event loop wake up itself (it's already awake!)
* Make event loop check if there are any more tasks after preparing to
sleep.  There is small window where the non-eventloop writers can issue
eventfd writes here, but that is okay.

Result:
Cleaner wakeup logic.

Benchmarks:

```
BEFORE
Benchmark                                   Mode  Cnt       Score      Error  Units
EpollSocketChannelBenchmark.executeMulti   thrpt   20  408381.411 ± 2857.498  ops/s
EpollSocketChannelBenchmark.executeSingle  thrpt   20  157022.360 ± 1240.573  ops/s
EpollSocketChannelBenchmark.pingPong       thrpt   20   60571.704 ±  331.125  ops/s

Benchmark                                   Mode  Cnt       Score      Error  Units
EpollSocketChannelBenchmark.executeMulti   thrpt   20  440546.953 ± 1652.823  ops/s
EpollSocketChannelBenchmark.executeSingle  thrpt   20  168114.751 ± 1176.609  ops/s
EpollSocketChannelBenchmark.pingPong       thrpt   20   61231.878 ±  520.108  ops/s
```
2019-06-04 05:27:15 -07:00
Nick Hill
23554e6997 Ensure "full" ownership of msgs passed to EmbeddedChannel.writeInbound() (#9058)
Motivation

Pipeline handlers are free to "take control" of input buffers if they have singular refcount - in particular to mutate their raw data if non-readonly via discarding of read bytes, etc.

However there are various places (primarily unit tests) where a wrapped byte-array buffer is passed in and the wrapped array is assumed not to change (used after the wrapped buffer is passed to EmbeddedChannel.writeInbound()). This invalid assumption could result in unexpected errors, such as those exposed by #8931.

Modifications

Anywhere that the data passed to writeInbound() might be used again, ensure that either:
- A copy is used rather than wrapping a shared byte array, or
- The buffer is otherwise protected from modification by making it read-only

For the tests, copying is preferred since it still allows the "mutating" optimizations to be exercised.

Results

Avoid possible errors when pipeline assumes it has full control of input buffer.
2019-05-22 12:35:03 +02:00
Francesco Nigro
635fc9eae0 The benchmark is not taking into account nanoTime granularity (#9033)
Motivation:

Results are just wrong for small delays.

Modifications:

Switching to AvarageTime avoid to rely on OS nanoTime granularity.

Result:

Uncontended low delay results are not reliable
2019-04-15 15:15:08 +02:00
Norman Maurer
0f34345347
Merge ChannelInboundHandler and ChannelOutboundHandler into ChannelHa… (#8957)
Motivation:

In 42742e233f we already added default methods to Channel*Handler and deprecated the Adapter classes to simplify the class hierarchy. With this change we go even further and merge everything into just ChannelHandler. This simplifies things even more in terms of class-hierarchy.

Modifications:

- Merge ChannelInboundHandler | ChannelOutboundHandler into ChannelHandler
- Adjust code to just use ChannelHandler
- Deprecate old interfaces.

Result:

Cleaner and simpler code in terms of class-hierarchy.
2019-03-28 09:28:27 +00:00
Norman Maurer
42742e233f
Deprecate ChannelInboundHandlerAdapter and ChannelOutboundHandlerAdapter (#8929)
Motivation:

As we now us java8 as minimum java version we can deprecate ChannelInboundHandlerAdapter / ChannelOutboundHandlerAdapter and just move the default implementations into the interfaces. This makes things a bit more flexible for the end-user and also simplifies the class-hierarchy.

Modifications:

- Mark ChannelInboundHandlerAdapter and ChannelOutboundHandlerAdapter as deprecated
- Add default implementations to ChannelInboundHandler / ChannelOutboundHandler
- Refactor our code to not use ChannelInboundHandlerAdapter / ChannelOutboundHandlerAdapter anymore

Result:

Cleanup class-hierarchy and make things a bit more flexible.
2019-03-13 09:46:10 +01:00
Norman Maurer
c6b372f517 Use maven plugin to prevent API/ABI breakage as part of build process (#8904)
Motivation:

Netty is very widely used which can lead to a lot of pain when we break API / ABI. We should make use japicmp-maven-plugin during the build to verify we do not introduce breakage by mistake.

Modifications:

- Add japicmp-maven-plugin to the build process
- Fix a method signature change in HttpProxyHandler that was flagged as a possible problem.

Result:

Ensure no API/ABI breakage accour between releases.
2019-03-01 19:48:29 +01:00
Nick Hill
35161ad174 Further reduce ensureAccessible() overhead (#8895)
Motivation:

This PR fixes some non-negligible overhead discovered in the ByteBuf
accessibility (non-zero refcount) checking. The cause turned out to be
mostly twofold:
- Unnecessary operations used to calculate the refcount from the "raw"
encoded int field value
- Call stack depths exceeding the default limit for inlining, in some
places (CompositeByteBuf in particular)

It's a follow-on from #8882 which uses the maxCapacity field for a
simpler non-negative check. The performance gap between these two
variants appears to be _mostly_ closed, but there's one exception which
may warrant further analysis.

Modifications:

- Replace ABB.internalRefCount() with ByteBuf.isAccessible(), the
default still checks for non-zero refCnt()
- Just test for parity of raw refCnt instead of converting to "real",
with fast-path for specific small values
- Make sure isAccessible() is delegated by derived/wrapper ByteBufs
- Use existing freed flag in CompositeByteBuf for faster isAccessible()
- Manually inline some calls in methods like CompositeByteBuf.setLong()
and AbstractReferenceCountedByteBuf.isAccessible() to reduce stack
depths (to ensure default inlining limit isn't hit)
- Add ByteBufAccessBenchmark which is an extension of
UnsafeByteBufBenchmark (maybe latter could now be removed)

Results:

Before:

Benchmark   (bufferType)  (checkAccessible)  (checkBounds)   Mode  Cnt
Score          Error  Units
readBatch         UNSAFE               true           true  thrpt   30
84524972.863 ±   518338.811  ops/s
readBatch   UNSAFE_SLICE               true           true  thrpt   30
38608795.037 ±   298176.974  ops/s
readBatch           HEAP               true           true  thrpt   30
80003697.649 ±   974674.119  ops/s
readBatch      COMPOSITE               true           true  thrpt   30
18495554.788 ±   108075.023  ops/s
setGetLong        UNSAFE               true           true  thrpt   30
247069881.578 ± 10839162.593  ops/s
setGetLong  UNSAFE_SLICE               true           true  thrpt   30
196355905.206 ±  1802420.990  ops/s
setGetLong          HEAP               true           true  thrpt   30
245686644.713 ± 11769311.527  ops/s
setGetLong     COMPOSITE               true           true  thrpt   30
83170940.687 ±   657524.123  ops/s
setLong           UNSAFE               true           true  thrpt   30
278940253.918 ±  1807265.259  ops/s
setLong     UNSAFE_SLICE               true           true  thrpt   30
202556738.764 ± 11887973.563  ops/s
setLong             HEAP               true           true  thrpt   30
280045958.053 ±  2719583.400  ops/s
setLong        COMPOSITE               true           true  thrpt   30
121299806.002 ±  2155084.707  ops/s


After:

Benchmark   (bufferType)  (checkAccessible)  (checkBounds)   Mode  Cnt
Score          Error  Units
readBatch         UNSAFE               true           true  thrpt   30
101641801.035 ±  3950050.059  ops/s
readBatch   UNSAFE_SLICE               true           true  thrpt   30
84395902.846 ±  4339579.057  ops/s
readBatch           HEAP               true           true  thrpt   30
100179060.207 ±  3222487.287  ops/s
readBatch      COMPOSITE               true           true  thrpt   30
42288494.472 ±   294919.633  ops/s
setGetLong        UNSAFE               true           true  thrpt   30
304530755.027 ±  6574163.899  ops/s
setGetLong  UNSAFE_SLICE               true           true  thrpt   30
212028547.645 ± 14277828.768  ops/s
setGetLong          HEAP               true           true  thrpt   30
309335422.609 ±  2272150.415  ops/s
setGetLong     COMPOSITE               true           true  thrpt   30
160383609.236 ±   966484.033  ops/s
setLong           UNSAFE               true           true  thrpt   30
298055969.747 ±  7437449.627  ops/s
setLong     UNSAFE_SLICE               true           true  thrpt   30
223784178.650 ±  9869750.095  ops/s
setLong             HEAP               true           true  thrpt   30
302543263.328 ±  8140104.706  ops/s
setLong        COMPOSITE               true           true  thrpt   30
157083673.285 ±  3528779.522  ops/s

There's also a similar knock-on improvement to other benchmarks (e.g.
HPACK encoding/decoding) as shown in #8882.

For sanity I did a final comparison of the "fast path" tweak using one
of the HPACK benchmarks:

(rawCnt & 1) == 0:

Benchmark                     (limitToAscii)  (sensitive)  (size)   Mode
Cnt      Score     Error  Units
HpackDecoderBenchmark.decode            true         true  MEDIUM  thrpt
30  50914.479 ± 940.114  ops/s


rawCnt == 2 || rawCnt == 4 || rawCnt == 6 || rawCnt == 8 ||  (rawCnt &
1) == 0:

Benchmark                     (limitToAscii)  (sensitive)  (size)   Mode
Cnt      Score      Error  Units
HpackDecoderBenchmark.decode            true         true  MEDIUM  thrpt
30  60036.425 ± 1478.196  ops/s
2019-02-28 20:41:16 +01:00
Dmitriy Dumanskiy
116f72db8d Legacy properties removed (#8839)
Motivation:

We can remove some properties for which we introduced replacements.

Modifications:

io.netty.buffer.bytebuf.checkAccessible, io.netty.leakDetectionLevel, org.jboss.netty.tryUnsafe properties removed

Result:

Code cleanup
2019-02-04 13:56:15 +01:00
田欧
e8efcd82a8 migrate java8: use requireNonNull (#8840)
Motivation:

We can just use Objects.requireNonNull(...) as a replacement for ObjectUtil.checkNotNull(....)

Modifications:

- Use Objects.requireNonNull(...)

Result:

Less code to maintain.
2019-02-04 10:32:25 +01:00
Dmitriy Dumanskiy
2e433889b2 Improve DateFormatter parsing performance (#8821)
Motivation:

Just was looking through code and found 1 interesting place DateFormatter.tryParseMonth that was not very effective, so I decided to optimize it a bit.

Modification:

Changed DateFormatter.tryParseMonth method. Instead of invocation regionMatch() for every month - compare chars one by one.

Result:

DateFormatter.parseHttpDate method performance improved from ~3% to ~15%.

Benchmark                                                                (DATE_STRING)   Mode  Cnt        Score       Error  Units
DateFormatter2Benchmark.parseHttpHeaderDateFormatter     Sun, 27 Jan 2016 19:18:46 GMT  thrpt    6  4142781.221 ± 82155.002  ops/s
DateFormatter2Benchmark.parseHttpHeaderDateFormatter     Sun, 27 Dec 2016 19:18:46 GMT  thrpt    6  3781810.558 ± 38679.061  ops/s
DateFormatter2Benchmark.parseHttpHeaderDateFormatterNew  Sun, 27 Jan 2016 19:18:46 GMT  thrpt    6  4372569.705 ± 30257.537  ops/s
DateFormatter2Benchmark.parseHttpHeaderDateFormatterNew  Sun, 27 Dec 2016 19:18:46 GMT  thrpt    6  4339785.100 ± 57542.660  ops/s
2019-02-04 10:04:35 +01:00
Norman Maurer
e3846c54f6
Remove ChannelHandler.exceptionCaught(...) as it should only exist in… (#8822)
Motivation:

ChannelHandler.exceptionCaught(...) was marked as @deprecated as it should only exist in inbound handlers.

Modifications:

Remove ChannelHandler.exceptionCaught(...) and adjust code / tests.

Result:

Fixes https://github.com/netty/netty/issues/8527
2019-01-31 20:29:17 +01:00
Dmitriy Dumanskiy
67b23ab056 Remove HttpHeaderDateFormat class (#8807)
Motivation:

HttpHeaderDateFormat was replaced with DateFormatter many days ago and now can be easily removed.

Modification:

Remove deprecated class and related test / benchmark

Result:

Less code to maintain
2019-01-31 07:22:20 +01:00
田欧
6222101924 migrate java8: use lambda and method reference (#8781)
Motivation:

We can use lambdas now as we use Java8.

Modification:

use lambda function for all package, #8751 only migrate transport package.

Result:

Code cleanup.
2019-01-29 14:06:05 +01:00
Norman Maurer
310f31b392
Update to new checkstyle plugin (#8777)
Motivation:

We need to update to a new checkstyle plugin to allow the usage of lambdas.

Modifications:

- Update to new plugin version.
- Fix checkstyle problems.

Result:

Be able to use checkstyle plugin which supports new Java syntax.
2019-01-24 16:24:19 +01:00
Norman Maurer
3d6e6136a9
Decouple EventLoop details from the IO handling for each transport to… (#8680)
* Decouble EventLoop details from the IO handling for each transport to allow easy re-use of code and customization

Motiviation:

As today extending EventLoop implementations to add custom logic / metrics / instrumentations is only possible in a very limited way if at all. This is due the fact that most implementations are final or even package-private. That said even if these would be public there are the ability to do something useful with these is very limited as the IO processing and task processing are very tightly coupled. All of the mentioned things are a big pain point in netty 4.x and need improvement.

Modifications:

This changeset decoubled the IO processing logic from the task processing logic for the main transport (NIO, Epoll, KQueue) by introducing the concept of an IoHandler. The IoHandler itself is responsible to wait for IO readiness and process these IO events. The execution of the IoHandler itself is done by the SingleThreadEventLoop as part of its EventLoop processing. This allows to use the same EventLoopGroup (MultiThreadEventLoupGroup) for all the mentioned transports by just specify a different IoHandlerFactory during construction.

Beside this core API change this changeset also allows to easily extend SingleThreadEventExecutor / SingleThreadEventLoop to add custom logic to it which then can be reused by all the transports. The ideas are very similar to what is provided by ScheduledThreadPoolExecutor (that is part of the JDK). This allows for example things like:

  * Adding instrumentation / metrics:
    * how many Channels are registered on an SingleThreadEventLoop
    * how many Channels were handled during the IO processing in an EventLoop run
    * how many task were handled during the last EventLoop / EventExecutor run
    * how many outstanding tasks we have
    ...
    ...
  * Implementing custom strategies for choosing the next EventExecutor / EventLoop to use based on these metrics.
  * Use different Promise / Future / ScheduledFuture implementations
  * decorate Runnable / Callables when submitted to the EventExecutor / EventLoop

As a lot of functionalities are folded into the MultiThreadEventLoopGroup and SingleThreadEventLoopGroup this changeset also removes:

  * AbstractEventLoop
  * AbstractEventLoopGroup
  * EventExecutorChooser
  * EventExecutorChooserFactory
  * DefaultEventLoopGroup
  * DefaultEventExecutor
  * DefaultEventExecutorGroup

Result:

Fixes https://github.com/netty/netty/issues/8514 .
2019-01-23 08:32:05 +01:00
Dmitriy Dumanskiy
7b92ff2500 Java 8 migration. Remove ThreadLocalProvider and inline java.util.concurrent.ThreadLocalRandom.current() where necessary. (#8762)
Motivation:

Custom Netty ThreadLocalRandom and ThreadLocalRandomProvider classes are no longer needed and can be removed.

Modification:

Remove own ThreadLocalRandom

Result:

Less code to maintain
2019-01-22 20:14:28 +01:00
田欧
9d62deeb6f Java 8 migration: Use diamond operator (#8749)
Motivation:

We can use the diamond operator these days.

Modification:

Use diamond operator whenever possible.

Result:

More modern code and less boiler-plate.
2019-01-22 16:07:26 +01:00
Norman Maurer
8fdf373557
Skip execution of Channel*Handler method if annotated with @Skip and just use the next handler in the pipeline. (#8723)
Motivation:

Invoking ChannelHandlers is not free and can result in some overhead when the ChannelPipeline becomes very long. This is especially true if most handlers will just forward the call to the next handler in the pipeline. When the user extends Channel*HandlerAdapter we can easily detect if can just skip the handler and invoke the next handler in the pipeline directly. This reduce the overhead of dispatch but also reduce the call-stack in many cases.

Modifications:

Detect if we can skip the handler when walking the pipeline.

Result:

Reduce overhead for long pipelines.

Benchmark                                       (extraHandlers)   Mode  Cnt       Score      Error  Units
DefaultChannelPipelineBenchmark.propagateEventOld             4  thrpt   10  267313.031 ± 9131.140  ops/s
DefaultChannelPipelineBenchmark.propagateEvent                4  thrpt   10  824825.673 ± 12727.594  ops/s
2019-01-22 08:58:58 +01:00
Norman Maurer
1fe931b6e2
Make it possible to use a wrapped EventLoop with a Channel (#8677)
Motiviation:

Because of how we implemented the registration / deregistration of an EventLoop it was not possible to wrap an EventLoop implementation and use it with a Channel.

Modification:

- Introduce EventLoop.Unsafe which is responsible for the actual registration.
- Move validation of EventLoop / Channel combo to the EventLoop
- Add unit test that verifies that wrapping works

Result:

Be able to wrap an EventLoop and so add some extra functionality.
2019-01-17 09:17:51 +01:00
Norman Maurer
c10ccc5dec
Tighten contract between Channel and EventLoop by require the EventLoop on Channel construction. (#8587)
Motivation:

At the moment it’s possible to have a Channel in Netty that is not registered / assigned to an EventLoop until register(...) is called. This is suboptimal as if the Channel is not registered it is also not possible to do anything useful with a ChannelFuture that belongs to the Channel. We should think about if we should have the EventLoop as a constructor argument of a Channel and have the register / deregister method only have the effect of add a Channel to KQueue/Epoll/... It is also currently possible to deregister a Channel from one EventLoop and register it with another EventLoop. This operation defeats the threading model assumptions that are wide spread in Netty, and requires careful user level coordination to pull off without any concurrency issues. It is not a commonly used feature in practice, may be better handled by other means (e.g. client side load balancing), and therefore we propose removing this feature.

Modifications:

- Change all Channel implementations to require an EventLoop for construction ( + an EventLoopGroup for all ServerChannel implementations)
- Remove all register(...) methods from EventLoopGroup
- Add ChannelOutboundInvoker.register(...) which now basically means we want to register on the EventLoop for IO.
- Change ChannelUnsafe.register(...) to not take an EventLoop as parameter (as the EventLoop is supplied on custruction).
- Change ChannelFactory to take an EventLoop to create new Channels and introduce ServerChannelFactory which takes an EventLoop and one EventLoopGroup to create new ServerChannel instances.
- Add ServerChannel.childEventLoopGroup()
- Ensure all operations on the accepted Channel is done in the EventLoop of the Channel in ServerBootstrap
- Change unit tests for new behaviour

Result:

A Channel always has an EventLoop assigned which will never change during its life-time. This ensures we are always be able to call any operation on the Channel once constructed (unit the EventLoop is shutdown). This also simplifies the logic in DefaultChannelPipeline a lot as we can always call handlerAdded / handlerRemoved directly without the need to wait for register() to happen.

Also note that its still possible to deregister a Channel and register it again. It's just not possible anymore to move from one EventLoop to another (which was not really safe anyway).

Fixes https://github.com/netty/netty/issues/8513.
2019-01-14 20:11:13 +01:00
Norman Maurer
d9a6cf341c
Remove support for marking reader and writerIndex in ByteBuf to reduce overhead and complexity. (#8636)
Motivation:

ByteBuf supports “marker indexes”. The intended use case for these is if a speculative operation (e.g. decode) is in process the user can “mark” and interface and refer to it later if the operation isn’t successful (e.g. not enough data). However this is rarely used in practice,
requires extra memory to maintain, and introduces complexity in the state management for derived/pooled buffer initialization, resizing, and other operations which may modify reader/writer indexes.

Modifications:

Remove support for marking and adjust testcases / code.

Result:

Fixes https://github.com/netty/netty/issues/8535.
2018-12-11 14:00:49 +01:00
Francesco Nigro
4c2b11633a Adding an execute burst cost benchmark for Netty executors (#8594)
Motivation:

Netty executors doesn't have yet any means to compare with each others
nor to compare with the j.u.c. executors

Modifications:

A new benchmark measuring execute burst cost is being added

Result:

It's now possible to compare some of Netty executors with each others
and with the j.u.c. executors
2018-12-04 15:46:48 +01:00
Norman Maurer
2c78dde749 Update version number to start working on Netty 5 2018-11-20 15:49:57 +01:00
Nick Hill
10539f4dc7 Streamline CompositeByteBuf internals (#8437)
Motivation:

CompositeByteBuf is a powerful and versatile abstraction, allowing for
manipulation of large data without copying bytes. There is still a
non-negligible cost to reading/writing however relative to "singular"
ByteBufs, and this can be mostly eliminated with some rework of the
internals.

My use case is message modification/transformation while zero-copy
proxying. For example replacing a string within a large message with one
of a different length

Modifications:

- No longer slice added buffers and unwrap added slices
   - Components store target buf offset relative to position in
composite buf
   - Less allocations, object footprint, pointer indirection, offset
arithmetic
- Use Component[] rather than ArrayList<Component>
   - Avoid pointer indirection and duplicate bounds check, more
efficient backing array growth
   - Facilitates optimization when doing bulk-inserts - inserting n
ByteBufs behind m is now O(m + n) instead of O(mn)
- Avoid unnecessary casting and method call indirection via superclass
- Eliminate some duplicate range/ref checks via non-checking versions of
toComponentIndex and findComponent
- Add simple fast-path for toComponentIndex(0); add racy cache of
last-accessed Component to findComponent(int)
- Override forEachByte0(...) and forEachByteDesc0(...) methods
- Make use of RecyclableArrayList in nioBuffers(int, int) (in line with
FasterCompositeByteBuf impl)
- Modify addComponents0(boolean,int,Iterable) to use the Iterable
directly rather than copy to an array first (and possibly to an
ArrayList before that)
- Optimize addComponents0(boolean,int,ByteBuf[],int) to not perform
repeated array insertions and avoid second loop for offset updates
- Simplify other logic in various places, in particular the general
pattern used where a sub-range is iterated over
- Add benchmarks to demonstrate some improvements

While refactoring I also came across a couple of clear bugs. They are
fixed in these changes but I will open another PR with unit tests and
fixes to the current version.

Result:

Much faster creation, manipulation, and access; many fewer allocations
and smaller footprint. Benchmark results to follow.
2018-11-03 10:37:07 +01:00