Commit Graph

369 Commits

Author SHA1 Message Date
Frédéric Brégier
d421ae10d7 Minimize get byte multipart and fix buffer reuse (#11001)
Motivation:
- Underlying buffer usages might be erroneous when releasing them internaly
in HttpPostMultipartRequestDecoder.

2 bugs occurs:
1) Final File upload seems not to be of the right size.
2) Memory, even in Disk mode, is increasing continuously, while it shouldn't.

- Method `getByte(position)` is too often called within the current implementation
of the HttpPostMultipartRequestDecoder.
This implies too much activities which is visible when PARANOID mode is active.
This is also true in standard mode.

Apply the same fix on buffer from HttpPostMultipartRequestDecoder to HttpPostStandardRequestDecoder
made previously.

Finally in order to ensure we do not rewrite already decoded HttpData when decoding
next ones within multipart, we must ensure the buffers are copied and not a retained slice.

Modification:
- Add some tests to check consistency for HttpPostMultipartRequestDecoder.
Add a package protected method for testing purpose only.

- Use the `bytesBefore(...)` method instead of `getByte(pos)` in order to limit the external
access to the underlying buffer by retrieving iteratively the beginning of a correct start
position.
It is used to find both LF/CRLF and delimiter.
2 methods in HttpPostBodyUtil were created for that.

The undecodedChunk is copied when adding a chunk to a DataMultipart is loaded.
The same buffer is also rewritten in order to release the copied memory part.

Result:

Just for note, for both Memory or Disk or Mixed mode factories, the release has to be done as:

      for (InterfaceHttpData httpData: decoder.getBodyHttpDatas()) {
          httpData.release();
          factory.removeHttpDataFromClean(request, httpData);
      }
      factory.cleanAllHttpData();
      decoder.destroy();

The memory used is minimal in Disk or Mixed mode. In Memory mode, a big file is still
in memory but not more in the undecodedChunk but its own buffer (copied).

In terms of benchmarking, the results are:

Original code Benchmark                                                             Mode  Cnt  Score    Error   Units
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderBigAdvancedLevel   thrpt    6  0,152 ±  0,100  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderBigDisabledLevel   thrpt    6  0,543 ±  0,218  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderBigParanoidLevel   thrpt    6  0,001 ±  0,001  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderBigSimpleLevel     thrpt    6  0,615 ±  0,070  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderHighAdvancedLevel  thrpt    6  0,114 ±  0,063  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderHighDisabledLevel  thrpt    6  0,664 ±  0,034  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderHighParanoidLevel  thrpt    6  0,001 ±  0,001  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderHighSimpleLevel    thrpt    6  0,620 ±  0,140  ops/ms

New code Benchmark                                                                  Mode  Cnt  Score   Error   Units
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderBigAdvancedLevel   thrpt    6  4,037 ± 0,358  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderBigDisabledLevel   thrpt    6  4,226 ± 0,471  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderBigParanoidLevel   thrpt    6  0,875 ± 0,029  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderBigSimpleLevel     thrpt    6  4,346 ± 0,275  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderHighAdvancedLevel  thrpt    6  2,044 ± 0,020  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderHighDisabledLevel  thrpt    6  2,278 ± 0,159  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderHighParanoidLevel  thrpt    6  0,174 ± 0,004  ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderHighSimpleLevel    thrpt    6  2,370 ± 0,065  ops/ms

In short, using big file transfers, this is about 7 times faster with new code, while
using high number of HttpData, this is about 4 times faster with new code when using Simple Level.
When using Paranoid Level, using big file transfers, this is about 800 times faster with new code, while
using high number of HttpData, this is about 170 times faster with new code.
2021-02-26 14:26:24 +01:00
Francesco Nigro
517da28740 DecodeHexBenchmark is too branch-predictor friendly (#9942)
Motivation:

DecodeHexBenchmark needs to be less branch-predictor friendly
to mimic the "real" behaviour while decoding

Modifications:

DecodeHexBenchmark uses a larger sets of inputs, picking them at
random on each iteration and the benchmarked method is made !inlineable

Result:

DecodeHexBenchmark is more trusty while showing the performance
difference between different decoding methods
2021-02-05 15:28:25 +01:00
Francesco Nigro
5337d3eeb4 Implement SWAR indexOf byte search (#10737)
Motivation:

Faster indexOf

Modification:

Create generic SWAR indexOf that any ByteBuf implementation can use

Result:

Fixes #10731
2021-01-15 15:09:50 +01:00
Oleksii Kachaiev
85ec20ecbd Improve performance of HPACK static table lookup (#10840)
Motivation:

HPACK static table is organized in a way that fields with the same
name are sequential. Which means when doing sequential scan we can
short-circuit scan on name mismatch.

Modifications:

* `HpackStaticTable.getIndexIndensitive` returns -1 on name mismatch
rather than keep scanning.
* `HpackStaticTable` statically defined max position in the array
where name duplication is possible (after the given index there's
no need to check for other fields with the same name)
* Benchmark for different lookup patterns

Result:

Better HPACK static table lookup performance.

Co-authored-by: Norman Maurer <norman_maurer@apple.com>
2020-12-21 15:34:31 +01:00
Andrey Mizurov
2877eef5d5 Provide ability to extend StompSubframeEncoder and improve full stomp frame encoding (allocate one buffer for full frame considering the size of the headers) (#10778)
Motivation:

At the moment `StompSubframeEncoder` encode a frame only to `ByteBuf` it is not convenient if further we need to convert it to another type of message,  e.g. `WebSocketFrame`. Also, if we send a full frame, it splits into two headers and a content what makes it difficult to convert it in the next handler.

Modification:

Introduce additional converter methods e.g. (`Object protected convertFullFrame(StompFrame original, ByteBuf encoded`)...) for extending encoder functionality and allocate only one `ByteBuf` for full stomp frame. Change headers size calculation, previously used only 256 bytes that reallocate a new buffer each time when headers size more than this threshold. Add `StompEncoderBenchmark`.

Result:

Improved  `StompSubframeEncoder` fro extensions.

Previous version benchmark
```
Benchmark                              (contentLength)  (headersType)  (pooledAllocator)   Mode  Cnt        Score        Error  Units
StompEncoderBenchmark.writeStompFrame                0            ONE               true  thrpt   10  4432132.884 ± 178923.436  ops/s
StompEncoderBenchmark.writeStompFrame                0            ONE              false  thrpt   10  1281122.756 ±  52484.174  ops/s
StompEncoderBenchmark.writeStompFrame                0          THREE               true  thrpt   10  2980897.937 ± 130253.049  ops/s
StompEncoderBenchmark.writeStompFrame                0          THREE              false  thrpt   10  1116883.574 ±  35471.482  ops/s
StompEncoderBenchmark.writeStompFrame                0          SEVEN               true  thrpt   10  1988012.159 ±  74352.450  ops/s
StompEncoderBenchmark.writeStompFrame                0          SEVEN              false  thrpt   10   881772.343 ±  94633.870  ops/s
StompEncoderBenchmark.writeStompFrame                0         ELEVEN               true  thrpt   10  1048125.919 ± 151053.902  ops/s
StompEncoderBenchmark.writeStompFrame                0         ELEVEN              false  thrpt   10   429900.066 ±  47956.661  ops/s
StompEncoderBenchmark.writeStompFrame                0         TWENTY               true  thrpt   10   660584.122 ± 104973.439  ops/s
StompEncoderBenchmark.writeStompFrame                0         TWENTY              false  thrpt   10   278255.488 ±  20143.708  ops/s
StompEncoderBenchmark.writeStompFrame               10            ONE               true  thrpt   10  4251498.549 ± 625050.979  ops/s
StompEncoderBenchmark.writeStompFrame               10            ONE              false  thrpt   10  1214006.861 ±  60421.601  ops/s
StompEncoderBenchmark.writeStompFrame               10          THREE               true  thrpt   10  3117736.486 ± 173613.974  ops/s
StompEncoderBenchmark.writeStompFrame               10          THREE              false  thrpt   10  1046605.891 ±  94428.064  ops/s
StompEncoderBenchmark.writeStompFrame               10          SEVEN               true  thrpt   10  2006986.881 ± 108456.748  ops/s
StompEncoderBenchmark.writeStompFrame               10          SEVEN              false  thrpt   10   877983.112 ±  82919.387  ops/s
StompEncoderBenchmark.writeStompFrame               10         ELEVEN               true  thrpt   10  1132844.437 ±  84578.571  ops/s
StompEncoderBenchmark.writeStompFrame               10         ELEVEN              false  thrpt   10   429334.649 ±  35403.161  ops/s
StompEncoderBenchmark.writeStompFrame               10         TWENTY               true  thrpt   10   657093.390 ±  48092.947  ops/s
StompEncoderBenchmark.writeStompFrame               10         TWENTY              false  thrpt   10   252140.876 ±  37337.255  ops/s
StompEncoderBenchmark.writeStompFrame              100            ONE               true  thrpt   10  4720507.067 ± 100993.908  ops/s
StompEncoderBenchmark.writeStompFrame              100            ONE              false  thrpt   10  1266182.925 ±  85888.413  ops/s
StompEncoderBenchmark.writeStompFrame              100          THREE               true  thrpt   10  2898746.621 ± 452579.753  ops/s
StompEncoderBenchmark.writeStompFrame              100          THREE              false  thrpt   10  1019555.288 ±  65640.507  ops/s
StompEncoderBenchmark.writeStompFrame              100          SEVEN               true  thrpt   10  2259187.459 ±  20025.989  ops/s
StompEncoderBenchmark.writeStompFrame              100          SEVEN              false  thrpt   10   896405.412 ±  53750.148  ops/s
StompEncoderBenchmark.writeStompFrame              100         ELEVEN               true  thrpt   10  1110670.772 ± 107650.327  ops/s
StompEncoderBenchmark.writeStompFrame              100         ELEVEN              false  thrpt   10   445187.398 ±  28845.959  ops/s
StompEncoderBenchmark.writeStompFrame              100         TWENTY               true  thrpt   10   611506.846 ±  25304.240  ops/s
StompEncoderBenchmark.writeStompFrame              100         TWENTY              false  thrpt   10   247687.007 ±  43471.578  ops/s
StompEncoderBenchmark.writeStompFrame             1000            ONE               true  thrpt   10  4140949.576 ± 270274.087  ops/s
StompEncoderBenchmark.writeStompFrame             1000            ONE              false  thrpt   10  1154515.598 ± 134413.876  ops/s
StompEncoderBenchmark.writeStompFrame             1000          THREE               true  thrpt   10  3349996.875 ± 162309.889  ops/s
StompEncoderBenchmark.writeStompFrame             1000          THREE              false  thrpt   10  1141040.562 ±   5895.693  ops/s
StompEncoderBenchmark.writeStompFrame             1000          SEVEN               true  thrpt   10  2184632.248 ±   8957.833  ops/s
StompEncoderBenchmark.writeStompFrame             1000          SEVEN              false  thrpt   10   959545.704 ±   5835.161  ops/s
StompEncoderBenchmark.writeStompFrame             1000         ELEVEN               true  thrpt   10  1081113.327 ±   3957.527  ops/s
StompEncoderBenchmark.writeStompFrame             1000         ELEVEN              false  thrpt   10   467524.660 ±   1383.236  ops/s
StompEncoderBenchmark.writeStompFrame             1000         TWENTY               true  thrpt   10   568411.797 ± 108712.493  ops/s
StompEncoderBenchmark.writeStompFrame             1000         TWENTY              false  thrpt   10   260764.231 ±  43149.129  ops/s
StompEncoderBenchmark.writeStompFrame            10000            ONE               true  thrpt   10  4369787.147 ± 619367.939  ops/s
StompEncoderBenchmark.writeStompFrame            10000            ONE              false  thrpt   10  1246782.845 ±  47468.764  ops/s
StompEncoderBenchmark.writeStompFrame            10000          THREE               true  thrpt   10  3333328.810 ± 253061.481  ops/s
StompEncoderBenchmark.writeStompFrame            10000          THREE              false  thrpt   10  1108278.988 ±  81905.149  ops/s
StompEncoderBenchmark.writeStompFrame            10000          SEVEN               true  thrpt   10  2062961.266 ± 247096.284  ops/s
StompEncoderBenchmark.writeStompFrame            10000          SEVEN              false  thrpt   10   925199.985 ±  36734.594  ops/s
StompEncoderBenchmark.writeStompFrame            10000         ELEVEN               true  thrpt   10  1223240.034 ±  58833.801  ops/s
StompEncoderBenchmark.writeStompFrame            10000         ELEVEN              false  thrpt   10   460864.117 ±   2361.459  ops/s
StompEncoderBenchmark.writeStompFrame            10000         TWENTY               true  thrpt   10   655864.762 ±  35237.335  ops/s
StompEncoderBenchmark.writeStompFrame            10000         TWENTY              false  thrpt   10   286388.865 ±   1002.460  ops/s
```
A new version benchmark
```
Benchmark                              (contentLength)  (headersType)  (pooledAllocator)   Mode  Cnt        Score        Error  Units
StompEncoderBenchmark.writeStompFrame                0            ONE               true  thrpt   10  4366110.018 ± 420377.867  ops/s
StompEncoderBenchmark.writeStompFrame                0            ONE              false  thrpt   10  1289437.153 ± 215271.656  ops/s
StompEncoderBenchmark.writeStompFrame                0          THREE               true  thrpt   10  2818791.355 ± 218894.471  ops/s
StompEncoderBenchmark.writeStompFrame                0          THREE              false  thrpt   10  1040151.615 ±  75352.695  ops/s
StompEncoderBenchmark.writeStompFrame                0          SEVEN               true  thrpt   10  1842144.001 ±  94668.864  ops/s
StompEncoderBenchmark.writeStompFrame                0          SEVEN              false  thrpt   10   916742.825 ±  65467.820  ops/s
StompEncoderBenchmark.writeStompFrame                0         ELEVEN               true  thrpt   10  1310454.012 ± 100747.490  ops/s
StompEncoderBenchmark.writeStompFrame                0         ELEVEN              false  thrpt   10   679934.001 ±  82168.249  ops/s
StompEncoderBenchmark.writeStompFrame                0         TWENTY               true  thrpt   10   746867.549 ±  68373.269  ops/s
StompEncoderBenchmark.writeStompFrame                0         TWENTY              false  thrpt   10   483316.314 ±  50978.009  ops/s
StompEncoderBenchmark.writeStompFrame               10            ONE               true  thrpt   10  4791698.722 ± 263890.510  ops/s
StompEncoderBenchmark.writeStompFrame               10            ONE              false  thrpt   10  1289877.116 ± 128677.185  ops/s
StompEncoderBenchmark.writeStompFrame               10          THREE               true  thrpt   10  2984662.187 ± 395567.524  ops/s
StompEncoderBenchmark.writeStompFrame               10          THREE              false  thrpt   10  1079028.782 ±  43548.555  ops/s
StompEncoderBenchmark.writeStompFrame               10          SEVEN               true  thrpt   10  1806763.709 ±  59162.209  ops/s
StompEncoderBenchmark.writeStompFrame               10          SEVEN              false  thrpt   10   935274.980 ±  22064.148  ops/s
StompEncoderBenchmark.writeStompFrame               10         ELEVEN               true  thrpt   10  1284172.151 ± 119068.047  ops/s
StompEncoderBenchmark.writeStompFrame               10         ELEVEN              false  thrpt   10   687174.498 ±  30270.916  ops/s
StompEncoderBenchmark.writeStompFrame               10         TWENTY               true  thrpt   10   803843.483 ±  29106.133  ops/s
StompEncoderBenchmark.writeStompFrame               10         TWENTY              false  thrpt   10   502134.552 ±  23653.215  ops/s
StompEncoderBenchmark.writeStompFrame              100            ONE               true  thrpt   10  4337438.694 ± 378524.452  ops/s
StompEncoderBenchmark.writeStompFrame              100            ONE              false  thrpt   10  1289174.213 ±  50640.853  ops/s
StompEncoderBenchmark.writeStompFrame              100          THREE               true  thrpt   10  3232767.156 ± 311934.194  ops/s
StompEncoderBenchmark.writeStompFrame              100          THREE              false  thrpt   10  1115247.028 ±  15683.477  ops/s
StompEncoderBenchmark.writeStompFrame              100          SEVEN               true  thrpt   10  2213147.232 ±  86326.187  ops/s
StompEncoderBenchmark.writeStompFrame              100          SEVEN              false  thrpt   10   901120.188 ±  71344.491  ops/s
StompEncoderBenchmark.writeStompFrame              100         ELEVEN               true  thrpt   10  1238317.714 ±  68148.477  ops/s
StompEncoderBenchmark.writeStompFrame              100         ELEVEN              false  thrpt   10   671336.339 ±  72735.337  ops/s
StompEncoderBenchmark.writeStompFrame              100         TWENTY               true  thrpt   10   754565.791 ±  28574.382  ops/s
StompEncoderBenchmark.writeStompFrame              100         TWENTY              false  thrpt   10   498939.383 ±  38146.118  ops/s
StompEncoderBenchmark.writeStompFrame             1000            ONE               true  thrpt   10  3722594.471 ± 515861.000  ops/s
StompEncoderBenchmark.writeStompFrame             1000            ONE              false  thrpt   10  1265629.633 ±  84113.347  ops/s
StompEncoderBenchmark.writeStompFrame             1000          THREE               true  thrpt   10  2829696.349 ± 172520.267  ops/s
StompEncoderBenchmark.writeStompFrame             1000          THREE              false  thrpt   10  1111454.609 ±  26275.913  ops/s
StompEncoderBenchmark.writeStompFrame             1000          SEVEN               true  thrpt   10  1901506.449 ±  37701.353  ops/s
StompEncoderBenchmark.writeStompFrame             1000          SEVEN              false  thrpt   10   912528.888 ±  46221.215  ops/s
StompEncoderBenchmark.writeStompFrame             1000         ELEVEN               true  thrpt   10  1299674.123 ±  21889.002  ops/s
StompEncoderBenchmark.writeStompFrame             1000         ELEVEN              false  thrpt   10   724527.644 ±   2757.370  ops/s
StompEncoderBenchmark.writeStompFrame             1000         TWENTY               true  thrpt   10   811389.799 ±   2606.626  ops/s
StompEncoderBenchmark.writeStompFrame             1000         TWENTY              false  thrpt   10   504955.449 ±   6737.804  ops/s
StompEncoderBenchmark.writeStompFrame            10000            ONE               true  thrpt   10  3837912.649 ± 380742.919  ops/s
StompEncoderBenchmark.writeStompFrame            10000            ONE              false  thrpt   10  1375544.306 ±   3157.068  ops/s
StompEncoderBenchmark.writeStompFrame            10000          THREE               true  thrpt   10  3224743.448 ± 297369.719  ops/s
StompEncoderBenchmark.writeStompFrame            10000          THREE              false  thrpt   10  1125772.007 ±   4051.498  ops/s
StompEncoderBenchmark.writeStompFrame            10000          SEVEN               true  thrpt   10  2127352.136 ± 106787.777  ops/s
StompEncoderBenchmark.writeStompFrame            10000          SEVEN              false  thrpt   10   934848.418 ±   4564.147  ops/s
StompEncoderBenchmark.writeStompFrame            10000         ELEVEN               true  thrpt   10  1379672.772 ±   8778.640  ops/s
StompEncoderBenchmark.writeStompFrame            10000         ELEVEN              false  thrpt   10   723169.459 ±   2317.767  ops/s
StompEncoderBenchmark.writeStompFrame            10000         TWENTY               true  thrpt   10   802275.113 ±   4155.137  ops/s
StompEncoderBenchmark.writeStompFrame            10000         TWENTY              false  thrpt   10   517604.265 ±   3398.384  ops/s
```
For headers over 256 bytes we get a speedup.
2020-12-07 09:59:17 +01:00
Norman Maurer
2dae6665f4 Fix caching for normal allocations (#10825)
Motivation:

https://github.com/netty/netty/pull/10267 introduced a change that reduced the fragmentation. Unfortunally it also introduced a regression when it comes to caching of normal allocations. This can have a negative performance impact depending on the allocation sizes.

Modifications:

- Fix algorithm to calculate the array size for normal allocation caches
- Correctly calculate indeox for normal caches
- Add unit test

Result:

Fixes https://github.com/netty/netty/issues/10805
2020-11-25 15:09:39 +01:00
Frédéric Brégier
3a58063fe7 Fix for performance regression on HttpPost RequestDecoder (#10623)
Fix issue #10508 where PARANOID mode slow down about 1000 times compared to ADVANCED.
Also fix a rare issue when internal buffer was growing over a limit, it was partially discarded
using `discardReadBytes()` which causes bad changes within previously discovered HttpData.

Reasons were:

Too many `readByte()` method calls while other ways exist (such as keep in memory the last scan position when trying to find a delimiter or using `bytesBefore(firstByte)` instead of looping externally).

Changes done:
- major change on way buffer are parsed: instead of read byte per byte until found delimiter, try to find the delimiter using `bytesBefore()` and keep the last unfound position to skeep already parsed parts (algorithms are the same but implementation of scan are different)
- Change the condition to discard read bytes when refCnt is at most 1.

Observations using Async-Profiler:
==================================

1) Without optimizations, most of the time (more than 95%) is through `readByte()` method within `loadDataMultipartStandard` method.
2) With using `bytesBefore(byte)` instead of `readByte()` to find various delimiter, the `loadDataMultipartStandard` method is going down to 19 to 33% depending on the test used. the `readByte()` method or equivalent `getByte(pos)` method are going down to 15% (from 95%).

Times are confirming those profiling:
- With optimizations, in SIMPLE mode about 82% better, in ADVANCED mode about 79% better and in PARANOID mode about 99% better (most of the duplicate read accesses are removed or make internally through `bytesBefore(byte)` method)

A benchmark is added to show the behavior of the various cases (one big item, such as File upload, and many items) and various level of detection (Disabled, Simple, Advanced, Paranoid). This benchmark is intend to alert if new implementations make too many differences (such as the previous version where about PARANOID gives about 1000 times slower than other levels, while it is now about at most 10 times).

Extract of Benchmark run:
=========================

Run complete. Total time: 00:13:27

Benchmark                                                                           Mode  Cnt  Score   Error   Units
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderBigAdvancedLevel   thrpt    6  2,248 ± 0,198 ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderBigDisabledLevel   thrpt    6  2,067 ± 1,219 ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderBigParanoidLevel   thrpt    6  1,109 ± 0,038 ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderBigSimpleLevel     thrpt    6  2,326 ± 0,314 ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderHighAdvancedLevel  thrpt    6  1,444 ± 0,226 ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderHighDisabledLevel  thrpt    6  1,462 ± 0,642 ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderHighParanoidLevel  thrpt    6  0,159 ± 0,003 ops/ms
HttpPostMultipartRequestDecoderBenchmark.multipartRequestDecoderHighSimpleLevel    thrpt    6  1,522 ± 0,049 ops/ms
2020-11-19 08:01:05 +01:00
Norman Maurer
eeece4cfa5 Use http in xmlns URIs to make maven release plugin happy again (#10788)
Motivation:

https in xmlns URIs does not work and will let the maven release plugin fail:

```
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  1.779 s
[INFO] Finished at: 2020-11-10T07:45:21Z
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-release-plugin:2.5.3:prepare (default-cli) on project netty-parent: Execution default-cli of goal org.apache.maven.plugins:maven-release-plugin:2.5.3:prepare failed: The namespace xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" could not be added as a namespace to "project": The namespace prefix "xsi" collides with an additional namespace declared by the element -> [Help 1]
[ERROR]
```

See also https://issues.apache.org/jira/browse/HBASE-24014.

Modifications:

Use http for xmlns

Result:

Be able to use maven release plugin
2020-11-10 10:51:05 +01:00
Chris Vest
a6b749843f Use JUnit 5 for running all tests (#10764)
Motivation:
JUnit 5 is the new hotness. It's more expressive, extensible, and composable in many ways, and it's better able to run tests in parallel. But most importantly, it's able to directly run JUnit 4 tests.
This means we can update and start using JUnit 5 without touching any of our existing tests.
I'm also introducing a dependency on assertj-core, which is like hamcrest, but arguably has a nicer and more discoverable API.

Modification:
Add the JUnit 5 and assertj-core dependencies, without converting any tests at time time.

Result:
All our tests are now executed through the JUnit 5 Vintage Engine.
Also, the JUnit 5 test APIs are available, and any JUnit 5 tests that are added from now on will also be executed.
2020-11-04 10:21:03 +01:00
Artem Smotrakov
b8ae2a2af4 Enable nohttp check during the build (#10708)
Motivation:

HTTP is a plaintext protocol which means that someone may be able
to eavesdrop the data. To prevent this, HTTPS should be used whenever
possible. However, maintaining using https:// in all URLs may be
difficult. The nohttp tool can help here. The tool scans all the files
in a repository and reports where http:// is used.

Modifications:

- Added nohttp (via checkstyle) into the build process.
- Suppressed findings for the websites
  that don't support HTTPS or that are not reachable

Result:

- Prevent using HTTP in the future.
- Encourage users to use HTTPS when they follow the links they found in
  the code.
2020-10-23 15:26:25 +02:00
Francesco Nigro
4624b6309d Reduce DefaultAttributeMap lookup cost (#10530)
Motivation:

DefaultAttributeMap::attr has a blocking behaviour on lookup of an existing attribute:
it can be made non-blocking.

Modification:

Replace the existing fixed bucket table using a locked intrusive linked list
with an hand-rolled copy-on-write ordered single array

Result:
Non blocking behaviour for the lookup happy path
2020-10-02 21:19:03 +02:00
Chris Vest
1d7efbddd9 Fix compilation after forward port of #10368
Motivation: Code failed to compile because ByteBuf index marking has been removed.
Modification: Index marking wasn't really used anyway, so just set the relevant index to zero.
Result: Code compiles again.
2020-09-09 16:27:52 +02:00
Francesco Nigro
7f86f90646 Improve predictability of writeUtf8/writeAscii performance (#10368)
Motivation:

writeUtf8 can suffer from inlining issues and/or megamorphic call-sites on the hot path due to ByteBuf hierarchy

Modifications:

Duplicate and specialize the code paths to reduce the need of polymorphic calls

Result:

Performance are more stable in user code
2020-09-09 16:15:22 +02:00
Francesco Nigro
319a4bc3ba Reduce garbage on MQTT (#10509)
Reduce garbage on MQTT encoding

Motivation:

MQTT encoding and decoding is doing unnecessary object allocation in a number of places:
- MqttEncoder create many byte[] to encode Strings into UTF-8 bytes
- MqttProperties uses Integer keys instead of int
- Some enums valueOf create unnecessary arrays on the hot paths
- MqttDecoder was using unecessary Result<T>

Modification:

- ByteBufUtil::utf8Bytes and ByteBufUtil::reserveAndWriteUtf8 allows to perform the same operation GC-free
- MqttProperties uses a primitive key map
- Implemented GC free const table lookup/switch valueOf
- Use some bit-tricks to pack 2 ints into a single primitive long to store both result and numberOfBytesConsumed and use byte[].length to compute numberOfByteConsumed on fly. These changes allowed to save creating Result<T>.

Result:
Significantly less garbage produced in MQTT encoding/decoding
2020-09-04 18:31:53 +02:00
Francesco Nigro
0a8c9192e5 Improve MqttMessageType::valueOf cost (#10400)
Motivation:

MqttMessageType::valueOf has O(N) cost

Modifications:

MqttMessageType::valueOf uses a const lookup table

Result:

MqttMessageType::valueOf has O(1) cost
2020-08-31 10:32:48 +02:00
Linas Medžiūnas
abdcf102da Efficient BytBuf search algorithms (#9914) (#9955)
Motivation:

We have found out that ByteBufUtil.indexOf can be inefficient for substring search on
ByteBuf, both in terms of algorithm complexity (worst case O(needle.readableBytes *
haystack.readableBytes)), and in constant factor (esp. on Composite buffers).
With implementation of more performant search algorithms we have seen improvements on
the order of magnitude.

Modifications:

This change introduces three search algorithms:
1. Knuth Morris Pratt - classical textbook algorithm, a good default choice.
2. Bit mask based algorithm - stable performance on any input, but limited to maximum
search substring (the needle) length of 64 bytes.
3. Aho–Corasick - worse performance and higher memory consumption than [1] and [2], but
it supports multiple substring (the needles) search simultaneously, by inspecting every
byte of the haystack only once.

Each algorithm processes every byte of underlying buffer only once, they are implemented
as ByteProcessor.

Result:

Efficient search algorithms with linear time complexity available in Netty (I will share
benchmark results in a comment on a PR).
2020-04-15 10:26:53 +02:00
Dmitry Konstantinov
dc69c04434 Replace usage() with freeBytes() in thresholds within hot paths of PoolChunkList (#10141)
Motivation:
PoolChunk.usage() method has non-trivial computations. It is used currently in hot path methods invoked when an allocation and de-allocation are happened.
The idea is to replace usage() output comparison against percent thresholds by Chunk.freeBytes plain comparison against absolute thresholds. In such way the majority of computations from the threshold conditions are moved to init logic.

Modifications:
Replace PoolChunk.usage() conditions in PoolChunkList with equivalent conditions for PoolChunk.freeBytes()

Result:
Improve performance of allocation and de-allocation of ByteBuf from normal size cache pool
2020-03-31 22:11:42 +02:00
Norman Maurer
6a43807843
Use lambdas whenever possible (#9979)
Motivation:

We should update our code to use lamdas whenever possible

Modifications:

Use lambdas when possible

Result:

Cleanup code for Java8
2020-01-30 09:28:24 +01:00
Norman Maurer
9e29c39daa
Cleanup usage of Channel*Handler (#9959)
Motivation:

In next major version of netty users should use ChannelHandler everywhere. We should ensure we do the same

Modifications:

Replace usage of deprecated classes / interfaces with ChannelHandler

Result:

Use non-deprecated code
2020-01-20 17:47:17 -08:00
Francesco Nigro
1e4f0e6a09 Faster decodeHexNibble (#9896)
Motivation:

decodeHexNibble can be a lot faster using a lookup table

Modifications:

decodeHexNibble is made faster by using a lookup table

Result:

decodeHexNibble is faster
2019-12-23 21:16:44 +01:00
Anuraag Agrawal
ee206b6ba8 Separate out query string encoding for non-encoded strings. (#9887)
Motivation:

Currently, characters are appended to the encoded string char-by-char even when no encoding is needed. We can instead separate out codepath that appends the entire string in one go for better `StringBuilder` allocation performance.

Modification:

Only go into char-by-char loop when finding a character that requires encoding.

Result:

The results aren't so clear with noise on my hot laptop - the biggest impact is on long strings, both to reduce resizes of the buffer and also to reduce complexity of the loop. I don't think there's a significant downside though for the cases that hit the slow path.

After
```
Benchmark                                     Mode  Cnt   Score   Error   Units
QueryStringEncoderBenchmark.longAscii        thrpt    6   1.406 ± 0.069  ops/us
QueryStringEncoderBenchmark.longAsciiFirst   thrpt    6   0.046 ± 0.001  ops/us
QueryStringEncoderBenchmark.longUtf8         thrpt    6   0.046 ± 0.001  ops/us
QueryStringEncoderBenchmark.shortAscii       thrpt    6  15.781 ± 0.949  ops/us
QueryStringEncoderBenchmark.shortAsciiFirst  thrpt    6   3.171 ± 0.232  ops/us
QueryStringEncoderBenchmark.shortUtf8        thrpt    6   3.900 ± 0.667  ops/us
```

Before
```
Benchmark                                     Mode  Cnt   Score    Error   Units
QueryStringEncoderBenchmark.longAscii        thrpt    6   0.444 ±  0.072  ops/us
QueryStringEncoderBenchmark.longAsciiFirst   thrpt    6   0.043 ±  0.002  ops/us
QueryStringEncoderBenchmark.longUtf8         thrpt    6   0.047 ±  0.001  ops/us
QueryStringEncoderBenchmark.shortAscii       thrpt    6  16.503 ±  1.015  ops/us
QueryStringEncoderBenchmark.shortAsciiFirst  thrpt    6   3.316 ±  0.154  ops/us
QueryStringEncoderBenchmark.shortUtf8        thrpt    6   3.776 ±  0.956  ops/us
```
2019-12-20 08:51:26 +01:00
Anuraag Agrawal
0f42eb1ceb Use array to buffer decoded query instead of ByteBuffer. (#9886)
Motivation:

In Java, it is almost always at least slower to use `ByteBuffer` than `byte[]` without pooling or I/O. `QueryStringDecoder` can use `byte[]` with arguably simpler code.

Modification:

Replace `ByteBuffer` / `CharsetDecoder` with `byte[]` and `new String`

Result:

After
```
Benchmark                                   Mode  Cnt  Score   Error   Units
QueryStringDecoderBenchmark.noDecoding     thrpt    6  5.612 ± 2.639  ops/us
QueryStringDecoderBenchmark.onlyDecoding   thrpt    6  1.393 ± 0.067  ops/us
QueryStringDecoderBenchmark.mixedDecoding  thrpt    6  1.223 ± 0.048  ops/us
```

Before
```
Benchmark                                   Mode  Cnt  Score   Error   Units
QueryStringDecoderBenchmark.noDecoding     thrpt    6  6.123 ± 0.250  ops/us
QueryStringDecoderBenchmark.onlyDecoding   thrpt    6  0.922 ± 0.159  ops/us
QueryStringDecoderBenchmark.mixedDecoding  thrpt    6  1.032 ± 0.178  ops/us
```

I notice #6781 switched from an array to `ByteBuffer` but I can't find any motivation for that in the PR. Unit tests pass fine with an array and we get a reasonable speed bump.
2019-12-18 21:15:44 +01:00
Nick Hill
d370d48d4a Update to latest JMH version (#9787)
Motivation

JMH 1.22 was released recently, we might as well use the latest when
running benchmarks.

Summary of changes:
https://mail.openjdk.java.net/pipermail/jmh-dev/2019-November/002879.html

Modifications

Update jmh dependencies in microbench module from version 1.21 to 1.22.

Result

Benchmarks run using latest JMH
2019-11-19 11:28:36 +01:00
康智冬
1c69448e2e Fix typos in javadocs (#9527)
Motivation:

We should have correct docs without typos

Modification:

Fix typos and spelling

Result:

More correct docs
2019-10-09 15:25:41 +02:00
jingene
af614e4d6e Change the netty.io homepage scheme(http -> https) (#9344)
Motivation:

Netty homepage(netty.io) serves both "http" and "https".
It's recommended to use https than http.
Modification:

I changed from "http://netty.io" to "https://netty.io"
Result:

No effects.
2019-07-09 21:10:14 +02:00
Norman Maurer
c5a602b272 Increase maxHeaderListSize for HpackDecoderBenchmark to be able to be… (#9321)
Motivation:

The previous used maxHeaderListSize was too low which resulted in exceptions during the benchmark run:

```
io.netty.handler.codec.http2.Http2Exception: Header size exceeded max allowed size (8192)
	at io.netty.handler.codec.http2.Http2Exception.connectionError(Http2Exception.java:103)
	at io.netty.handler.codec.http2.Http2Exception.headerListSizeError(Http2Exception.java:188)
	at io.netty.handler.codec.http2.Http2CodecUtil.headerListSizeExceeded(Http2CodecUtil.java:231)
	at io.netty.handler.codec.http2.HpackDecoder$Http2HeadersSink.finish(HpackDecoder.java:545)
	at io.netty.handler.codec.http2.HpackDecoder.decode(HpackDecoder.java:132)
	at io.netty.handler.codec.http2.HpackDecoderBenchmark.decode(HpackDecoderBenchmark.java:85)
	at io.netty.handler.codec.http2.generated.HpackDecoderBenchmark_decode_jmhTest.decode_thrpt_jmhStub(HpackDecoderBenchmark_decode_jmhTest.java:120)
	at io.netty.handler.codec.http2.generated.HpackDecoderBenchmark_decode_jmhTest.decode_Throughput(HpackDecoderBenchmark_decode_jmhTest.java:83)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.openjdk.jmh.runner.BenchmarkHandler$BenchmarkTask.call(BenchmarkHandler.java:453)
	at org.openjdk.jmh.runner.BenchmarkHandler$BenchmarkTask.call(BenchmarkHandler.java:437)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:748)

```

Also we should ensure we only use ascii for header names.

Modifications:

Just use Integer.MAX_VALUE as limit

Result:

Be able to run benchmark without exceptions
2019-07-04 11:24:37 +02:00
Carl Mastrangelo
65d8ecc3a0 Use Table lookup for HPACK decoder (#9307)
Motivation:
Table based decoding is fast.

Modification:
Use table based decoding in HPACK decoder, inspired by
https://github.com/python-hyper/hpack/blob/master/hpack/huffman_table.py

This modifies the table to be based on integers, rather than 3-tuples of
bytes.  This is for two reasons:

1.  It's faster
2.  Using bytes makes the static intializer too big, and doesn't
compile.

Result:
Faster Huffman decoding.  This only seems to help the ascii case, the
other decoding is about the same.

Benchmarks:

```
Before:
Benchmark                     (limitToAscii)  (sensitive)  (size)   Mode  Cnt        Score       Error  Units
HpackDecoderBenchmark.decode            true         true   SMALL  thrpt   20   426293.636 ±  1444.843  ops/s
HpackDecoderBenchmark.decode            true         true  MEDIUM  thrpt   20    57843.738 ±   725.704  ops/s
HpackDecoderBenchmark.decode            true         true   LARGE  thrpt   20     3002.412 ±    16.998  ops/s
HpackDecoderBenchmark.decode            true        false   SMALL  thrpt   20   412339.400 ±  1128.394  ops/s
HpackDecoderBenchmark.decode            true        false  MEDIUM  thrpt   20    58226.870 ±   199.591  ops/s
HpackDecoderBenchmark.decode            true        false   LARGE  thrpt   20     3044.256 ±    10.675  ops/s
HpackDecoderBenchmark.decode           false         true   SMALL  thrpt   20  2082615.030 ±  5929.726  ops/s
HpackDecoderBenchmark.decode           false         true  MEDIUM  thrpt   10   571640.454 ± 26499.229  ops/s
HpackDecoderBenchmark.decode           false         true   LARGE  thrpt   20    92714.555 ±  2292.222  ops/s
HpackDecoderBenchmark.decode           false        false   SMALL  thrpt   20  1745872.421 ±  6788.840  ops/s
HpackDecoderBenchmark.decode           false        false  MEDIUM  thrpt   20   490420.323 ±  2455.431  ops/s
HpackDecoderBenchmark.decode           false        false   LARGE  thrpt   20    84536.200 ±   398.714  ops/s

After(bytes):
Benchmark                     (limitToAscii)  (sensitive)  (size)   Mode  Cnt        Score      Error  Units
HpackDecoderBenchmark.decode            true         true   SMALL  thrpt   20   472649.148 ± 7122.461  ops/s
HpackDecoderBenchmark.decode            true         true  MEDIUM  thrpt   20    66739.638 ±  341.607  ops/s
HpackDecoderBenchmark.decode            true         true   LARGE  thrpt   20     3139.773 ±   24.491  ops/s
HpackDecoderBenchmark.decode            true        false   SMALL  thrpt   20   466933.833 ± 4514.971  ops/s
HpackDecoderBenchmark.decode            true        false  MEDIUM  thrpt   20    66111.778 ±  568.326  ops/s
HpackDecoderBenchmark.decode            true        false   LARGE  thrpt   20     3143.619 ±    3.332  ops/s
HpackDecoderBenchmark.decode           false         true   SMALL  thrpt   20  2109995.177 ± 6203.143  ops/s
HpackDecoderBenchmark.decode           false         true  MEDIUM  thrpt   20   586026.055 ± 1578.550  ops/s
HpackDecoderBenchmark.decode           false        false   SMALL  thrpt   20  1775723.270 ± 4932.057  ops/s
HpackDecoderBenchmark.decode           false        false  MEDIUM  thrpt   20   493316.467 ± 1453.037  ops/s
HpackDecoderBenchmark.decode           false        false   LARGE  thrpt   10    85726.219 ±  402.573  ops/s

After(ints):
Benchmark                     (limitToAscii)  (sensitive)  (size)   Mode  Cnt        Score       Error  Units
HpackDecoderBenchmark.decode            true         true   SMALL  thrpt   20   615549.006 ±  5282.283  ops/s
HpackDecoderBenchmark.decode            true         true  MEDIUM  thrpt   20    86714.630 ±   654.489  ops/s
HpackDecoderBenchmark.decode            true         true   LARGE  thrpt   20     3984.439 ±    61.612  ops/s
HpackDecoderBenchmark.decode            true        false   SMALL  thrpt   20   602489.337 ±  5397.024  ops/s
HpackDecoderBenchmark.decode            true        false  MEDIUM  thrpt   20    88399.109 ±   241.115  ops/s
HpackDecoderBenchmark.decode            true        false   LARGE  thrpt   20     3875.729 ±   103.057  ops/s
HpackDecoderBenchmark.decode           false         true   SMALL  thrpt   20  2092165.454 ± 11918.859  ops/s
HpackDecoderBenchmark.decode           false         true  MEDIUM  thrpt   20   583465.437 ±  5452.115  ops/s
HpackDecoderBenchmark.decode           false         true   LARGE  thrpt   20    93290.061 ±   665.904  ops/s
HpackDecoderBenchmark.decode           false        false   SMALL  thrpt   20  1758402.495 ± 14677.438  ops/s
HpackDecoderBenchmark.decode           false        false  MEDIUM  thrpt   10   491598.099 ±  5029.698  ops/s
HpackDecoderBenchmark.decode           false        false   LARGE  thrpt   20    85834.290 ±   554.915  ops/s
```
2019-07-02 20:13:19 +02:00
jimin
78adeb5408 All override methods must be added @override (#9285)
Motivation:

Some methods that either override others or are implemented as part of implementation an interface did miss the `@Override` annotation

Modifications:

Add missing `@Override`s

Result:

Code cleanup
2019-06-27 13:52:06 +02:00
Alex Blewitt
e233407e01 Replace accumulation with blackhole.consume (#9275)
Motivation:

SpotJMHBugs reports that accumulating a value as a way of eliding dead code
elimination may be inadvisable, as discussed in
`JMHSample_34_SafeLooping::measureWrong_2`. Change the test so that it consumes
the response with `Blackhole::consume` instead.

Modifications:

- Replace addition of results with explicit `blackhole.consume()` call

Result:

Tests work as before, but with different benchmark numbers.
2019-06-25 21:46:26 +02:00
Francesco Nigro
c6114786ab Documented non-usage of BlackHole::consume on ByteBufAccessBenchmark (#9279)
Motivation:

Some JMH benchmarks need additional explanations to motivate
specific code choices.

Modifications:

Introduced comment to explai why calling BlackHole::consume
in a loop is not always the right choice for some benchmark.

Result:

The relevant method shows a comment that warn about changing
the code to introduce BlackHole::consume in the loop.
2019-06-25 14:53:12 +02:00
Alex Blewitt
99034a15b5 Return the result of the list.recycle() call (#9264)
Motivation:

Resolve the issue highlighted by SpotJMHBugs that the creation of the RecyclableArrayList may be elided by the JIT since the result isn't consumed or returned.

Modifications:

Return the result of `list.recycle()` so that the list isn't elided.

Result:

The JMH benchmark shows a change in performance indicating that the prior results of this may be unsound.
2019-06-22 07:21:51 +02:00
Carl Mastrangelo
f01278616a Properly debounce wakeups (#9191)
Motivation:
The wakeup logic in EpollEventLoop is overly complex

Modification:
* Simplify the race to wakeup the loop
* Dont let the event loop wake up itself (it's already awake!)
* Make event loop check if there are any more tasks after preparing to
sleep.  There is small window where the non-eventloop writers can issue
eventfd writes here, but that is okay.

Result:
Cleaner wakeup logic.

Benchmarks:

```
BEFORE
Benchmark                                   Mode  Cnt       Score      Error  Units
EpollSocketChannelBenchmark.executeMulti   thrpt   20  408381.411 ± 2857.498  ops/s
EpollSocketChannelBenchmark.executeSingle  thrpt   20  157022.360 ± 1240.573  ops/s
EpollSocketChannelBenchmark.pingPong       thrpt   20   60571.704 ±  331.125  ops/s

Benchmark                                   Mode  Cnt       Score      Error  Units
EpollSocketChannelBenchmark.executeMulti   thrpt   20  440546.953 ± 1652.823  ops/s
EpollSocketChannelBenchmark.executeSingle  thrpt   20  168114.751 ± 1176.609  ops/s
EpollSocketChannelBenchmark.pingPong       thrpt   20   61231.878 ±  520.108  ops/s
```
2019-06-04 05:27:15 -07:00
Nick Hill
23554e6997 Ensure "full" ownership of msgs passed to EmbeddedChannel.writeInbound() (#9058)
Motivation

Pipeline handlers are free to "take control" of input buffers if they have singular refcount - in particular to mutate their raw data if non-readonly via discarding of read bytes, etc.

However there are various places (primarily unit tests) where a wrapped byte-array buffer is passed in and the wrapped array is assumed not to change (used after the wrapped buffer is passed to EmbeddedChannel.writeInbound()). This invalid assumption could result in unexpected errors, such as those exposed by #8931.

Modifications

Anywhere that the data passed to writeInbound() might be used again, ensure that either:
- A copy is used rather than wrapping a shared byte array, or
- The buffer is otherwise protected from modification by making it read-only

For the tests, copying is preferred since it still allows the "mutating" optimizations to be exercised.

Results

Avoid possible errors when pipeline assumes it has full control of input buffer.
2019-05-22 12:35:03 +02:00
Francesco Nigro
635fc9eae0 The benchmark is not taking into account nanoTime granularity (#9033)
Motivation:

Results are just wrong for small delays.

Modifications:

Switching to AvarageTime avoid to rely on OS nanoTime granularity.

Result:

Uncontended low delay results are not reliable
2019-04-15 15:15:08 +02:00
Norman Maurer
0f34345347
Merge ChannelInboundHandler and ChannelOutboundHandler into ChannelHa… (#8957)
Motivation:

In 42742e233f we already added default methods to Channel*Handler and deprecated the Adapter classes to simplify the class hierarchy. With this change we go even further and merge everything into just ChannelHandler. This simplifies things even more in terms of class-hierarchy.

Modifications:

- Merge ChannelInboundHandler | ChannelOutboundHandler into ChannelHandler
- Adjust code to just use ChannelHandler
- Deprecate old interfaces.

Result:

Cleaner and simpler code in terms of class-hierarchy.
2019-03-28 09:28:27 +00:00
Norman Maurer
42742e233f
Deprecate ChannelInboundHandlerAdapter and ChannelOutboundHandlerAdapter (#8929)
Motivation:

As we now us java8 as minimum java version we can deprecate ChannelInboundHandlerAdapter / ChannelOutboundHandlerAdapter and just move the default implementations into the interfaces. This makes things a bit more flexible for the end-user and also simplifies the class-hierarchy.

Modifications:

- Mark ChannelInboundHandlerAdapter and ChannelOutboundHandlerAdapter as deprecated
- Add default implementations to ChannelInboundHandler / ChannelOutboundHandler
- Refactor our code to not use ChannelInboundHandlerAdapter / ChannelOutboundHandlerAdapter anymore

Result:

Cleanup class-hierarchy and make things a bit more flexible.
2019-03-13 09:46:10 +01:00
Norman Maurer
c6b372f517 Use maven plugin to prevent API/ABI breakage as part of build process (#8904)
Motivation:

Netty is very widely used which can lead to a lot of pain when we break API / ABI. We should make use japicmp-maven-plugin during the build to verify we do not introduce breakage by mistake.

Modifications:

- Add japicmp-maven-plugin to the build process
- Fix a method signature change in HttpProxyHandler that was flagged as a possible problem.

Result:

Ensure no API/ABI breakage accour between releases.
2019-03-01 19:48:29 +01:00
Nick Hill
35161ad174 Further reduce ensureAccessible() overhead (#8895)
Motivation:

This PR fixes some non-negligible overhead discovered in the ByteBuf
accessibility (non-zero refcount) checking. The cause turned out to be
mostly twofold:
- Unnecessary operations used to calculate the refcount from the "raw"
encoded int field value
- Call stack depths exceeding the default limit for inlining, in some
places (CompositeByteBuf in particular)

It's a follow-on from #8882 which uses the maxCapacity field for a
simpler non-negative check. The performance gap between these two
variants appears to be _mostly_ closed, but there's one exception which
may warrant further analysis.

Modifications:

- Replace ABB.internalRefCount() with ByteBuf.isAccessible(), the
default still checks for non-zero refCnt()
- Just test for parity of raw refCnt instead of converting to "real",
with fast-path for specific small values
- Make sure isAccessible() is delegated by derived/wrapper ByteBufs
- Use existing freed flag in CompositeByteBuf for faster isAccessible()
- Manually inline some calls in methods like CompositeByteBuf.setLong()
and AbstractReferenceCountedByteBuf.isAccessible() to reduce stack
depths (to ensure default inlining limit isn't hit)
- Add ByteBufAccessBenchmark which is an extension of
UnsafeByteBufBenchmark (maybe latter could now be removed)

Results:

Before:

Benchmark   (bufferType)  (checkAccessible)  (checkBounds)   Mode  Cnt
Score          Error  Units
readBatch         UNSAFE               true           true  thrpt   30
84524972.863 ±   518338.811  ops/s
readBatch   UNSAFE_SLICE               true           true  thrpt   30
38608795.037 ±   298176.974  ops/s
readBatch           HEAP               true           true  thrpt   30
80003697.649 ±   974674.119  ops/s
readBatch      COMPOSITE               true           true  thrpt   30
18495554.788 ±   108075.023  ops/s
setGetLong        UNSAFE               true           true  thrpt   30
247069881.578 ± 10839162.593  ops/s
setGetLong  UNSAFE_SLICE               true           true  thrpt   30
196355905.206 ±  1802420.990  ops/s
setGetLong          HEAP               true           true  thrpt   30
245686644.713 ± 11769311.527  ops/s
setGetLong     COMPOSITE               true           true  thrpt   30
83170940.687 ±   657524.123  ops/s
setLong           UNSAFE               true           true  thrpt   30
278940253.918 ±  1807265.259  ops/s
setLong     UNSAFE_SLICE               true           true  thrpt   30
202556738.764 ± 11887973.563  ops/s
setLong             HEAP               true           true  thrpt   30
280045958.053 ±  2719583.400  ops/s
setLong        COMPOSITE               true           true  thrpt   30
121299806.002 ±  2155084.707  ops/s


After:

Benchmark   (bufferType)  (checkAccessible)  (checkBounds)   Mode  Cnt
Score          Error  Units
readBatch         UNSAFE               true           true  thrpt   30
101641801.035 ±  3950050.059  ops/s
readBatch   UNSAFE_SLICE               true           true  thrpt   30
84395902.846 ±  4339579.057  ops/s
readBatch           HEAP               true           true  thrpt   30
100179060.207 ±  3222487.287  ops/s
readBatch      COMPOSITE               true           true  thrpt   30
42288494.472 ±   294919.633  ops/s
setGetLong        UNSAFE               true           true  thrpt   30
304530755.027 ±  6574163.899  ops/s
setGetLong  UNSAFE_SLICE               true           true  thrpt   30
212028547.645 ± 14277828.768  ops/s
setGetLong          HEAP               true           true  thrpt   30
309335422.609 ±  2272150.415  ops/s
setGetLong     COMPOSITE               true           true  thrpt   30
160383609.236 ±   966484.033  ops/s
setLong           UNSAFE               true           true  thrpt   30
298055969.747 ±  7437449.627  ops/s
setLong     UNSAFE_SLICE               true           true  thrpt   30
223784178.650 ±  9869750.095  ops/s
setLong             HEAP               true           true  thrpt   30
302543263.328 ±  8140104.706  ops/s
setLong        COMPOSITE               true           true  thrpt   30
157083673.285 ±  3528779.522  ops/s

There's also a similar knock-on improvement to other benchmarks (e.g.
HPACK encoding/decoding) as shown in #8882.

For sanity I did a final comparison of the "fast path" tweak using one
of the HPACK benchmarks:

(rawCnt & 1) == 0:

Benchmark                     (limitToAscii)  (sensitive)  (size)   Mode
Cnt      Score     Error  Units
HpackDecoderBenchmark.decode            true         true  MEDIUM  thrpt
30  50914.479 ± 940.114  ops/s


rawCnt == 2 || rawCnt == 4 || rawCnt == 6 || rawCnt == 8 ||  (rawCnt &
1) == 0:

Benchmark                     (limitToAscii)  (sensitive)  (size)   Mode
Cnt      Score      Error  Units
HpackDecoderBenchmark.decode            true         true  MEDIUM  thrpt
30  60036.425 ± 1478.196  ops/s
2019-02-28 20:41:16 +01:00
Dmitriy Dumanskiy
116f72db8d Legacy properties removed (#8839)
Motivation:

We can remove some properties for which we introduced replacements.

Modifications:

io.netty.buffer.bytebuf.checkAccessible, io.netty.leakDetectionLevel, org.jboss.netty.tryUnsafe properties removed

Result:

Code cleanup
2019-02-04 13:56:15 +01:00
田欧
e8efcd82a8 migrate java8: use requireNonNull (#8840)
Motivation:

We can just use Objects.requireNonNull(...) as a replacement for ObjectUtil.checkNotNull(....)

Modifications:

- Use Objects.requireNonNull(...)

Result:

Less code to maintain.
2019-02-04 10:32:25 +01:00
Dmitriy Dumanskiy
2e433889b2 Improve DateFormatter parsing performance (#8821)
Motivation:

Just was looking through code and found 1 interesting place DateFormatter.tryParseMonth that was not very effective, so I decided to optimize it a bit.

Modification:

Changed DateFormatter.tryParseMonth method. Instead of invocation regionMatch() for every month - compare chars one by one.

Result:

DateFormatter.parseHttpDate method performance improved from ~3% to ~15%.

Benchmark                                                                (DATE_STRING)   Mode  Cnt        Score       Error  Units
DateFormatter2Benchmark.parseHttpHeaderDateFormatter     Sun, 27 Jan 2016 19:18:46 GMT  thrpt    6  4142781.221 ± 82155.002  ops/s
DateFormatter2Benchmark.parseHttpHeaderDateFormatter     Sun, 27 Dec 2016 19:18:46 GMT  thrpt    6  3781810.558 ± 38679.061  ops/s
DateFormatter2Benchmark.parseHttpHeaderDateFormatterNew  Sun, 27 Jan 2016 19:18:46 GMT  thrpt    6  4372569.705 ± 30257.537  ops/s
DateFormatter2Benchmark.parseHttpHeaderDateFormatterNew  Sun, 27 Dec 2016 19:18:46 GMT  thrpt    6  4339785.100 ± 57542.660  ops/s
2019-02-04 10:04:35 +01:00
Norman Maurer
e3846c54f6
Remove ChannelHandler.exceptionCaught(...) as it should only exist in… (#8822)
Motivation:

ChannelHandler.exceptionCaught(...) was marked as @deprecated as it should only exist in inbound handlers.

Modifications:

Remove ChannelHandler.exceptionCaught(...) and adjust code / tests.

Result:

Fixes https://github.com/netty/netty/issues/8527
2019-01-31 20:29:17 +01:00
Dmitriy Dumanskiy
67b23ab056 Remove HttpHeaderDateFormat class (#8807)
Motivation:

HttpHeaderDateFormat was replaced with DateFormatter many days ago and now can be easily removed.

Modification:

Remove deprecated class and related test / benchmark

Result:

Less code to maintain
2019-01-31 07:22:20 +01:00
田欧
6222101924 migrate java8: use lambda and method reference (#8781)
Motivation:

We can use lambdas now as we use Java8.

Modification:

use lambda function for all package, #8751 only migrate transport package.

Result:

Code cleanup.
2019-01-29 14:06:05 +01:00
Norman Maurer
310f31b392
Update to new checkstyle plugin (#8777)
Motivation:

We need to update to a new checkstyle plugin to allow the usage of lambdas.

Modifications:

- Update to new plugin version.
- Fix checkstyle problems.

Result:

Be able to use checkstyle plugin which supports new Java syntax.
2019-01-24 16:24:19 +01:00
Norman Maurer
3d6e6136a9
Decouple EventLoop details from the IO handling for each transport to… (#8680)
* Decouble EventLoop details from the IO handling for each transport to allow easy re-use of code and customization

Motiviation:

As today extending EventLoop implementations to add custom logic / metrics / instrumentations is only possible in a very limited way if at all. This is due the fact that most implementations are final or even package-private. That said even if these would be public there are the ability to do something useful with these is very limited as the IO processing and task processing are very tightly coupled. All of the mentioned things are a big pain point in netty 4.x and need improvement.

Modifications:

This changeset decoubled the IO processing logic from the task processing logic for the main transport (NIO, Epoll, KQueue) by introducing the concept of an IoHandler. The IoHandler itself is responsible to wait for IO readiness and process these IO events. The execution of the IoHandler itself is done by the SingleThreadEventLoop as part of its EventLoop processing. This allows to use the same EventLoopGroup (MultiThreadEventLoupGroup) for all the mentioned transports by just specify a different IoHandlerFactory during construction.

Beside this core API change this changeset also allows to easily extend SingleThreadEventExecutor / SingleThreadEventLoop to add custom logic to it which then can be reused by all the transports. The ideas are very similar to what is provided by ScheduledThreadPoolExecutor (that is part of the JDK). This allows for example things like:

  * Adding instrumentation / metrics:
    * how many Channels are registered on an SingleThreadEventLoop
    * how many Channels were handled during the IO processing in an EventLoop run
    * how many task were handled during the last EventLoop / EventExecutor run
    * how many outstanding tasks we have
    ...
    ...
  * Implementing custom strategies for choosing the next EventExecutor / EventLoop to use based on these metrics.
  * Use different Promise / Future / ScheduledFuture implementations
  * decorate Runnable / Callables when submitted to the EventExecutor / EventLoop

As a lot of functionalities are folded into the MultiThreadEventLoopGroup and SingleThreadEventLoopGroup this changeset also removes:

  * AbstractEventLoop
  * AbstractEventLoopGroup
  * EventExecutorChooser
  * EventExecutorChooserFactory
  * DefaultEventLoopGroup
  * DefaultEventExecutor
  * DefaultEventExecutorGroup

Result:

Fixes https://github.com/netty/netty/issues/8514 .
2019-01-23 08:32:05 +01:00
Dmitriy Dumanskiy
7b92ff2500 Java 8 migration. Remove ThreadLocalProvider and inline java.util.concurrent.ThreadLocalRandom.current() where necessary. (#8762)
Motivation:

Custom Netty ThreadLocalRandom and ThreadLocalRandomProvider classes are no longer needed and can be removed.

Modification:

Remove own ThreadLocalRandom

Result:

Less code to maintain
2019-01-22 20:14:28 +01:00
田欧
9d62deeb6f Java 8 migration: Use diamond operator (#8749)
Motivation:

We can use the diamond operator these days.

Modification:

Use diamond operator whenever possible.

Result:

More modern code and less boiler-plate.
2019-01-22 16:07:26 +01:00
Norman Maurer
8fdf373557
Skip execution of Channel*Handler method if annotated with @Skip and just use the next handler in the pipeline. (#8723)
Motivation:

Invoking ChannelHandlers is not free and can result in some overhead when the ChannelPipeline becomes very long. This is especially true if most handlers will just forward the call to the next handler in the pipeline. When the user extends Channel*HandlerAdapter we can easily detect if can just skip the handler and invoke the next handler in the pipeline directly. This reduce the overhead of dispatch but also reduce the call-stack in many cases.

Modifications:

Detect if we can skip the handler when walking the pipeline.

Result:

Reduce overhead for long pipelines.

Benchmark                                       (extraHandlers)   Mode  Cnt       Score      Error  Units
DefaultChannelPipelineBenchmark.propagateEventOld             4  thrpt   10  267313.031 ± 9131.140  ops/s
DefaultChannelPipelineBenchmark.propagateEvent                4  thrpt   10  824825.673 ± 12727.594  ops/s
2019-01-22 08:58:58 +01:00
Norman Maurer
1fe931b6e2
Make it possible to use a wrapped EventLoop with a Channel (#8677)
Motiviation:

Because of how we implemented the registration / deregistration of an EventLoop it was not possible to wrap an EventLoop implementation and use it with a Channel.

Modification:

- Introduce EventLoop.Unsafe which is responsible for the actual registration.
- Move validation of EventLoop / Channel combo to the EventLoop
- Add unit test that verifies that wrapping works

Result:

Be able to wrap an EventLoop and so add some extra functionality.
2019-01-17 09:17:51 +01:00