Commit Graph

97 Commits

Author SHA1 Message Date
Nick Hill
583d838f7c Optimize AbstractByteBuf.getCharSequence() in US_ASCII case (#8392)
* Optimize AbstractByteBuf.getCharSequence() in US_ASCII case

Motivation:

Inspired by https://github.com/netty/netty/pull/8388, I noticed this
simple optimization to avoid char[] allocation (also suggested in a TODO
here).

Modifications:

Return an AsciiString from AbstractByteBuf.getCharSequence() if
requested charset is US_ASCII or ISO_8859_1 (latter thanks to
@Scottmitch's suggestion). Also tweak unit tests not to require Strings
and include a new benchmark to demonstrate the speedup.

Result:

Speed-up of AbstractByteBuf.getCharSequence() in ascii and iso 8859/1
cases
2018-10-26 15:32:38 -07:00
Norman Maurer
87ec2f882a
Reduce overhead by ByteBufUtil.decodeString(...) which is used by AbstractByteBuf.toString(...) and AbstractByteBuf.getCharSequence(...) (#8388)
Motivation:

Our current implementation that is used for toString(Charset) operations on AbstractByteBuf implementation is quite slow as it does a lot of uncessary memory copies. We should just use new String(...) as it has a lot of optimizations to handle these cases.

Modifications:

Rewrite ByteBufUtil.decodeString(...) to use new String(...)

Result:

Less overhead for toString(Charset) operations.

Benchmark                                         (charsetName)  (direct)  (size)   Mode  Cnt         Score         Error  Units
ByteBufUtilDecodeStringBenchmark.decodeString          US-ASCII     false       8  thrpt   20  22401645.093 ? 4671452.479  ops/s
ByteBufUtilDecodeStringBenchmark.decodeString          US-ASCII     false      64  thrpt   20  23678483.384 ? 3749164.446  ops/s
ByteBufUtilDecodeStringBenchmark.decodeString          US-ASCII      true       8  thrpt   20  15731142.651 ? 3782931.591  ops/s
ByteBufUtilDecodeStringBenchmark.decodeString          US-ASCII      true      64  thrpt   20  16244232.229 ? 1886259.658  ops/s
ByteBufUtilDecodeStringBenchmark.decodeString             UTF-8     false       8  thrpt   20  25983680.959 ? 5045782.289  ops/s
ByteBufUtilDecodeStringBenchmark.decodeString             UTF-8     false      64  thrpt   20  26235589.339 ? 2867004.950  ops/s
ByteBufUtilDecodeStringBenchmark.decodeString             UTF-8      true       8  thrpt   20  18499027.808 ? 4784684.268  ops/s
ByteBufUtilDecodeStringBenchmark.decodeString             UTF-8      true      64  thrpt   20  16825286.141 ? 1008712.342  ops/s
ByteBufUtilDecodeStringBenchmark.decodeString            UTF-16     false       8  thrpt   20   5789879.092 ? 1201786.359  ops/s
ByteBufUtilDecodeStringBenchmark.decodeString            UTF-16     false      64  thrpt   20   2173243.225 ?  417809.341  ops/s
ByteBufUtilDecodeStringBenchmark.decodeString            UTF-16      true       8  thrpt   20   5035583.011 ? 1001978.854  ops/s
ByteBufUtilDecodeStringBenchmark.decodeString            UTF-16      true      64  thrpt   20   2162345.301 ?  402410.408  ops/s
ByteBufUtilDecodeStringBenchmark.decodeString        ISO-8859-1     false       8  thrpt   20  30039052.376 ? 6539111.622  ops/s
ByteBufUtilDecodeStringBenchmark.decodeString        ISO-8859-1     false      64  thrpt   20  31414163.515 ? 2096710.526  ops/s
ByteBufUtilDecodeStringBenchmark.decodeString        ISO-8859-1      true       8  thrpt   20  19538587.855 ? 4639115.572  ops/s
ByteBufUtilDecodeStringBenchmark.decodeString        ISO-8859-1      true      64  thrpt   20  19467839.722 ? 1672687.213  ops/s
ByteBufUtilDecodeStringBenchmark.decodeStringOld       US-ASCII     false       8  thrpt   20  10787326.745 ? 1034197.864  ops/s
ByteBufUtilDecodeStringBenchmark.decodeStringOld       US-ASCII     false      64  thrpt   20   7129801.930 ? 1363019.209  ops/s
ByteBufUtilDecodeStringBenchmark.decodeStringOld       US-ASCII      true       8  thrpt   20   9002529.605 ? 2017642.445  ops/s
ByteBufUtilDecodeStringBenchmark.decodeStringOld       US-ASCII      true      64  thrpt   20   3860192.352 ?  826218.738  ops/s
ByteBufUtilDecodeStringBenchmark.decodeStringOld          UTF-8     false       8  thrpt   20  10532838.027 ? 2151743.968  ops/s
ByteBufUtilDecodeStringBenchmark.decodeStringOld          UTF-8     false      64  thrpt   20   7185554.597 ? 1387685.785  ops/s
ByteBufUtilDecodeStringBenchmark.decodeStringOld          UTF-8      true       8  thrpt   20   7352253.316 ? 1333823.850  ops/s
ByteBufUtilDecodeStringBenchmark.decodeStringOld          UTF-8      true      64  thrpt   20   2825578.707 ?  349701.156  ops/s
ByteBufUtilDecodeStringBenchmark.decodeStringOld         UTF-16     false       8  thrpt   20   7277446.665 ? 1447034.346  ops/s
ByteBufUtilDecodeStringBenchmark.decodeStringOld         UTF-16     false      64  thrpt   20   2445929.579 ?  562816.641  ops/s
ByteBufUtilDecodeStringBenchmark.decodeStringOld         UTF-16      true       8  thrpt   20   6201174.401 ? 1236137.786  ops/s
ByteBufUtilDecodeStringBenchmark.decodeStringOld         UTF-16      true      64  thrpt   20   2310674.973 ?  525587.959  ops/s
ByteBufUtilDecodeStringBenchmark.decodeStringOld     ISO-8859-1     false       8  thrpt   20  11142625.392 ? 1680556.468  ops/s
ByteBufUtilDecodeStringBenchmark.decodeStringOld     ISO-8859-1     false      64  thrpt   20   8127116.405 ? 1128513.860  ops/s
ByteBufUtilDecodeStringBenchmark.decodeStringOld     ISO-8859-1      true       8  thrpt   20   9405751.952 ? 2193324.806  ops/s
ByteBufUtilDecodeStringBenchmark.decodeStringOld     ISO-8859-1      true      64  thrpt   20   3943282.076 ?  737798.070  ops/s

Benchmark result is saved to /home/norman/mainframer/netty/microbench/target/reports/performance/ByteBufUtilDecodeStringBenchmark.json
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1,030.173 sec - in io.netty.buffer.ByteBufUtilDecodeStringBenchmark
[1030.460s][info   ][gc,heap,exit ] Heap
[1030.460s][info   ][gc,heap,exit ]  garbage-first heap   total 516096K, used 257918K [0x0000000609a00000, 0x0000000800000000)
[1030.460s][info   ][gc,heap,exit ]   region size 2048K, 127 young (260096K), 2 survivors (4096K)
[1030.460s][info   ][gc,heap,exit ]  Metaspace       used 17123K, capacity 17438K, committed 17792K, reserved 1064960K
[1030.460s][info   ][gc,heap,exit ]   class space    used 1709K, capacity 1827K, committed 1920K, reserved 1048576K
2018-10-19 14:00:13 +02:00
Norman Maurer
e542a2cf26
Use a non-volatile read for ensureAccessible() whenever possible to reduce overhead and allow better inlining. (#8266)
Motiviation:

At the moment whenever ensureAccessible() is called in our ByteBuf implementations (which is basically on each operation) we will do a volatile read. That per-se is not such a bad thing but the problem here is that it will also reduce the the optimizations that the compiler / jit can do. For example as these are volatile it can not eliminate multiple loads of it when inline the methods of ByteBuf which happens quite frequently because most of them a quite small and very hot. That is especially true for all the methods that act on primitives.

It gets even worse as people often call a lot of these after each other in the same method or even use method chaining here.

The idea of the change is basically just ue a non-volatile read for the ensureAccessible() check as its a best-effort implementation to detect acting on already released buffers anyway as even with a volatile read it could happen that the user will release it in another thread before we actual access the buffer after the reference check.

Modifications:

- Try to do a non-volatile read using sun.misc.Unsafe if we can use it.
- Add a benchmark

Result:

Big performance win when multiple ByteBuf methods are called from a method.

With the change:
UnsafeByteBufBenchmark.setGetLongUnsafeByteBuf  thrpt   20  281395842,128 ± 5050792,296  ops/s

Before the change:
UnsafeByteBufBenchmark.setGetLongUnsafeByteBuf  thrpt   20  217419832,801 ± 5080579,030  ops/s
2018-09-07 07:47:02 +02:00
Norman Maurer
02d559e6a4
Remove flags when running benchmarks. (#8262)
Motivation:

Some of the flags we used are not supported anymore on more recent JDK versions. We should just remove all of them and only keep what we really need. This may also reflect better what people use in production.

Modifications:

Remove some flags when running the benchmarks.

Result:

Benchmarks also run with JDK11.
2018-09-05 19:05:02 +02:00
Carl Mastrangelo
379a56ca49 Add an Epoll benchmark
Motivation:
Optimizing the Epoll channel needs an objective measure of how fast
it is.

Modification:
Add a simple, closed loop,  ping-pong benchmark.

Result:
Benchmark can be used to measure #7816

Initial numbers:

```
Result "io.netty.microbench.channel.epoll.EpollSocketChannelBenchmark.pingPong":
  22614.403 ±(99.9%) 797.263 ops/s [Average]
  (min, avg, max) = (21093.160, 22614.403, 24977.387), stdev = 918.130
  CI (99.9%): [21817.140, 23411.666] (assumes normal distribution)

Benchmark                              Mode  Cnt      Score     Error  Units
EpollSocketChannelBenchmark.pingPong  thrpt   20  22614.403 ± 797.263  ops/s
```
2018-09-04 10:15:15 +02:00
Francesco Nigro
c78be33443 Added configurable ByteBuf bounds checking (#7521)
Motivation:

The JVM isn't always able to hoist out/reduce bounds checking (due to ref counting operations etc etc) hence making it configurable could improve performances for most CPU intensive use cases.

Modifications:

Each AbstractByteBuf bounds check has been tested against a new static final configuration property similar to checkAccessible ie io.netty.buffer.bytebuf.checkBounds.

Result:

Any user could disable ByteBuf bounds checking in order to get extra performances.
2018-09-03 20:33:47 +02:00
Norman Maurer
83710cb2e1
Replace toArray(new T[size]) with toArray(new T[0]) to eliminate zero-out and allow the VM to optimize. (#8075)
Motivation:

Using toArray(new T[0]) is usually the faster aproach these days. We should use it.

See also https://shipilev.net/blog/2016/arrays-wisdom-ancients/#_conclusion.

Modifications:

Replace toArray(new T[size]) with toArray(new T[0]).

Result:

Faster code.
2018-06-29 07:56:04 +02:00
unknown
4a8d3a274c Including the setup code in the benchmark method to avoid JMH Invocation level hiccups.
Motivation:

The usage of Invocation level for JMH fixture methods (setup/teardown) inccurs in a significant overhead
in the benchmark time (see org.openjdk.jmh.annotations.Level documentation).

In the case of CodecInputListBenchmark, benchmarks are far too small (less than 50ns) and the Invocation
level setup offsets the measurement considerably.
On such cases, the recommended fix patch is to include the setup/teardown code in the benchmark method.

Modifications:

Include the setup/teardown code in the relevant benchmark methods.
Remove the setup/teardown methods from the benchmark class.

Result:

We run the entire benchmark 10 times with default parameters we observed:
- ArrayList benchmark affected directly by JMH overhead is now from 15-80% faster.
- CodecList benchmark is now 50% faster than original (even with the setup code being measured).
- Recyclable ArrayList is ~30% slower.
- All benchmarks have significant different means (ANOVA) and medians (Moore)

Mode: Throughput (Higher the better)

Method	              Full params		Factor	    Modified (Median)	Original (Median)
recyclableArrayList	 (elements = 1)		0.615520967	21719082.75	        35285691.2
recyclableArrayList	 (elements = 4)		0.699553431	17149442.76	        24514843.31
arrayList	         (elements = 4)		1.152666631	27120407.18	        23528404.88
codecOutList	     (elements = 1)		1.527275908	67251089.04	        44033359.47
codecOutList	     (elements = 4)		1.596917095	59174088.78	        37055204.03
arrayList	         (elements = 1)		1.878616889	62188238.24	        33103204.06

Environment:
Tests run on a Computational server with CPU: E5-1660-3.3GHZ  (6 cores + HT), 64 GB RAM.
2018-06-21 12:22:13 +02:00
unknown
cb420a9ffc Including the setup code in the benchmark method to avoid JMH Invocation level hiccups.
Motivation:

The usage of Invocation level for JMH fixture methods (setup/teardown) inccurs in a significant impact in
in the benchmark time (see org.openjdk.jmh.annotations.Level documentation).

When the benchmark and the setup/teardown is too small (less than a milisecond) the Invocation level might saturate the system with
timestamp requests and iteration synchronizations which introduce artificial latency, throughput, and scalability bottlenecks.

In the HeadersBenchmark, all benchmarks take less than 100ns and the Invocation level setup offsets the measurement considerably.
As fixture methods is defined for the entire class, this overhead also impacts every single benchmark in this class, not only
the ones that use the emptyHttpHeaders object (cleaned in the setup).

The recommended fix patch here is to include the setup/teardown code in the benchmark where the object is used.

Modifications:

Include the setup/teardown code in the relevant benchmark methods.
Remove the setup/teardown method of Invocation level from the benchmark class.

Result:

We run all benchmarks from HeadersBenchmark 10 times with default parameter, we observe:
- Benchmarks that were not directly affected by the fix patch, improved execution time.
    For instance, http2Remove with (exampleHeader = THREE) had its median reported as 2x faster than the original version.
- Benchmarks that had the setup code inserted (eg. http2AddAllFastest) did not suffer a significant punch in the execution time,
as the benchmarks are not dominated by the clear().

Environment:
Tests run on a Computational server with CPU: E5-1660-3.3GHZ  (6 cores + HT), 64 GB RAM.
2018-06-21 12:21:19 +02:00
Scott Mitchell
9d51a40df0 Update NetUtilBenchmark (#7826)
Motivation:
NetUtilBenchmark is using out of date data, throws an exception in the benchmark, and allocates a Set on each run.

Modifications:
- Update the benchmark and reduce each run's overhead

Result:
NetUtilBenchmark is updated.
2018-03-31 08:27:08 +02:00
Francesco Nigro
ed46c4ed00 Copies from read-only heap ByteBuffer to direct ByteBuf can avoid stealth ByteBuf allocation and additional copies
Motivation:

Read-only heap ByteBuffer doesn't expose array: the existent method to perform copies to direct ByteBuf involves the creation of a (maybe pooled) additional heap ByteBuf instance and copy

Modifications:

To avoid stressing the allocator with additional (and stealth) heap ByteBuf allocations is provided a method to perform copies using the (pooled) internal NIO buffer

Result:

Copies from read-only heap ByteBuffer to direct ByteBuf won't create any intermediate ByteBuf
2018-02-27 09:54:21 +09:00
Julien Hoarau
3e6b54bb59 Fix failing h2spec tests 8.1.2.1 related to pseudo-headers validation
Motivation:

According to the spec:
All pseudo-header fields MUST appear in the header block before regular
header fields. Any request or response that contains a pseudo-header
field that appears in a header block after
a regular header field MUST be treated as malformed (Section 8.1.2.6).

Pseudo-header fields are only valid in the context in which they are defined.
Pseudo-header fields defined for requests MUST NOT appear in responses;
pseudo-header fields defined for responses MUST NOT appear in requests.
Pseudo-header fields MUST NOT appear in trailers.
Endpoints MUST treat a request or response that contains undefined or
invalid pseudo-header fields as malformed (Section 8.1.2.6).

Clients MUST NOT accept a malformed response. Note that these requirements
are intended to protect against several types of common attacks against HTTP;
they are deliberately strict because being permissive can expose
implementations to these vulnerabilities.

Modifications:

- Introduce validation in HPackDecoder

Result:

- Requests with unknown pseudo-field headers are rejected
- Requests with containing response specific pseudo-headers are rejected
- Requests where pseudo-header appear after regular header are rejected
- h2spec 8.1.2.1 pass
2018-01-29 19:42:56 -08:00
Norman Maurer
4c1e0f596a Use FastThreadLocal for CodecOutputList
Motivation:

We used Recycler for the CodecOutputList which is not optimized for the use-case of access only from the same Thread all the time.

Modifications:

- Use FastThreadLocal for CodecOutputList
- Add benchmark

Result:

Less overhead in our codecs.
2018-01-23 11:34:28 +01:00
Francesco Nigro
1cf2687244 Fixed JMH ByteBuf benchmark to avoid dead code elimination
Motivation:

The JMH doc suggests to use BlackHoles to avoid dead code elimination hence would be better to follow this best practice.

Modifications:

Each benchmark method is returning the ByteBuf/ByteBuffer to avoid the JVM to perform any dead code elimination.

Result:

The results are more reliable and comparable to the others provided by other ByteBuf benchmarks (eg HeapByteBufBenchmark)
2017-12-19 14:09:18 +01:00
Scott Mitchell
55ef09f191 Add HttpObjectEncoderBenchmark
Motivation:
Benchmark to measure HttpObjectEncoder performance.

Modifications:
- Create new benchmark HttpObjectEncoderBenchmark

Result:
JMH Microbenchmark for HttpObjectEncoder.
2017-12-16 13:47:34 +01:00
Scott Mitchell
5f0342ebe0 Add RedisEncoderBenchmark
Motivation:
Add a benchmark to measure RedisEncoder's performance

Modifications:
- Add RedisEncoderBenchmark

Result:
JMH benchmark exists to measure RedisEncoder's performance.
2017-12-16 13:42:50 +01:00
Scott Mitchell
93b144b7b4 HttpMethod#valueOf improvement
Motivation:
HttpMethod#valueOf shows up on profiler results in the top set of
results. Since it is a relatively simple operation it can be improved in
isolation.

Modifications:
- Introduce a special case map which assigns each HttpMethod to a unique
index in an array and provides constant time lookup from a hash code
algorithm. When the bucket is matched we can then directly do equality
comparison instead of potentially following a linked structure when
HashMap has hash collisions.

Result:
~10% improvement in benchmark results for HttpMethod#valueOf

Benchmark                                     Mode  Cnt   Score   Error   Units
HttpMethodMapBenchmark.newMapKnownMethods    thrpt   16  31.831 ± 0.928  ops/us
HttpMethodMapBenchmark.newMapMixMethods      thrpt   16  25.568 ± 0.400  ops/us
HttpMethodMapBenchmark.newMapUnknownMethods  thrpt   16  51.413 ± 1.824  ops/us
HttpMethodMapBenchmark.oldMapKnownMethods    thrpt   16  29.226 ± 0.330  ops/us
HttpMethodMapBenchmark.oldMapMixMethods      thrpt   16  21.073 ± 0.247  ops/us
HttpMethodMapBenchmark.oldMapUnknownMethods  thrpt   16  49.081 ± 0.577  ops/us
2017-11-20 11:07:50 -08:00
Scott Mitchell
e6126215e0 DefaultHttp2FrameWriter reduce object allocation
Motivation:
DefaultHttp2FrameWriter#writeData allocates a DataFrameHeader for each write operation. DataFrameHeader maintains internal state and allocates multiple slices of a buffer which is a maximum of 30 bytes. This 30 byte buffer may not always be necessary and the additional slice operations can utilize retainedSlice to take advantage of pooled objects. We can also save computation and object allocations if there is no padding which is a common case in practice.

Modifications:
- Remove DataFrameHeader
- Add a fast path for padding == 0

Result:
Less object allocation in DefaultHttp2FrameWriter
2017-11-20 08:10:59 -08:00
Anuraag Agrawal
1f1a60ae7d Use Netty's DefaultPriorityQueue instead of JDK's PriorityQueue for scheduled tasks
Motivation:

`AbstractScheduledEventExecutor` uses a standard `java.util.PriorityQueue` to keep track of task deadlines. `ScheduledFuture.cancel` removes tasks from this `PriorityQueue`. Unfortunately, `PriorityQueue.remove` has `O(n)` performance since it must search for the item in the entire queue before removing it. This is fast when the future is at the front of the queue (e.g., already triggered) but not when it's randomly located in the queue.

Many servers will use `ScheduledFuture.cancel` on all requests, e.g., to manage a request timeout. As these cancellations will be happen in arbitrary order, when there are many scheduled futures, `PriorityQueue.remove` is a bottleneck and greatly hurts performance with many concurrent requests (>10K).

Modification:

Use netty's `DefaultPriorityQueue` for scheduling futures instead of the JDK. `DefaultPriorityQueue` is almost identical to the JDK version except it is able to remove futures without searching for them in the queue. This means `DefaultPriorityQueue.remove` has `O(log n)` performance.

Result:

Before - cancelling futures has varying performance, capped at `O(n)`
After - cancelling futures has stable performance, capped at `O(log n)`

Benchmark results

After - cancelling in order and in reverse order have similar performance within `O(log n)` bounds
```
Benchmark                                           (num)   Mode  Cnt       Score      Error  Units
ScheduledFutureTaskBenchmark.cancelInOrder            100  thrpt   20  137779.616 ± 7709.751  ops/s
ScheduledFutureTaskBenchmark.cancelInOrder           1000  thrpt   20   11049.448 ±  385.832  ops/s
ScheduledFutureTaskBenchmark.cancelInOrder          10000  thrpt   20     943.294 ±   12.391  ops/s
ScheduledFutureTaskBenchmark.cancelInOrder         100000  thrpt   20      64.210 ±    1.824  ops/s
ScheduledFutureTaskBenchmark.cancelInReverseOrder     100  thrpt   20  167531.096 ± 9187.865  ops/s
ScheduledFutureTaskBenchmark.cancelInReverseOrder    1000  thrpt   20   33019.786 ± 4737.770  ops/s
ScheduledFutureTaskBenchmark.cancelInReverseOrder   10000  thrpt   20    2976.955 ±  248.555  ops/s
ScheduledFutureTaskBenchmark.cancelInReverseOrder  100000  thrpt   20     362.654 ±   45.716  ops/s
```

Before - cancelling in order and in reverse order have significantly different performance at higher queue size, orders of magnitude worse than the new implementation.
```
Benchmark                                           (num)   Mode  Cnt       Score       Error  Units
ScheduledFutureTaskBenchmark.cancelInOrder            100  thrpt   20  139968.586 ± 12951.333  ops/s
ScheduledFutureTaskBenchmark.cancelInOrder           1000  thrpt   20   12274.420 ±   337.800  ops/s
ScheduledFutureTaskBenchmark.cancelInOrder          10000  thrpt   20     958.168 ±    15.350  ops/s
ScheduledFutureTaskBenchmark.cancelInOrder         100000  thrpt   20      53.381 ±    13.981  ops/s
ScheduledFutureTaskBenchmark.cancelInReverseOrder     100  thrpt   20  123918.829 ±  3642.517  ops/s
ScheduledFutureTaskBenchmark.cancelInReverseOrder    1000  thrpt   20    5099.810 ±   206.992  ops/s
ScheduledFutureTaskBenchmark.cancelInReverseOrder   10000  thrpt   20      72.335 ±     0.443  ops/s
ScheduledFutureTaskBenchmark.cancelInReverseOrder  100000  thrpt   20       0.743 ±     0.003  ops/s
```
2017-11-10 23:09:32 -08:00
Carl Mastrangelo
83a19d5650 Optimistically update ref counts
Motivation:
Highly retained and released objects have contention on their ref
count.  Currently, the ref count is updated using compareAndSet
with care to make sure the count doesn't overflow, double free, or
revive the object.

Profiling has shown that a non trivial (~1%) of CPU time on gRPC
latency benchmarks is from the ref count updating.

Modification:
Rather than pessimistically assuming the ref count will be invalid,
optimistically update it assuming it will be.  If the update was
wrong, then use the slow path to revert the change and throw an
execption.  Most of the time, the ref counts are correct.

This changes from using compareAndSet to getAndAdd, which emits a
different CPU instruction on x86 (CMPXCHG to XADD).  Because the
CPU knows it will modifiy the memory, it can avoid contention.

On a highly contended machine, this can be about 2x faster.

There is a downside to the new approach.  The ref counters can
temporarily enter invalid states if over retained or over released.
The code does handle these overflow and underflow scenarios, but it
is possible that another concurrent access may push the failure to
a different location.  For example:

Time 1 Thread 1: obj.retain(INT_MAX - 1)
Time 2 Thread 1: obj.retain(2)
Time 2 Thread 2: obj.retain(1)

Previously Thread 2 would always succeed and Thread 1 would always
fail on the second access.  Now, thread 2 could fail while thread 1
is rolling back its change.

====

There are a few reasons why I think this is okay:

1. Buggy code is going to have bugs.  An exception _is_ going to be
   thrown.  This just causes the other threads to notice the state
   is messed up and stop early.
2. If high retention counts are a use case, then ref count should
   be a long rather than an int.
3. The critical section is greatly reduced compared to the previous
   version, so the likelihood of this happening is lower
4. On error, the code always rollsback the change atomically, so
   there is no possibility of corruption.

Result:
Faster refcounting

```
BEFORE:

Benchmark                                                                                             (delay)    Mode      Cnt         Score    Error  Units
AbstractReferenceCountedByteBufBenchmark.retainRelease_contended                                            1  sample  2901361       804.579 ±  1.835  ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_contended                                           10  sample  3038729       785.376 ± 16.471  ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_contended                                          100  sample  2899401       817.392 ±  6.668  ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_contended                                         1000  sample  3650566      2077.700 ±  0.600  ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_contended                                        10000  sample  3005467     19949.334 ±  4.243  ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_uncontended                                          1  sample   456091        48.610 ±  1.162  ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_uncontended                                         10  sample   732051        62.599 ±  0.815  ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_uncontended                                        100  sample   778925       228.629 ±  1.205  ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_uncontended                                       1000  sample   633682      2002.987 ±  2.856  ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_uncontended                                      10000  sample   506442     19735.345 ± 12.312  ns/op

AFTER:
Benchmark                                                                                             (delay)    Mode      Cnt         Score    Error  Units
AbstractReferenceCountedByteBufBenchmark.retainRelease_contended                                            1  sample  3761980       383.436 ±  1.315  ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_contended                                           10  sample  3667304       474.429 ±  1.101  ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_contended                                          100  sample  3039374       479.267 ±  0.435  ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_contended                                         1000  sample  3709210      2044.603 ±  0.989  ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_contended                                        10000  sample  3011591     19904.227 ± 18.025  ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_uncontended                                          1  sample   494975        52.269 ±  8.345  ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_uncontended                                         10  sample   771094        62.290 ±  0.795  ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_uncontended                                        100  sample   763230       235.044 ±  1.552  ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_uncontended                                       1000  sample   634037      2006.578 ±  3.574  ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_uncontended                                      10000  sample   506284     19742.605 ± 13.729  ns/op

```
2017-10-04 08:42:33 +02:00
Norman Maurer
3c8c7fc7e9 Reduce performance overhead of ResourceLeakDetector
Motiviation:

The ResourceLeakDetector helps to detect and troubleshoot resource leaks and is often used even in production enviroments with a low level. Because of this its import that we try to keep the overhead as low as overhead. Most of the times no leak is detected (as all is correctly handled) so we should keep the overhead for this case as low as possible.

Modifications:

- Only call getStackTrace() if a leak is reported as it is a very expensive native call. Also handle the filtering and creating of the String in a lazy fashion
- Remove the need to mantain a Queue to store the last access records
- Add benchmark

Result:

Huge decrease of performance overhead.

Before the patch:

Benchmark                                           (recordTimes)   Mode  Cnt     Score     Error  Units
ResourceLeakDetectorRecordBenchmark.record                      8  thrpt   20  4358.367 ± 116.419  ops/s
ResourceLeakDetectorRecordBenchmark.record                     16  thrpt   20  2306.027 ±  55.044  ops/s
ResourceLeakDetectorRecordBenchmark.recordWithHint              8  thrpt   20  4220.979 ± 114.046  ops/s
ResourceLeakDetectorRecordBenchmark.recordWithHint             16  thrpt   20  2250.734 ±  55.352  ops/s

With this patch:

Benchmark                                           (recordTimes)   Mode  Cnt      Score      Error  Units
ResourceLeakDetectorRecordBenchmark.record                      8  thrpt   20  71398.957 ± 2695.925  ops/s
ResourceLeakDetectorRecordBenchmark.record                     16  thrpt   20  38643.963 ± 1446.694  ops/s
ResourceLeakDetectorRecordBenchmark.recordWithHint              8  thrpt   20  71677.882 ± 2923.622  ops/s
ResourceLeakDetectorRecordBenchmark.recordWithHint             16  thrpt   20  38660.176 ± 1467.732  ops/s
2017-09-18 16:36:19 -07:00
Nikolay Fedorovskikh
df568c739e Use ByteBuf#writeShort/writeMedium instead of writeBytes
Motivation:

1. Some encoders used a `ByteBuf#writeBytes` to write short constant byte array (2-3 bytes). This can be replaced with more faster `ByteBuf#writeShort` or `ByteBuf#writeMedium` which do not access the memory.
2. Two chained calls of the `ByteBuf#setByte` with constants can be replaced with one `ByteBuf#setShort` to reduce index checks.
3. The signature of method `HttpHeadersEncoder#encoderHeader` has an unnecessary `throws`.

Modifications:

1. Use `ByteBuf#writeShort` or `ByteBuf#writeMedium` instead of `ByteBuf#writeBytes` for the constants.
2. Use `ByteBuf#setShort` instead of chained call of the `ByteBuf#setByte` with constants.
3. Remove an unnecessary `throws` from `HttpHeadersEncoder#encoderHeader`.

Result:

A bit faster writes constants into buffers.
2017-07-10 14:37:41 +02:00
Dmitriy Dumanskiy
dd69a813d4 Performance improvement for HttpRequestEncoder. Insert char into the string optimized.
Motivation:

Right now HttpRequestEncoder does insertion of slash for url like http://localhost?pararm=1 before the question mark. It is done not effectively.

Modification:

Code:

new StringBuilder(len + 1)
                .append(uri, 0, index)
                .append(SLASH)
                .append(uri, index, len)
                .toString();
Replaced with:

new StringBuilder(uri)
                .insert(index, SLASH)
                .toString();
Result:

Faster HttpRequestEncoder. Additional small test. Attached benchmark in PR.

Benchmark                                      Mode  Cnt        Score        Error  Units
HttpRequestEncoderInsertBenchmark.newEncoder  thrpt   40  3704843.303 ±  98950.919  ops/s
HttpRequestEncoderInsertBenchmark.oldEncoder  thrpt   40  3284236.960 ± 134433.217  ops/s
2017-06-27 10:53:43 +02:00
Nikolay Fedorovskikh
aa38b6a769 Prevent unnecessary allocations in the StringUtil#escapeCsv
Motivation:

A `StringUtil#escapeCsv` creates new `StringBuilder` on each value even if the same string is returned in the end.

Modifications:

Create new `StringBuilder` only if it really needed. Otherwise, return the original string (or just trimmed substring).

Result:

Less GC load. Up to 4x faster work for not changed strings.
2017-06-13 14:57:38 -07:00
Dmitriy Dumanskiy
acc07fac32 disabling leak detection micro benchmark
Motivation:

When I run Netty micro benchmarks I get many warnings like:

WARNING: -Dio.netty.noResourceLeakDetection is deprecated. Use '-Dio.netty.leakDetection.level=simple' instead.

Modification:

-Dio.netty.noResourceLeakDetection replaced with -Dio.netty.leakDetection.level=disabled.

Result:

No warnings.
2017-06-09 18:03:54 +02:00
Nikolay Fedorovskikh
e4531918a3 Optimizations in NetUtil
Motivation:

IPv4/6 validation methods use allocations, which can be avoided.
IPv4 parse method use StringTokenizer.

Modifications:

Rewriting IPv4/6 validation methods to avoid allocations.
Rewriting IPv4 parse method without use StringTokenizer.

Result:

IPv4/6 validation and IPv4 parsing faster up to 2-10x.
2017-05-18 16:42:22 -07:00
Nikolay Fedorovskikh
0692bf1b6a fix the typos 2017-04-20 04:56:09 +02:00
Norman Maurer
e482d933f7 Add 'io.netty.tryAllocateUninitializedArray' system property which allows to allocate byte[] without memset in Java9+
Motivation:

Java9 added a new method to Unsafe which allows to allocate a byte[] without memset it. This can have a massive impact in allocation times when the byte[] is big. This change allows to enable this when using Java9 with the io.netty.tryAllocateUninitializedArray property when running Java9+. Please note that you will need to open up the jdk.internal.misc package via '--add-opens java.base/jdk.internal.misc=ALL-UNNAMED' as well.

Modifications:

Allow to allocate byte[] without memset on Java9+

Result:

Better performance when allocate big heap buffers and using java9.
2017-04-19 11:45:39 +02:00
Ade Setyawan Sajim
016629fe3b Replace system.out.println with InternalLoggerFactory
Motivation:

There are two files that still use `system.out.println` to log their status

Modification:

Replace `system.out.println` with a `debug` function inside an instance of `InternalLoggerFactory`

Result:

Introduce an instance of `InternalLoggerFactory` in class `AbstractMicrobenchmark.java` and `AbstractSharedExecutorMicrobenchmark.java`
2017-03-28 14:51:59 +02:00
Scott Mitchell
743d2d374c SslHandler benchmark and SslEngine multiple packets benchmark
Motivation:
We currently don't have a benchmark which includes SslHandler. The SslEngine benchmarks also always include a single TLS packet when encoding/decoding. In practice when reading data from the network there may be multiple TLS packets present and we should expand the benchmarks to understand this use case.

Modifications:
- SslEngine benchmarks should include wrapping/unwrapping of multiple TLS packets
- Introduce SslHandler benchmarks which can also account for wrapping/unwrapping of multiple TLS packets

Result:
SslHandler and SslEngine benchmarks are more comprehensive.
2017-03-06 08:42:39 -08:00
Scott Mitchell
f9001b9fc0 HTTP/2 move internal HPACK classes to the http2 package
Motivation:
The internal.hpack classes are no longer exposed in our public APIs and can be made package private in the http2 package.

Modifications:
- Make the hpack classes package private in the http2 package

Result:
Less APIs exposed as public.
2017-03-02 07:42:41 -08:00
Norman Maurer
461f9a1212 Allow to obtain informations of used direct and heap memory for ByteBufAllocator implementations
Motivation:

Often its useful for the user to be able to get some stats about the memory allocated via an allocator.

Modifications:

- Allow to obtain the used heap and direct memory for an allocator
- Add test case

Result:

Fixes [#6341]
2017-03-01 18:53:43 +01:00
Norman Maurer
90a61046c7 Add benchmarks for UnpooledUnsafeNoCleanerDirectByteBuf vs UnpooledUnsafeDirectByteBuf
Motivation:

Issue [#6349] brought up the idea to not use UnpooledUnsafeNoCleanerDirectByteBuf by default. To decide what to do a benchmark is needed.

Modifications:

Add benchmarks for UnpooledUnsafeNoCleanerDirectByteBuf vs UnpooledUnsafeDirectB
yteBuf

Result:

Better idea about impact of using UnpooledUnsafeNoCleanerDirectByteBuf.
2017-02-27 20:04:09 +01:00
Norman Maurer
d73477c7bd Add benchmarks for SSLEngine implementations
Motivation:

As we provide our own SSLEngine implementation we should have benchmarks to compare it against JDK impl.

Modifications:

Add benchmarks for wrap / unwrap and handshake performance.

Result:

Benchmarks FTW.
2017-02-24 08:02:10 +01:00
Norman Maurer
a80d3411ee Move all the microbenchmark code into one directory.
Motivation:

Allmost all our benchmarks are in src/main/java but a few are in src/test/java. We should make it consistent.

Modifications:

Move everything to src/main/java

Result:

Consistent code base.
2017-02-23 19:59:09 +01:00
Nikolay Fedorovskikh
0623c6c533 Fix javadoc issues
Motivation:

Invalid javadoc in project

Modifications:

Fix it

Result:

More correct javadoc
2017-02-22 07:31:07 +01:00
Nikolay Fedorovskikh
634a8afa53 Fix some warnings at generics usage
Motivation:

Existing warnings from java compiler

Modifications:

Add/fix type parameters

Result:

Less warnings
2017-02-22 07:29:59 +01:00
Kiril Menshikov
66b9be3a46 Allow to allign allocated Buffers
Motivation:

64-byte alignment is recommended by the Intel performance guide (https://software.intel.com/en-us/articles/practical-intel-avx-optimization-on-2nd-generation-intel-core-processors) for data-structures over 64 bytes.
Requiring padding to a multiple of 64 bytes allows for using SIMD instructions consistently in loops without additional conditional checks. This should allow for simpler and more efficient code.

Modification:

At the moment cache alignment must be setup manually. But probably it might be taken from the system. The original code was introduced by @normanmaurer https://github.com/netty/netty/pull/4726/files

Result:

Buffer alignment works better than miss-align cache.
2017-02-06 07:58:29 +01:00
Scott Mitchell
3482651e0c HTTP/2 Non Active Stream RFC Corrections
Motivation:
codec-http2 couples the dependency tree state with the remainder of the stream state (Http2Stream). This makes implementing constraints where stream state and dependency tree state diverge in the RFC challenging. For example the RFC recommends retaining dependency tree state after a stream transitions to closed [1]. Dependency tree state can be exchanged on streams in IDLE. In practice clients may use stream IDs for the purpose of establishing QoS classes and therefore retaining this dependency tree state can be important to client perceived performance. It is difficult to limit the total amount of state we retain when stream state and dependency tree state is combined.

Modifications:
- Remove dependency tree, priority, and weight related items from public facing Http2Connection and Http2Stream APIs. This information is optional to track and depends on the flow controller implementation.
- Move all dependency tree, priority, and weight related code from DefaultHttp2Connection to WeightedFairQueueByteDistributor. This is currently the only place which cares about priority. We can pull out the dependency tree related code in the future if it is generally useful to expose for other implementations.
- DefaultHttp2Connection should explicitly limit the number of reserved streams now that IDLE streams are no longer created.

Result:
More compliant with the HTTP/2 RFC.
Fixes https://github.com/netty/netty/issues/6206.

[1] https://tools.ietf.org/html/rfc7540#section-5.3.4
2017-02-01 10:34:27 -08:00
Scott Mitchell
e13da218e9 HTTP/2 revert Http2FrameWriter throws API change
Motivation:
2fd42cfc6b fixed a bug related to encoding headers but it also introduced a throws statement onto the Http2FrameWriter methods which write headers. This throws statement makes the API more verbose and is not necessary because we can communicate the failure in the ChannelFuture that is returned by these methods.

Modifications:
- Remove throws from all Http2FrameWriter methods.

Result:
Http2FrameWriter APIs do not propagate checked exceptions.
2017-01-26 23:26:17 -08:00
Tim Brooks
3344cd21ac Wrap operations requiring SocketPermission with doPrivileged blocks
Motivation:

Currently Netty does not wrap socket connect, bind, or accept
operations in doPrivileged blocks. Nor does it wrap cases where a dns
lookup might happen.

This prevents an application utilizing the SecurityManager from
isolating SocketPermissions to Netty.

Modifications:

I have introduced a class (SocketUtils) that wraps operations
requiring SocketPermissions in doPrivileged blocks.

Result:

A user of Netty can grant SocketPermissions explicitly to the Netty
jar, without granting it to the rest of their application.
2017-01-19 21:12:52 +01:00
Scott Mitchell
2fd42cfc6b HTTP/2 Max Header List Size Bug
Motivation:
If the HPACK Decoder detects that SETTINGS_MAX_HEADER_LIST_SIZE has been violated it aborts immediately and sends a RST_STREAM frame for what ever stream caused the issue. Because HPACK is stateful this means that the HPACK state may become out of sync between peers, and the issue won't be detected until the next headers frame. We should make a best effort to keep processing to keep the HPACK state in sync with our peer, or completely close the connection.
If the HPACK Encoder is configured to verify SETTINGS_MAX_HEADER_LIST_SIZE it checks the limit and encodes at the same time. This may result in modifying the HPACK local state but not sending the headers to the peer if SETTINGS_MAX_HEADER_LIST_SIZE is violated. This will also lead to an inconsistency in HPACK state that will be flagged at some later time.

Modifications:
- HPACK Decoder now has 2 levels of limits related to SETTINGS_MAX_HEADER_LIST_SIZE. The first will attempt to keep processing data and send a RST_STREAM after all data is processed. The second will send a GO_AWAY and close the entire connection.
- When the HPACK Encoder enforces SETTINGS_MAX_HEADER_LIST_SIZE it should not modify the HPACK state until the size has been checked.
- https://tools.ietf.org/html/rfc7540#section-6.5.2 states that the initial value of SETTINGS_MAX_HEADER_LIST_SIZE is "unlimited". We currently use 8k as a limit. We should honor the specifications default value so we don't unintentionally close a connection before the remote peer is aware of the local settings.
- Remove unnecessary object allocation in DefaultHttp2HeadersDecoder and DefaultHttp2HeadersEncoder.

Result:
Fixes https://github.com/netty/netty/issues/6209.
2017-01-19 10:42:43 -08:00
Scott Mitchell
b701da8d1c HTTP/2 HPACK Integer Encoding Bugs
Motivation:
- Decoder#decodeULE128 has a bounds bug and cannot decode Integer.MAX_VALUE
- Decoder#decodeULE128 doesn't support values greater than can be represented with Java's int data type. This is a problem because there are cases that require at least unsigned 32 bits (max header table size).
- Decoder#decodeULE128 treats overflowing the data type and invalid input the same. This can be misleading when inspecting the error that is thrown.
- Encoder#encodeInteger doesn't support values greater than can be represented with Java's int data type. This is a problem because there are cases that require at least unsigned 32 bits (max header table size).

Modifications:
- Correct the above issues and add unit tests.

Result:
Fixes https://github.com/netty/netty/issues/6210.
2017-01-18 18:36:47 -08:00
Scott Mitchell
06e7627b5f Read Only Http2Headers
Motivation:
A read only implementation of Http2Headers can allow for a more efficient usage of memory and more performant combined construction and iteration during serialization.

Modifications:
- Add a new ReadOnlyHttp2Headers class

Result:
ReadOnlyHttp2Headers exists and can be used for performance reasons when appropriate.

```
Benchmark                                            (headerCount)  Mode  Cnt    Score   Error  Units
ReadOnlyHttp2HeadersBenchmark.defaultClientHeaders               1  avgt   20   96.156 ± 1.902  ns/op
ReadOnlyHttp2HeadersBenchmark.defaultClientHeaders               5  avgt   20  157.925 ± 3.847  ns/op
ReadOnlyHttp2HeadersBenchmark.defaultClientHeaders              10  avgt   20  236.257 ± 2.663  ns/op
ReadOnlyHttp2HeadersBenchmark.defaultClientHeaders              20  avgt   20  392.861 ± 3.932  ns/op
ReadOnlyHttp2HeadersBenchmark.defaultServerHeaders               1  avgt   20   48.759 ± 0.466  ns/op
ReadOnlyHttp2HeadersBenchmark.defaultServerHeaders               5  avgt   20  113.122 ± 0.948  ns/op
ReadOnlyHttp2HeadersBenchmark.defaultServerHeaders              10  avgt   20  192.698 ± 1.936  ns/op
ReadOnlyHttp2HeadersBenchmark.defaultServerHeaders              20  avgt   20  348.974 ± 3.111  ns/op
ReadOnlyHttp2HeadersBenchmark.defaultTrailers                    1  avgt   20   35.694 ± 0.271  ns/op
ReadOnlyHttp2HeadersBenchmark.defaultTrailers                    5  avgt   20   98.993 ± 2.933  ns/op
ReadOnlyHttp2HeadersBenchmark.defaultTrailers                   10  avgt   20  171.035 ± 5.068  ns/op
ReadOnlyHttp2HeadersBenchmark.defaultTrailers                   20  avgt   20  330.621 ± 3.381  ns/op
ReadOnlyHttp2HeadersBenchmark.readOnlyClientHeaders              1  avgt   20   40.573 ± 0.474  ns/op
ReadOnlyHttp2HeadersBenchmark.readOnlyClientHeaders              5  avgt   20   56.516 ± 0.660  ns/op
ReadOnlyHttp2HeadersBenchmark.readOnlyClientHeaders             10  avgt   20   76.890 ± 0.776  ns/op
ReadOnlyHttp2HeadersBenchmark.readOnlyClientHeaders             20  avgt   20  117.531 ± 1.393  ns/op
ReadOnlyHttp2HeadersBenchmark.readOnlyServerHeaders              1  avgt   20   29.206 ± 0.264  ns/op
ReadOnlyHttp2HeadersBenchmark.readOnlyServerHeaders              5  avgt   20   44.587 ± 0.312  ns/op
ReadOnlyHttp2HeadersBenchmark.readOnlyServerHeaders             10  avgt   20   64.458 ± 1.169  ns/op
ReadOnlyHttp2HeadersBenchmark.readOnlyServerHeaders             20  avgt   20  107.179 ± 0.881  ns/op
ReadOnlyHttp2HeadersBenchmark.readOnlyTrailers                   1  avgt   20   21.563 ± 0.202  ns/op
ReadOnlyHttp2HeadersBenchmark.readOnlyTrailers                   5  avgt   20   41.019 ± 0.440  ns/op
ReadOnlyHttp2HeadersBenchmark.readOnlyTrailers                  10  avgt   20   64.053 ± 0.785  ns/op
ReadOnlyHttp2HeadersBenchmark.readOnlyTrailers                  20  avgt   20  113.737 ± 4.433  ns/op
```
2016-12-18 09:32:24 -08:00
Stephane Landelle
f755e58463 Clean up following #6016
Motivation:

* DefaultHeaders from netty-codec has some duplicated logic for header date parsing
* Several classes keep on using deprecated HttpHeaderDateFormat

Modifications:

* Move HttpHeaderDateFormatter to netty-codec and rename it into HeaderDateFormatter
* Make DefaultHeaders use HeaderDateFormatter
* Replace HttpHeaderDateFormat usage with HeaderDateFormatter

Result:

Faster and more consistent code
2016-11-21 12:35:40 -08:00
Stephane Landelle
edc4842309 Fix cookie date parsing, close #6016
Motivation:
* RFC6265 defines its own parser which is different from RFC1123 (it accepts RFC1123 format but also other ones). Basically, it's very lax on delimiters, ignores day of week and timezone. Currently, ClientCookieDecoder uses HttpHeaderDateFormat underneath, and can't parse valid cookies such as Github ones whose expires attribute looks like "Sun, 27 Nov 2016 19:37:15 -0000"
* ServerSideCookieEncoder currently uses HttpHeaderDateFormat underneath for formatting expires field, and it's slow.

Modifications:
* Introduce HttpHeaderDateFormatter that correctly implement RFC6265
* Use HttpHeaderDateFormatter in ClientCookieDecoder and ServerCookieEncoder
* Deprecate HttpHeaderDateFormat

Result:
* Proper RFC6265 dates support
* Faster ServerCookieEncoder and ClientCookieDecoder
* Faster tool for handling headers such as "Expires" and "Date"
2016-11-18 11:22:21 +00:00
buchgr
c9918de37b http2: Make MAX_HEADER_LIST_SIZE exceeded a stream error when encoding.
Motivation:

The SETTINGS_MAX_HEADER_LIST_SIZE limit, as enforced by the HPACK Encoder, should be a stream error and not apply to the whole connection.

Modifications:

Made the necessary changes for the exception to be of type StreamException.

Result:

A HEADERS frame exceeding the limit, only affects a specific stream.
2016-10-17 09:24:06 -07:00
Scott Mitchell
540c26bb56 HTTP/2 Ensure default settings are correctly enforced and interfaces clarified
Motivation:
The responsibility for retaining the settings values and enforcing the settings constraints is spread out in different areas of the code and may be initialized with different values than the default specified in the RFC. This should not be allowed by default and interfaces which are responsible for maintaining/enforcing settings state should clearly indicate the restrictions that they should only be set by the codec upon receipt of a SETTINGS ACK frame.

Modifications:
- Encoder, Decoder, and the Headers Encoder/Decoder no longer expose public constructors that allow the default settings to be changed.
- Http2HeadersDecoder#maxHeaderSize() exists to provide some bound when headers/continuation frames are being aggregated. However this is roughly the same as SETTINGS_MAX_HEADER_LIST_SIZE (besides the 32 byte octet for each header field) and can be used instead of attempting to keep the two independent values in sync.
- Encoding headers now enforces SETTINGS_MAX_HEADER_LIST_SIZE at the octect level. Previously the header encoder compared the number of header key/value pairs against SETTINGS_MAX_HEADER_LIST_SIZE instead of the number of octets (plus 32 bytes overhead).
- DefaultHttp2ConnectionDecoder#onData calls shouldIgnoreHeadersOrDataFrame but may swallow exceptions from this method. This means a STREAM_RST frame may not be sent when it should for an unknown stream and thus violate the RFC. The exception is no longer swallowed.

Result:
Default settings state is enforced and interfaces related to settings state are clarified.
2016-10-07 13:00:45 -07:00
radai-rosenblatt
15ac6c4a1f Clean-up unused imports
Motivation:

the build doesnt seem to enforce this, so they piled up

Modifications:

removed unused import lines

Result:

less unused imports

Signed-off-by: radai-rosenblatt <radai.rosenblatt@gmail.com>
2016-09-30 09:08:50 +02:00
buchgr
67d3a78123 Reduce bytecode size of PlatformDependent0.equals.
Motivation:

PP0.equals has a bytecode size of 476. This is above the default inlining threshold of OpenJDK.

Modifications:

Slightly change the method to reduce the bytecode size by > 50% to 212 bytes.

Result:

The bytecode size is dramatically reduced, making the method a candidate for inlining.
The relevant code in our application (gRPC) that relies heavily on equals comparisons,
runs some ~10% faster. The Netty JMH benchmark shows no performance regression.

Current 4.1:

PlatformDependentBenchmark.unsafeBytesEqual      10  avgt   20     7.836 ±   0.113  ns/op
PlatformDependentBenchmark.unsafeBytesEqual      50  avgt   20    16.889 ±   4.284  ns/op
PlatformDependentBenchmark.unsafeBytesEqual     100  avgt   20    15.601 ±   0.296  ns/op
PlatformDependentBenchmark.unsafeBytesEqual    1000  avgt   20    95.885 ±   1.992  ns/op
PlatformDependentBenchmark.unsafeBytesEqual   10000  avgt   20   824.429 ±  12.792  ns/op
PlatformDependentBenchmark.unsafeBytesEqual  100000  avgt   20  8907.035 ± 177.844  ns/op

With this change:

PlatformDependentBenchmark.unsafeBytesEqual      10  avgt   20      5.616 ±   0.102  ns/op
PlatformDependentBenchmark.unsafeBytesEqual      50  avgt   20     17.896 ±   0.373  ns/op
PlatformDependentBenchmark.unsafeBytesEqual     100  avgt   20     14.952 ±   0.210  ns/op
PlatformDependentBenchmark.unsafeBytesEqual    1000  avgt   20     94.799 ±   1.604  ns/op
PlatformDependentBenchmark.unsafeBytesEqual   10000  avgt   20    834.996 ±  17.484  ns/op
PlatformDependentBenchmark.unsafeBytesEqual  100000  avgt   20   8757.421 ± 187.555  ns/op
2016-09-09 07:57:41 +02:00