Commit Graph

7 Commits

Author SHA1 Message Date
Nick Hill
9f2221ebd4 CompositeByteBuf optimizations and new addFlattenedComponents method (#8939)
Motivation:

The CompositeByteBuf discardReadBytes / discardReadComponents methods are currently quite inefficient, including when there are no read components to discard. We would like to call the latter more frequently in ByteToMessageDecoder#COMPOSITE_CUMULATOR.

In the same context it would be beneficial to perform a "shallow copy" of a composite buffer (for example when it has a refcount > 1) to avoid having to allocate and copy the contained bytes just to obtain an "independent" cumulation.

Modifications:

- Optimize discardReadBytes() and discardReadComponents() implementations (start at first comp rather than performing a binary search for the readerIndex).
- New addFlattenedComponents(boolean,ByteBuf) method which performs a shallow copy if the provided buffer is also composite and avoids adding any empty buffers, plus unit test.
- Other minor optimizations to avoid unnecessary checks.

Results:

discardReadXX methods are faster, composite buffers can be easily appended without deepening the buffer "tree" or retaining unused components.
2019-04-08 20:48:08 +02:00
Nick Hill
0811409ca3 Further reduce ensureAccessible() overhead (#8895)
Motivation:

This PR fixes some non-negligible overhead discovered in the ByteBuf
accessibility (non-zero refcount) checking. The cause turned out to be
mostly twofold:
- Unnecessary operations used to calculate the refcount from the "raw"
encoded int field value
- Call stack depths exceeding the default limit for inlining, in some
places (CompositeByteBuf in particular)

It's a follow-on from #8882 which uses the maxCapacity field for a
simpler non-negative check. The performance gap between these two
variants appears to be _mostly_ closed, but there's one exception which
may warrant further analysis.

Modifications:

- Replace ABB.internalRefCount() with ByteBuf.isAccessible(), the
default still checks for non-zero refCnt()
- Just test for parity of raw refCnt instead of converting to "real",
with fast-path for specific small values
- Make sure isAccessible() is delegated by derived/wrapper ByteBufs
- Use existing freed flag in CompositeByteBuf for faster isAccessible()
- Manually inline some calls in methods like CompositeByteBuf.setLong()
and AbstractReferenceCountedByteBuf.isAccessible() to reduce stack
depths (to ensure default inlining limit isn't hit)
- Add ByteBufAccessBenchmark which is an extension of
UnsafeByteBufBenchmark (maybe latter could now be removed)

Results:

Before:

Benchmark   (bufferType)  (checkAccessible)  (checkBounds)   Mode  Cnt
Score          Error  Units
readBatch         UNSAFE               true           true  thrpt   30
84524972.863 ±   518338.811  ops/s
readBatch   UNSAFE_SLICE               true           true  thrpt   30
38608795.037 ±   298176.974  ops/s
readBatch           HEAP               true           true  thrpt   30
80003697.649 ±   974674.119  ops/s
readBatch      COMPOSITE               true           true  thrpt   30
18495554.788 ±   108075.023  ops/s
setGetLong        UNSAFE               true           true  thrpt   30
247069881.578 ± 10839162.593  ops/s
setGetLong  UNSAFE_SLICE               true           true  thrpt   30
196355905.206 ±  1802420.990  ops/s
setGetLong          HEAP               true           true  thrpt   30
245686644.713 ± 11769311.527  ops/s
setGetLong     COMPOSITE               true           true  thrpt   30
83170940.687 ±   657524.123  ops/s
setLong           UNSAFE               true           true  thrpt   30
278940253.918 ±  1807265.259  ops/s
setLong     UNSAFE_SLICE               true           true  thrpt   30
202556738.764 ± 11887973.563  ops/s
setLong             HEAP               true           true  thrpt   30
280045958.053 ±  2719583.400  ops/s
setLong        COMPOSITE               true           true  thrpt   30
121299806.002 ±  2155084.707  ops/s


After:

Benchmark   (bufferType)  (checkAccessible)  (checkBounds)   Mode  Cnt
Score          Error  Units
readBatch         UNSAFE               true           true  thrpt   30
101641801.035 ±  3950050.059  ops/s
readBatch   UNSAFE_SLICE               true           true  thrpt   30
84395902.846 ±  4339579.057  ops/s
readBatch           HEAP               true           true  thrpt   30
100179060.207 ±  3222487.287  ops/s
readBatch      COMPOSITE               true           true  thrpt   30
42288494.472 ±   294919.633  ops/s
setGetLong        UNSAFE               true           true  thrpt   30
304530755.027 ±  6574163.899  ops/s
setGetLong  UNSAFE_SLICE               true           true  thrpt   30
212028547.645 ± 14277828.768  ops/s
setGetLong          HEAP               true           true  thrpt   30
309335422.609 ±  2272150.415  ops/s
setGetLong     COMPOSITE               true           true  thrpt   30
160383609.236 ±   966484.033  ops/s
setLong           UNSAFE               true           true  thrpt   30
298055969.747 ±  7437449.627  ops/s
setLong     UNSAFE_SLICE               true           true  thrpt   30
223784178.650 ±  9869750.095  ops/s
setLong             HEAP               true           true  thrpt   30
302543263.328 ±  8140104.706  ops/s
setLong        COMPOSITE               true           true  thrpt   30
157083673.285 ±  3528779.522  ops/s

There's also a similar knock-on improvement to other benchmarks (e.g.
HPACK encoding/decoding) as shown in #8882.

For sanity I did a final comparison of the "fast path" tweak using one
of the HPACK benchmarks:

(rawCnt & 1) == 0:

Benchmark                     (limitToAscii)  (sensitive)  (size)   Mode
Cnt      Score     Error  Units
HpackDecoderBenchmark.decode            true         true  MEDIUM  thrpt
30  50914.479 ± 940.114  ops/s


rawCnt == 2 || rawCnt == 4 || rawCnt == 6 || rawCnt == 8 ||  (rawCnt &
1) == 0:

Benchmark                     (limitToAscii)  (sensitive)  (size)   Mode
Cnt      Score      Error  Units
HpackDecoderBenchmark.decode            true         true  MEDIUM  thrpt
30  60036.425 ± 1478.196  ops/s
2019-02-28 20:40:41 +01:00
Norman Maurer
e542a2cf26
Use a non-volatile read for ensureAccessible() whenever possible to reduce overhead and allow better inlining. (#8266)
Motiviation:

At the moment whenever ensureAccessible() is called in our ByteBuf implementations (which is basically on each operation) we will do a volatile read. That per-se is not such a bad thing but the problem here is that it will also reduce the the optimizations that the compiler / jit can do. For example as these are volatile it can not eliminate multiple loads of it when inline the methods of ByteBuf which happens quite frequently because most of them a quite small and very hot. That is especially true for all the methods that act on primitives.

It gets even worse as people often call a lot of these after each other in the same method or even use method chaining here.

The idea of the change is basically just ue a non-volatile read for the ensureAccessible() check as its a best-effort implementation to detect acting on already released buffers anyway as even with a volatile read it could happen that the user will release it in another thread before we actual access the buffer after the reference check.

Modifications:

- Try to do a non-volatile read using sun.misc.Unsafe if we can use it.
- Add a benchmark

Result:

Big performance win when multiple ByteBuf methods are called from a method.

With the change:
UnsafeByteBufBenchmark.setGetLongUnsafeByteBuf  thrpt   20  281395842,128 ± 5050792,296  ops/s

Before the change:
UnsafeByteBufBenchmark.setGetLongUnsafeByteBuf  thrpt   20  217419832,801 ± 5080579,030  ops/s
2018-09-07 07:47:02 +02:00
Norman Maurer
7b25402e80 Add CompositeByteBuf.addComponent(boolean ...) method to simplify usage
Motivation:

At the moment the user is responsible to increase the writer index of the composite buffer when a new component is added. We should add some methods that handle this for the user as this is the most popular usage of the composite buffer.

Modifications:

Add new methods that autoamtically increase the writerIndex when buffers are added.

Result:

Easier usage of CompositeByteBuf.
2016-05-21 19:52:16 +02:00
Norman Maurer
2a14f74979 Ensure all methods are correctly override in *LeakAware*ByteBuf implementations
Motivation:

We missed to override a few methods and so some actions on the ByteBuf failed.

Modifications:

- Override all methods
- Add unit tests to ensure all is fixed.

Result:

All *LeakAware*ByteBuf have correct implementations
2016-05-21 07:28:13 +02:00
Trustin Lee
3a9f472161 Make retained derived buffers recyclable
Related: #4333 #4421 #5128

Motivation:

slice(), duplicate() and readSlice() currently create a non-recyclable
derived buffer instance. Under heavy load, an application that creates a
lot of derived buffers can put the garbage collector under pressure.

Modifications:

- Add the following methods which creates a non-recyclable derived buffer
  - retainedSlice()
  - retainedDuplicate()
  - readRetainedSlice()
- Add the new recyclable derived buffer implementations, which has its
  own reference count value
- Add ByteBufHolder.retainedDuplicate()
- Add ByteBufHolder.replace(ByteBuf) so that..
  - a user can replace the content of the holder in a consistent way
  - copy/duplicate/retainedDuplicate() can delegate the holder
    construction to replace(ByteBuf)
- Use retainedDuplicate() and retainedSlice() wherever possible
- Miscellaneous:
  - Rename DuplicateByteBufTest to DuplicatedByteBufTest (missing 'D')
  - Make ReplayingDecoderByteBuf.reject() return an exception instead of
    throwing it so that its callers don't need to add dummy return
    statement

Result:

Derived buffers are now recycled when created via retainedSlice() and
retainedDuplicate() and derived from a pooled buffer
2016-05-17 11:16:13 +02:00
Norman Maurer
cdb70d31ee [#4017] Implement proper resource leak detection for CompositeByteBuf
Motivation:

CompositeByteBuf only implemented simple resource leak detection and how it was implemented was completly different to the way it was for ByteBuf. The other problem was that slice(), duplicate() and others would not return a resource leak enabled buffer.

Modifications:

- Proper implementation for all level of resource leak detection for CompositeByteBuf

Result:

Proper resource leak detection for CompositeByteBuf.
2016-01-18 10:14:39 +01:00