Commit Graph

862 Commits

Author SHA1 Message Date
Nick Hill
d29f91540d Fix possible unsafe sharing of internal NIO buffer in CompositeByteBuf (#9169)
Motivation

A small thread-safety bug was introduced during the internal
optimizations of ComponentByteBuf made a while back in #8437. When there
is a single component which was added as a slice,
internalNioBuffer(int,int) will currently return the unwrapped slice's
un-duplicated internal NIO buffer. This is not safe since it could be
modified concurrently with other usage of that parent buffer.

Modifications

Delegate internalNioBuffer to nioBuffer in this case, which returns a
duplicate. This matches what's done in derived buffers in general
(for the same reason). Add unit test.

Result

Fixed possible thread-safety bug
2019-05-22 11:07:27 +02:00
Tim Brooks
e4ef8f6ff3 Prefer direct io buffers if direct buffers pooled (#9167)
Motivation

Direct buffers are normally preferred when interfacing with raw
sockets. Currently netty will only return direct io buffers (for reading
from a channel) when a platform has unsafe. However, this is
inconsistent with the write-side (filterOutboundMessage) where a direct
byte buffer will be returned if pooling is enabled. This means that
environments without unsafe (and no manual netty configurations) end up
with many pooled heap byte buffers for reading, many pooled direct byte
buffers for writing, and jdk pooled byte buffers (for reading).

Modifications

This commit modifies the AbstractByteBufAllocator to return a direct
byte buffer for io handling when the platform has unsafe or direct byte
buffers are pooled.

Result:

Use direct buffers when direct buffers are pooled for IO.
2019-05-22 07:33:06 +02:00
Nick Hill
5e786e549e Fix incorrect behavior of ReadOnlyByteBufferBuf.getBytes(int,ByteBuffer) (#9125)
* Fix incorrect behavior of ReadOnlyByteBufferBuf.getBytes(int,ByteBuffer)

Motivation

It currently will succeed when the destination is larger than the source
range, but the ByteBuf javadoc states this should be a failure, as is
the case with all the other implementations.

Modifications

- Fix logic to fail the bounds check in this case
- Remove explicit null check which isn't done in any equivalent method
- Add unit test

Result

More correct/consistent behaviour
2019-05-13 07:03:10 +02:00
Norman Maurer
7e6d1bbcfd Fix regression in CompositeByteBuf.discard*ReadBytes() (#9068)
Motivation:

1f93bd3 introduced a regression that could lead to not have the lastAccessed field correctly null'ed out when the endOffset of the internal Component == CompositeByteBuf.readerIndex()

Modifications:

- Correctly null out the lastAccessed field in any case
- Add unit tests

Result:

Fixes regression in CompositeByteBuf.discard*ReadBytes()
2019-04-17 18:03:29 +02:00
Nick Hill
e75b7edaa3 Centralize internal reference counting logic (#8614)
Motivation

AbstractReferenceCounted and AbstractReferenceCountedByteBuf contain
duplicate logic for managing the volatile refcount in an optimized and
consistent manner, which increased in complexity in #8583. It's possible
to extract this into a common helper class now that all access is via an
AtomicIntegerFieldUpdater.

Modifications

- Move duplicate logic into a shared ReferenceCountUpdater class
- Incorporate some additional simplification for the most common single
increment/decrement cases (fewer checks/operations)

Result

Less code duplication, better encapsulation of the "non-trivial"
internal volatile refcount manipulation
2019-04-09 16:23:15 +02:00
Nick Hill
1f93bd36b6 CompositeByteBuf optimizations and new addFlattenedComponents method (#8939)
Motivation:

The CompositeByteBuf discardReadBytes / discardReadComponents methods are currently quite inefficient, including when there are no read components to discard. We would like to call the latter more frequently in ByteToMessageDecoder#COMPOSITE_CUMULATOR.

In the same context it would be beneficial to perform a "shallow copy" of a composite buffer (for example when it has a refcount > 1) to avoid having to allocate and copy the contained bytes just to obtain an "independent" cumulation.

Modifications:

- Optimize discardReadBytes() and discardReadComponents() implementations (start at first comp rather than performing a binary search for the readerIndex).
- New addFlattenedComponents(boolean,ByteBuf) method which performs a shallow copy if the provided buffer is also composite and avoids adding any empty buffers, plus unit test.
- Other minor optimizations to avoid unnecessary checks.

Results:

discardReadXX methods are faster, composite buffers can be easily appended without deepening the buffer "tree" or retaining unused components.
2019-04-08 20:55:19 +02:00
Norman Maurer
1eabb752c0 Fix NPE that was encounter by debugger (will never happen in real code). (#8992)
Motivation:

We synchronize on the chunk.arena when produce the String returned by PoolSubpage.toString() which may raise a NPE when chunk == null. Chunk == null for the head of the linked-list and so a NPE may raised by a debugger. This NPE can never happen in real code tho as we never access toString() of the head.

Modifications:

Add null checks and so fix the possible NPE

Result:

No NPE when using a debugger and inspect the PooledByteBufAllocator.
2019-04-01 19:44:45 +02:00
Norman Maurer
78c4abbfb7
Change default of io.netty.allocator.useCacheForAllThreads to false (#8991)
Motivation:

We currently use a thread local cache for all threads which often is suprising to users as it may result in a lot of memory usage if they allocate buffers from outside the EventLoop in different threads. We should better not do this by default to keep suprises to a minimum. Users that need the performance and know what they are doing can still change this.

Modifications:

Change io.netty.allocator.useCacheForAllThreads to false by default

Result:

Related to https://github.com/netty/netty/issues/8536.
2019-04-01 19:43:19 +02:00
Norman Maurer
ab7accb2de Don't try to put back MemoryRegionCache.Entry objects into the Recycler when recycled because of a finalizer. (#8955)
Motivation:

In MemoryRegionCache.Entry we use the Recycler to reduce GC pressure and churn. The problem is that these will also be recycled when the PoolThreadCache is collected and finalize() is called. This then can have the effect that we try to load class but the WebApp is already stoped.

This will produce an stacktrace like this on Tomcat:

```
19-Mar-2019 15:53:21.351 INFO [Finalizer] org.apache.catalina.loader.WebappClassLoaderBase.checkStateForResourceLoading Illegal access: this web application instance has been stopped already. Could not load [java.util.WeakHashMap]. The following stack trace is thrown for debugging purposes as well as to attempt to terminate the thread which caused the illegal access.
 java.lang.IllegalStateException: Illegal access: this web application instance has been stopped already. Could not load [java.util.WeakHashMap]. The following stack trace is thrown for debugging purposes as well as to attempt to terminate the thread which caused the illegal access.
	at org.apache.catalina.loader.WebappClassLoaderBase.checkStateForResourceLoading(WebappClassLoaderBase.java:1383)
	at org.apache.catalina.loader.WebappClassLoaderBase.checkStateForClassLoading(WebappClassLoaderBase.java:1371)
	at org.apache.catalina.loader.WebappClassLoaderBase.loadClass(WebappClassLoaderBase.java:1224)
	at org.apache.catalina.loader.WebappClassLoaderBase.loadClass(WebappClassLoaderBase.java:1186)
	at io.netty.util.Recycler$3.initialValue(Recycler.java:233)
	at io.netty.util.Recycler$3.initialValue(Recycler.java:230)
	at io.netty.util.concurrent.FastThreadLocal.initialize(FastThreadLocal.java:188)
	at io.netty.util.concurrent.FastThreadLocal.get(FastThreadLocal.java:142)
	at io.netty.util.Recycler$Stack.pushLater(Recycler.java:624)
	at io.netty.util.Recycler$Stack.push(Recycler.java:597)
	at io.netty.util.Recycler$DefaultHandle.recycle(Recycler.java:225)
	at io.netty.buffer.PoolThreadCache$MemoryRegionCache$Entry.recycle(PoolThreadCache.java:478)
	at io.netty.buffer.PoolThreadCache$MemoryRegionCache.freeEntry(PoolThreadCache.java:459)
	at io.netty.buffer.PoolThreadCache$MemoryRegionCache.free(PoolThreadCache.java:430)
	at io.netty.buffer.PoolThreadCache$MemoryRegionCache.free(PoolThreadCache.java:422)
	at io.netty.buffer.PoolThreadCache.free(PoolThreadCache.java:279)
	at io.netty.buffer.PoolThreadCache.free(PoolThreadCache.java:270)
	at io.netty.buffer.PoolThreadCache.free(PoolThreadCache.java:241)
	at io.netty.buffer.PoolThreadCache.finalize(PoolThreadCache.java:230)
	at java.lang.System$2.invokeFinalize(System.java:1270)
	at java.lang.ref.Finalizer.runFinalizer(Finalizer.java:102)
	at java.lang.ref.Finalizer.access$100(Finalizer.java:34)
	at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:217)
```

Beside this we also need to ensure we not try to lazy load SizeClass when the finalizer is used as it may not be present anymore if the ClassLoader is already destroyed.

This would produce an error like:

```
20-Mar-2019 11:26:35.254 INFO [Finalizer] org.apache.catalina.loader.WebappClassLoaderBase.checkStateForResourceLoading Illegal access: this web application instance has been stopped already. Could not load [io.netty.buffer.PoolArena$1]. The following stack trace is thrown for debugging purposes as well as to attempt to terminate the thread which caused the illegal access.
 java.lang.IllegalStateException: Illegal access: this web application instance has been stopped already. Could not load [io.netty.buffer.PoolArena$1]. The following stack trace is thrown for debugging purposes as well as to attempt to terminate the thread which caused the illegal access.
	at org.apache.catalina.loader.WebappClassLoaderBase.checkStateForResourceLoading(WebappClassLoaderBase.java:1383)
	at org.apache.catalina.loader.WebappClassLoaderBase.checkStateForClassLoading(WebappClassLoaderBase.java:1371)
	at org.apache.catalina.loader.WebappClassLoaderBase.loadClass(WebappClassLoaderBase.java:1224)
	at org.apache.catalina.loader.WebappClassLoaderBase.loadClass(WebappClassLoaderBase.java:1186)
	at io.netty.buffer.PoolArena.freeChunk(PoolArena.java:287)
	at io.netty.buffer.PoolThreadCache$MemoryRegionCache.freeEntry(PoolThreadCache.java:464)
	at io.netty.buffer.PoolThreadCache$MemoryRegionCache.free(PoolThreadCache.java:429)
	at io.netty.buffer.PoolThreadCache$MemoryRegionCache.free(PoolThreadCache.java:421)
	at io.netty.buffer.PoolThreadCache.free(PoolThreadCache.java:278)
	at io.netty.buffer.PoolThreadCache.free(PoolThreadCache.java:269)
	at io.netty.buffer.PoolThreadCache.free(PoolThreadCache.java:240)
	at io.netty.buffer.PoolThreadCache.finalize(PoolThreadCache.java:229)
	at java.lang.System$2.invokeFinalize(System.java:1270)
	at java.lang.ref.Finalizer.runFinalizer(Finalizer.java:102)
	at java.lang.ref.Finalizer.access$100(Finalizer.java:34)
	at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:217)
```

Modifications:

- Only try to put the Entry back into the Recycler if the PoolThredCache is not destroyed because of the finalizer.
- Only try to access SizeClass if not triggered by finalizer.

Result:

No IllegalStateException anymoe when a webapp is reloaded in Tomcat that uses netty and uses the PooledByteBufAllocator.
2019-03-22 12:16:53 +01:00
Nick Hill
a71e33ae35 Fix possible ByteBuf leak when CompositeByteBuf is resized (#8946)
Motivation:

The special case fixed in #8497 also requires that we keep a derived slice when trimming components in place, as done by the capacity(int) and discardReadBytes() methods.

Modifications:

Ensure that we keep a ref to trimmed components' original retained slice in capacity(int) and discardReadBytes() methods, so that it is released properly when the they are later freed. Add unit test which fails prior to the fix.

Result:

Edge case leak is eliminated.
2019-03-22 11:18:28 +01:00
Norman Maurer
9681842d54 Allow to automatically trim the PoolThreadCache in a timely interval (#8941)
Motivation:

PooledByteBufAllocator uses a PoolThreadCache per Thread that allocates / deallocates to minimize the performance overhead. This PoolThreadCache is trimmed after X allocations to free up buffers that are not allocated for a long time. This works out quite well when the app continues to allocate but fails if the app stops to allocate frequently (for whatever reason) and so a lot of memory is wasted and not given back to the arena / freed.

Modifications:

- Add a ThreadExecutorMap that offers multiple methods that wrap Runnable / ThreadFactory / Executor and allow to call ThreadExecutorMap.currentEventExecutor() to get the current executing EventExecutor for the calling Thread.
- Use these methods in the constructors of our EventExecutor implementations (which also covers the EventLoop implementations)
- Add io.netty.allocator.cacheTrimIntervalMillis system property which can be used to specify a fixed rate / interval on which we should try to trim the PoolThreadCache for a EventExecutor that allocates.
- Add PooledByteBufAllocator.trimCurrentThreadCache() to allow the user to trim the cache of the calling thread manually.
- Add testcases
- Introduce FastThreadLocal.getIfExists()

Result:

Allow to better / more frequently trim PoolThreadCache and so give back memory to the area / system.
2019-03-22 11:17:40 +01:00
Norman Maurer
795f9600ba Adjust tests to be able to build / test when using IBM J9 / OpenJ9 (#8900)
Motivation:

We should run a CI job using J9 to ensure netty also works when using different JVMs.

Modifications:

- Adjust PooledByteBufAllocatorTest to be able to complete faster when using a JVM which takes longer when joining Threads (this seems to be the case with J9).

Result:

Be able to run CI against J9.
2019-03-01 06:48:35 +01:00
Nick Hill
35161ad174 Further reduce ensureAccessible() overhead (#8895)
Motivation:

This PR fixes some non-negligible overhead discovered in the ByteBuf
accessibility (non-zero refcount) checking. The cause turned out to be
mostly twofold:
- Unnecessary operations used to calculate the refcount from the "raw"
encoded int field value
- Call stack depths exceeding the default limit for inlining, in some
places (CompositeByteBuf in particular)

It's a follow-on from #8882 which uses the maxCapacity field for a
simpler non-negative check. The performance gap between these two
variants appears to be _mostly_ closed, but there's one exception which
may warrant further analysis.

Modifications:

- Replace ABB.internalRefCount() with ByteBuf.isAccessible(), the
default still checks for non-zero refCnt()
- Just test for parity of raw refCnt instead of converting to "real",
with fast-path for specific small values
- Make sure isAccessible() is delegated by derived/wrapper ByteBufs
- Use existing freed flag in CompositeByteBuf for faster isAccessible()
- Manually inline some calls in methods like CompositeByteBuf.setLong()
and AbstractReferenceCountedByteBuf.isAccessible() to reduce stack
depths (to ensure default inlining limit isn't hit)
- Add ByteBufAccessBenchmark which is an extension of
UnsafeByteBufBenchmark (maybe latter could now be removed)

Results:

Before:

Benchmark   (bufferType)  (checkAccessible)  (checkBounds)   Mode  Cnt
Score          Error  Units
readBatch         UNSAFE               true           true  thrpt   30
84524972.863 ±   518338.811  ops/s
readBatch   UNSAFE_SLICE               true           true  thrpt   30
38608795.037 ±   298176.974  ops/s
readBatch           HEAP               true           true  thrpt   30
80003697.649 ±   974674.119  ops/s
readBatch      COMPOSITE               true           true  thrpt   30
18495554.788 ±   108075.023  ops/s
setGetLong        UNSAFE               true           true  thrpt   30
247069881.578 ± 10839162.593  ops/s
setGetLong  UNSAFE_SLICE               true           true  thrpt   30
196355905.206 ±  1802420.990  ops/s
setGetLong          HEAP               true           true  thrpt   30
245686644.713 ± 11769311.527  ops/s
setGetLong     COMPOSITE               true           true  thrpt   30
83170940.687 ±   657524.123  ops/s
setLong           UNSAFE               true           true  thrpt   30
278940253.918 ±  1807265.259  ops/s
setLong     UNSAFE_SLICE               true           true  thrpt   30
202556738.764 ± 11887973.563  ops/s
setLong             HEAP               true           true  thrpt   30
280045958.053 ±  2719583.400  ops/s
setLong        COMPOSITE               true           true  thrpt   30
121299806.002 ±  2155084.707  ops/s


After:

Benchmark   (bufferType)  (checkAccessible)  (checkBounds)   Mode  Cnt
Score          Error  Units
readBatch         UNSAFE               true           true  thrpt   30
101641801.035 ±  3950050.059  ops/s
readBatch   UNSAFE_SLICE               true           true  thrpt   30
84395902.846 ±  4339579.057  ops/s
readBatch           HEAP               true           true  thrpt   30
100179060.207 ±  3222487.287  ops/s
readBatch      COMPOSITE               true           true  thrpt   30
42288494.472 ±   294919.633  ops/s
setGetLong        UNSAFE               true           true  thrpt   30
304530755.027 ±  6574163.899  ops/s
setGetLong  UNSAFE_SLICE               true           true  thrpt   30
212028547.645 ± 14277828.768  ops/s
setGetLong          HEAP               true           true  thrpt   30
309335422.609 ±  2272150.415  ops/s
setGetLong     COMPOSITE               true           true  thrpt   30
160383609.236 ±   966484.033  ops/s
setLong           UNSAFE               true           true  thrpt   30
298055969.747 ±  7437449.627  ops/s
setLong     UNSAFE_SLICE               true           true  thrpt   30
223784178.650 ±  9869750.095  ops/s
setLong             HEAP               true           true  thrpt   30
302543263.328 ±  8140104.706  ops/s
setLong        COMPOSITE               true           true  thrpt   30
157083673.285 ±  3528779.522  ops/s

There's also a similar knock-on improvement to other benchmarks (e.g.
HPACK encoding/decoding) as shown in #8882.

For sanity I did a final comparison of the "fast path" tweak using one
of the HPACK benchmarks:

(rawCnt & 1) == 0:

Benchmark                     (limitToAscii)  (sensitive)  (size)   Mode
Cnt      Score     Error  Units
HpackDecoderBenchmark.decode            true         true  MEDIUM  thrpt
30  50914.479 ± 940.114  ops/s


rawCnt == 2 || rawCnt == 4 || rawCnt == 6 || rawCnt == 8 ||  (rawCnt &
1) == 0:

Benchmark                     (limitToAscii)  (sensitive)  (size)   Mode
Cnt      Score      Error  Units
HpackDecoderBenchmark.decode            true         true  MEDIUM  thrpt
30  60036.425 ± 1478.196  ops/s
2019-02-28 20:41:16 +01:00
Dmitriy Dumanskiy
116f72db8d Legacy properties removed (#8839)
Motivation:

We can remove some properties for which we introduced replacements.

Modifications:

io.netty.buffer.bytebuf.checkAccessible, io.netty.leakDetectionLevel, org.jboss.netty.tryUnsafe properties removed

Result:

Code cleanup
2019-02-04 13:56:15 +01:00
田欧
e8efcd82a8 migrate java8: use requireNonNull (#8840)
Motivation:

We can just use Objects.requireNonNull(...) as a replacement for ObjectUtil.checkNotNull(....)

Modifications:

- Use Objects.requireNonNull(...)

Result:

Less code to maintain.
2019-02-04 10:32:25 +01:00
Nick Hill
0ac8db0307 CompositeByteBuf tidy-up (#8784)
Motivation

There's some miscellaneous cleanup/simplification of CompositeByteBuf
which would help make the code a bit clearer.

Modifications

- Simplify web of constructors and addComponents methods, reducing
duplication of logic
- Rename `Component.freeIfNecessary()` method to just `free()`, which is
less confusing (see #8641)
- Make loop in addComponents0(...) method more verbose/readable (see
https://github.com/netty/netty/pull/8437#discussion_r232124414)
- Simplify addition/subtraction in setBytes(...) methods

Result

Smaller/clearer code
2019-02-01 10:32:02 +01:00
田欧
d7648f1d93 use checkPositive/checkPositiveOrZero (#8803)
Motivation:

We have a utility method to check for > 0 and >0 arguments. We should use it.

Modification:

use checkPositive/checkPositiveOrZero instead of if statement.

Result:

Re-use utility method.
2019-01-31 09:06:59 +01:00
田欧
6222101924 migrate java8: use lambda and method reference (#8781)
Motivation:

We can use lambdas now as we use Java8.

Modification:

use lambda function for all package, #8751 only migrate transport package.

Result:

Code cleanup.
2019-01-29 14:06:05 +01:00
田欧
e941cbe27a remove unused import statement (#8792)
Motivation:
The code contained some unused import statements.

Modification:
Remove unused import statements.

Result:
Code cleanup
2019-01-28 16:50:15 +01:00
kezhenxu94
7b6336f1fd Java 8 Migration: remove uneccessary if statement (#8755)
Motivation:

As netty 4.x supported Java 6 we had various if statements to check for java versions < 8. We can remove these now.

Modification:

Remove unnecessary if statements that check for java versions < 8.

Result:

Cleanup code.
2019-01-25 08:57:11 +01:00
Norman Maurer
310f31b392
Update to new checkstyle plugin (#8777)
Motivation:

We need to update to a new checkstyle plugin to allow the usage of lambdas.

Modifications:

- Update to new plugin version.
- Fix checkstyle problems.

Result:

Be able to use checkstyle plugin which supports new Java syntax.
2019-01-24 16:24:19 +01:00
Nick Hill
9e7385ddba Fix three bugs in CompositeByteBuf (#8773)
Motivation

In #8758, @doom369 reported an infinite loop bug in CompositeByteBuf
which was introduced in #8437.

This is the same small fix for that, along with fixes for two other bugs
found while re-inspecting the changes and adding unit tests.

Modification

- Replace recursive call to toComponentIndex with toComponentIndex0 as
intended
- Add missed "lastAccessed" racy cache invalidation in capacity(int)
method
- Fix incorrect determination of initial offset in non-zero cIndex case
of updateComponentOffsets method
- New unit tests for previously uncovered methods

Results

Fewer bugs.
2019-01-24 14:59:33 +01:00
Dmitriy Dumanskiy
4a10357fd8 IDE warnings cleanup (#8768)
Motivation:

IDE shows some warnings

Modification:

Small IDE warnings cleanup in different places.

Result:

Less warnings
2019-01-23 14:01:48 +01:00
Dmitriy Dumanskiy
7b92ff2500 Java 8 migration. Remove ThreadLocalProvider and inline java.util.concurrent.ThreadLocalRandom.current() where necessary. (#8762)
Motivation:

Custom Netty ThreadLocalRandom and ThreadLocalRandomProvider classes are no longer needed and can be removed.

Modification:

Remove own ThreadLocalRandom

Result:

Less code to maintain
2019-01-22 20:14:28 +01:00
田欧
9d62deeb6f Java 8 migration: Use diamond operator (#8749)
Motivation:

We can use the diamond operator these days.

Modification:

Use diamond operator whenever possible.

Result:

More modern code and less boiler-plate.
2019-01-22 16:07:26 +01:00
Dmitriy Dumanskiy
b2a9a8e9fe Java 8 migration. Removed custom LongCounter and LongAdderCounter classes and replaced with direct reference to LongAdder (#8750)
Motivation:

After switching to Java 8 we no longer need LongCounter and LongAdderCounter. We can directly replace them with LongAdder.

Modification:

Removed LongCounter and LongAdderCounter classes.

Result:

Less code to maintain
2019-01-22 13:53:28 +01:00
Derek Lewis
13f640cbba Updating ByteBuf Javadocs to represent actual behaviour. (#8709)
Motivation:

The javadocs stating `IndexOutOfBoundsException` is thrown were
different from what `ByteBuf` actually did. We want to ensure the
Javadocs represent reality.

Modifications:

Updated javadocs on `write*`, `ensureWriteable`, `capacity`, and
`maxCapacity` methods.

Results:

Javadocs more closely match actual behaviour.
2019-01-14 20:09:18 +01:00
kashike
c0aa1ea5c7 Fix minor spelling issues in javadocs (#8701)
Motivation:

Javadocs contained some spelling errors, we should fix these.

Modification:

Fix spelling

Result:

Javadoc cleanup.
2019-01-14 07:25:13 +01:00
Norman Maurer
d9a6cf341c
Remove support for marking reader and writerIndex in ByteBuf to reduce overhead and complexity. (#8636)
Motivation:

ByteBuf supports “marker indexes”. The intended use case for these is if a speculative operation (e.g. decode) is in process the user can “mark” and interface and refer to it later if the operation isn’t successful (e.g. not enough data). However this is rarely used in practice,
requires extra memory to maintain, and introduces complexity in the state management for derived/pooled buffer initialization, resizing, and other operations which may modify reader/writer indexes.

Modifications:

Remove support for marking and adjust testcases / code.

Result:

Fixes https://github.com/netty/netty/issues/8535.
2018-12-11 14:00:49 +01:00
Nick Travers
0a8c93a6fa Loosen bounds check on CompositeByteBuf's maxNumComponents (#8621)
Motivation:

In versions of Netty prior to 4.1.31.Final, a CompositeByteBuf could be
created with any size (including potentially nonsensical negative
values). This behavior changed in e7737b993, which introduced a bounds
check to only allow for a component size greater than one. This broke
some existing use cases that attempted to create a byte buf with a
single component.

Modifications:

Lower the bounds check on numComponents to include the single component
case, but still throw an exception for anything less than one.

Add unit tests for the case of numComponents being less than, equal to,
and greater than this lower bound.

Result:

Return to the behavior of 4.1.30.Final, allowing one component, but
still include an explicit check against a lower bound.

Note that while creating a CompositeByteBuf with a single component is
in some ways a contradiction of the term "composite", this patch caters
for existing uses while excluding the clearly nonsensical case of asking
for a CompositeByteBuf with zero or fewer components.

Fixes #8613.
2018-12-05 08:43:53 +01:00
Norman Maurer
85c1590a90 Provide a way to cache the internal nioBuffer of the PooledByteBuffer… (#8603)
Motivation:

Often a temporary ByteBuffer is used which can be cached to reduce the GC pressure.

Modifications:

Cache the ByteBuffer in the PoolThreadCache as well.

Result:

Less GC.
2018-12-04 15:39:55 +01:00
Nick Hill
fc0a4e15ab Harden ref-counting concurrency semantics (#8583)
Motivation

#8563 highlighted race conditions introduced by the prior optimistic
update optimization in 83a19d5650. These
were known at the time but considered acceptable given the perf
benefit in high contention scenarios.

This PR proposes a modified approach which provides roughly half the
gains but stronger concurrency semantics. Race conditions still exist
but their scope is narrowed to much less likely cases (releases
coinciding with retain overflow), and even in those
cases certain guarantees are still assured. Once release() returns true,
all subsequent release/retains are guaranteed to throw, and in
particular deallocate will be called at most once.

Modifications

- Use even numbers internally (including -ve) for live refcounts
- "Final" release changes to odd number (equivalent to refcount 0)
- Retain still uses faster getAndAdd, release uses CAS loop
- First CAS attempt uses non-volatile read
- Thread.yield() after a failed CAS provides a net gain

Result

More (though not completely) robust concurrency semantics for ref
counting; increased latency under high contention, but still roughly
twice as fast as the original logic. Bench results to follow
2018-11-29 08:32:50 +01:00
Norman Maurer
42e0ce6f7a Move less common code-path to extra method to allow inlining of writeUtf8. (#8600)
Motivation:

ByteBuf is used everywhere so we should try hard to be able to make things inlinable. During benchmarks it showed that writeCharSequence(...) fails to inline writeUtf8 because it is too big even if its hots.

Modifications:

Move less common code-path to extra method to allow inlining.

Result:

Be able to inline writeUtf8 in most cases.
2018-11-27 21:04:13 +01:00
Norman Maurer
45e9fd4802 Revert "Provide a way to cache the internal nioBuffer of the PooledByteBuffer to reduce GC. (#8593)"
This reverts commit 945f123f8b as it seems to produce some failures in some cases. This needs more research.
2018-11-27 20:03:33 +01:00
Norman Maurer
945f123f8b Provide a way to cache the internal nioBuffer of the PooledByteBuffer to reduce GC. (#8593)
Motivation:

Often a temporary ByteBuffer is used which can be cached to reduce the GC pressure.

Modifications:

Add a Deque per PoolChunk which will be used for caching.

Result:

Less GC.
2018-11-27 13:55:24 +01:00
Rolandz
836c39b82f Fix offset calculation in PooledByteBufAllocator when used
Motivation:

When we create new chunk with memory aligned, the offset of direct memory should be
'alignment - address & (alignment - 1)', not just 'address & (alignment - 1)'.

Modification:

Change offset calculating formula to offset = alignment - address & (alignment - 1) in PoolArena.DirectArena#offsetCacheLine and add a unit test to assert that.

Result:

Correctly calculate offset.
2018-11-27 11:47:40 +01:00
Norman Maurer
2c78dde749 Update version number to start working on Netty 5 2018-11-20 15:49:57 +01:00
Nick Hill
804e1fa9cc Fix ref-counting when CompositeByteBuf is used with retainedSlice() (#8497)
Motivation:

ByteBuf.retainedSlice() and similar methods produce sliced buffers with
an independent refcount to the buffer that they wrap.


One of the optimizations in 10539f4dc7 was
to use the ref to the unwrapped buffer object for added slices, but this
did not take into account the above special case when later releasing.

Thanks to @rkapsi for discovering this via #8495.

Modifications:

Since a reference to the slice is still kept in the Component class,
just changed Component.freeIfNecessary() to release the slice in
preference to the unwrapped buf.

Also added a unit test which reproduces the bug.

Result:

Fixes #8495
2018-11-13 20:56:09 +01:00
Nick Hill
0f8ce1b284 Fix incorrect sizing of temp byte arrays in (Unsafe)ByteBufUtil (#8484)
Motivation:

Two similar bugs were introduced by myself in separate recent PRs #8393
and #8464, while optimizing the assignment/handling of temporary arrays
in ByteBufUtil and UnsafeByteBufUtil.

The temp arrays allocated for buffering data written to an OutputStream
are incorrectly sized to the full length of the data to copy rather than
being capped at WRITE_CHUNK_SIZE.

Unfortunately one of these is in the 4.1.31.Final release, I'm really
sorry and will be more careful in future.

This kind of thing is tricky to cover in unit tests.

Modifications:

Revert the temp array allocations back to their original sizes.

Avoid making duplicate calls to ByteBuf.capacity() in a couple of places
in ByteBufUtil (unrelated thing I noticed, can remove it from this PR if
desired!)

Result:

Temporary byte arrays will be reverted to their originally intended
sizes.
2018-11-09 18:24:38 +01:00
Nick Hill
5954110b9a Use ByteBufUtil.BYTE_ARRAYS ThreadLocal temporary arrays in more places (#8464)
Motivation:

#8388 introduced a reusable ThreadLocal<byte[]> for use in
decodeString(...). It can be used in more places in the buffer package
to avoid temporary allocations of small arrays.

Modifications:

Encapsulate use of the ThreadLocal in a static package-private
ByteBufUtil.threadLocalTempArray(int) method, and make use of it from a
handful of new places including ByteBufUtil.readBytes(...).

Result:

Fewer short-lived small byte array allocations.
2018-11-05 21:11:28 +01:00
Nick Hill
10539f4dc7 Streamline CompositeByteBuf internals (#8437)
Motivation:

CompositeByteBuf is a powerful and versatile abstraction, allowing for
manipulation of large data without copying bytes. There is still a
non-negligible cost to reading/writing however relative to "singular"
ByteBufs, and this can be mostly eliminated with some rework of the
internals.

My use case is message modification/transformation while zero-copy
proxying. For example replacing a string within a large message with one
of a different length

Modifications:

- No longer slice added buffers and unwrap added slices
   - Components store target buf offset relative to position in
composite buf
   - Less allocations, object footprint, pointer indirection, offset
arithmetic
- Use Component[] rather than ArrayList<Component>
   - Avoid pointer indirection and duplicate bounds check, more
efficient backing array growth
   - Facilitates optimization when doing bulk-inserts - inserting n
ByteBufs behind m is now O(m + n) instead of O(mn)
- Avoid unnecessary casting and method call indirection via superclass
- Eliminate some duplicate range/ref checks via non-checking versions of
toComponentIndex and findComponent
- Add simple fast-path for toComponentIndex(0); add racy cache of
last-accessed Component to findComponent(int)
- Override forEachByte0(...) and forEachByteDesc0(...) methods
- Make use of RecyclableArrayList in nioBuffers(int, int) (in line with
FasterCompositeByteBuf impl)
- Modify addComponents0(boolean,int,Iterable) to use the Iterable
directly rather than copy to an array first (and possibly to an
ArrayList before that)
- Optimize addComponents0(boolean,int,ByteBuf[],int) to not perform
repeated array insertions and avoid second loop for offset updates
- Simplify other logic in various places, in particular the general
pattern used where a sub-range is iterated over
- Add benchmarks to demonstrate some improvements

While refactoring I also came across a couple of clear bugs. They are
fixed in these changes but I will open another PR with unit tests and
fixes to the current version.

Result:

Much faster creation, manipulation, and access; many fewer allocations
and smaller footprint. Benchmark results to follow.
2018-11-03 10:37:07 +01:00
Nick Hill
44cca1a26f Avoid allocations when wrapping byte[] and ByteBuffer arrays as ByteBuf (#8420)
Motivation:

Unpooled.wrap(byte[]...) and Unpooled.wrap(ByteBuffer...) currently
allocate/copy an intermediate ByteBuf ArrayList and array, which can be
avoided.

Modifications:

- Define new internal ByteWrapper interface and add a CompositeByteBuf
constructor which takes a ByteWrapper with an array of the type that it
wraps, and modify the appropriate Unpooled.wrap(...) methods to take
advantage of it
- Tidy up other constructors in CompositeByteBuf to remove duplication
and misleading len arg (which is really an end offset into provided
array)

Result:

Less allocation/copying when wrapping byte[] and ByteBuffer arrays,
tidier code.
2018-10-30 19:35:39 +01:00
root
3e7ddb36c7 [maven-release-plugin] prepare for next development iteration 2018-10-29 15:38:51 +00:00
root
9e50739601 [maven-release-plugin] prepare release netty-4.1.31.Final 2018-10-29 15:37:47 +00:00
Nick Hill
48c45cf4ac Fix leak and corruption bugs in CompositeByteBuf (#8438)
Motivation:

I came across two bugs:
- Components removed due to capacity reduction aren't released
- Offsets aren't set correctly on empty components that are added
between existing components

Modifications:

Add unit tests which expose these bugs, fix them.

Result:

Bugs are fixed
2018-10-28 10:28:18 +01:00
Nick Hill
d7fa7be67f Exploit PlatformDependent.allocateUninitializedArray() in more places (#8393)
Motivation:

There are currently many more places where this could be used which were
possibly not considered when the method was added.

If https://github.com/netty/netty/pull/8388 is included in its current
form, a number of these places could additionally make use of the same
BYTE_ARRAYS threadlocal.

There's also a couple of adjacent places where an optimistically-pooled
heap buffer is used for temp byte storage which could use the
threadlocal too in preference to allocating a temp heap bytebuf wrapper.
For example
https://github.com/netty/netty/blob/4.1/buffer/src/main/java/io/netty/buffer/ByteBufUtil.java#L1417.

Modifications:

Replace new byte[] with PlatformDependent.allocateUninitializedArray()
where appropriate; make use of ByteBufUtil.getBytes() in some places
which currently perform the equivalent logic, including avoiding copy of
backing array if possible (although would be rare).

Result:

Further potential speed-up with java9+ and appropriate compile flags.
Many of these places could be on latency-sensitive code paths.
2018-10-27 10:43:28 -05:00
Nick Hill
583d838f7c Optimize AbstractByteBuf.getCharSequence() in US_ASCII case (#8392)
* Optimize AbstractByteBuf.getCharSequence() in US_ASCII case

Motivation:

Inspired by https://github.com/netty/netty/pull/8388, I noticed this
simple optimization to avoid char[] allocation (also suggested in a TODO
here).

Modifications:

Return an AsciiString from AbstractByteBuf.getCharSequence() if
requested charset is US_ASCII or ISO_8859_1 (latter thanks to
@Scottmitch's suggestion). Also tweak unit tests not to require Strings
and include a new benchmark to demonstrate the speedup.

Result:

Speed-up of AbstractByteBuf.getCharSequence() in ascii and iso 8859/1
cases
2018-10-26 15:32:38 -07:00
Norman Maurer
87ec2f882a
Reduce overhead by ByteBufUtil.decodeString(...) which is used by AbstractByteBuf.toString(...) and AbstractByteBuf.getCharSequence(...) (#8388)
Motivation:

Our current implementation that is used for toString(Charset) operations on AbstractByteBuf implementation is quite slow as it does a lot of uncessary memory copies. We should just use new String(...) as it has a lot of optimizations to handle these cases.

Modifications:

Rewrite ByteBufUtil.decodeString(...) to use new String(...)

Result:

Less overhead for toString(Charset) operations.

Benchmark                                         (charsetName)  (direct)  (size)   Mode  Cnt         Score         Error  Units
ByteBufUtilDecodeStringBenchmark.decodeString          US-ASCII     false       8  thrpt   20  22401645.093 ? 4671452.479  ops/s
ByteBufUtilDecodeStringBenchmark.decodeString          US-ASCII     false      64  thrpt   20  23678483.384 ? 3749164.446  ops/s
ByteBufUtilDecodeStringBenchmark.decodeString          US-ASCII      true       8  thrpt   20  15731142.651 ? 3782931.591  ops/s
ByteBufUtilDecodeStringBenchmark.decodeString          US-ASCII      true      64  thrpt   20  16244232.229 ? 1886259.658  ops/s
ByteBufUtilDecodeStringBenchmark.decodeString             UTF-8     false       8  thrpt   20  25983680.959 ? 5045782.289  ops/s
ByteBufUtilDecodeStringBenchmark.decodeString             UTF-8     false      64  thrpt   20  26235589.339 ? 2867004.950  ops/s
ByteBufUtilDecodeStringBenchmark.decodeString             UTF-8      true       8  thrpt   20  18499027.808 ? 4784684.268  ops/s
ByteBufUtilDecodeStringBenchmark.decodeString             UTF-8      true      64  thrpt   20  16825286.141 ? 1008712.342  ops/s
ByteBufUtilDecodeStringBenchmark.decodeString            UTF-16     false       8  thrpt   20   5789879.092 ? 1201786.359  ops/s
ByteBufUtilDecodeStringBenchmark.decodeString            UTF-16     false      64  thrpt   20   2173243.225 ?  417809.341  ops/s
ByteBufUtilDecodeStringBenchmark.decodeString            UTF-16      true       8  thrpt   20   5035583.011 ? 1001978.854  ops/s
ByteBufUtilDecodeStringBenchmark.decodeString            UTF-16      true      64  thrpt   20   2162345.301 ?  402410.408  ops/s
ByteBufUtilDecodeStringBenchmark.decodeString        ISO-8859-1     false       8  thrpt   20  30039052.376 ? 6539111.622  ops/s
ByteBufUtilDecodeStringBenchmark.decodeString        ISO-8859-1     false      64  thrpt   20  31414163.515 ? 2096710.526  ops/s
ByteBufUtilDecodeStringBenchmark.decodeString        ISO-8859-1      true       8  thrpt   20  19538587.855 ? 4639115.572  ops/s
ByteBufUtilDecodeStringBenchmark.decodeString        ISO-8859-1      true      64  thrpt   20  19467839.722 ? 1672687.213  ops/s
ByteBufUtilDecodeStringBenchmark.decodeStringOld       US-ASCII     false       8  thrpt   20  10787326.745 ? 1034197.864  ops/s
ByteBufUtilDecodeStringBenchmark.decodeStringOld       US-ASCII     false      64  thrpt   20   7129801.930 ? 1363019.209  ops/s
ByteBufUtilDecodeStringBenchmark.decodeStringOld       US-ASCII      true       8  thrpt   20   9002529.605 ? 2017642.445  ops/s
ByteBufUtilDecodeStringBenchmark.decodeStringOld       US-ASCII      true      64  thrpt   20   3860192.352 ?  826218.738  ops/s
ByteBufUtilDecodeStringBenchmark.decodeStringOld          UTF-8     false       8  thrpt   20  10532838.027 ? 2151743.968  ops/s
ByteBufUtilDecodeStringBenchmark.decodeStringOld          UTF-8     false      64  thrpt   20   7185554.597 ? 1387685.785  ops/s
ByteBufUtilDecodeStringBenchmark.decodeStringOld          UTF-8      true       8  thrpt   20   7352253.316 ? 1333823.850  ops/s
ByteBufUtilDecodeStringBenchmark.decodeStringOld          UTF-8      true      64  thrpt   20   2825578.707 ?  349701.156  ops/s
ByteBufUtilDecodeStringBenchmark.decodeStringOld         UTF-16     false       8  thrpt   20   7277446.665 ? 1447034.346  ops/s
ByteBufUtilDecodeStringBenchmark.decodeStringOld         UTF-16     false      64  thrpt   20   2445929.579 ?  562816.641  ops/s
ByteBufUtilDecodeStringBenchmark.decodeStringOld         UTF-16      true       8  thrpt   20   6201174.401 ? 1236137.786  ops/s
ByteBufUtilDecodeStringBenchmark.decodeStringOld         UTF-16      true      64  thrpt   20   2310674.973 ?  525587.959  ops/s
ByteBufUtilDecodeStringBenchmark.decodeStringOld     ISO-8859-1     false       8  thrpt   20  11142625.392 ? 1680556.468  ops/s
ByteBufUtilDecodeStringBenchmark.decodeStringOld     ISO-8859-1     false      64  thrpt   20   8127116.405 ? 1128513.860  ops/s
ByteBufUtilDecodeStringBenchmark.decodeStringOld     ISO-8859-1      true       8  thrpt   20   9405751.952 ? 2193324.806  ops/s
ByteBufUtilDecodeStringBenchmark.decodeStringOld     ISO-8859-1      true      64  thrpt   20   3943282.076 ?  737798.070  ops/s

Benchmark result is saved to /home/norman/mainframer/netty/microbench/target/reports/performance/ByteBufUtilDecodeStringBenchmark.json
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1,030.173 sec - in io.netty.buffer.ByteBufUtilDecodeStringBenchmark
[1030.460s][info   ][gc,heap,exit ] Heap
[1030.460s][info   ][gc,heap,exit ]  garbage-first heap   total 516096K, used 257918K [0x0000000609a00000, 0x0000000800000000)
[1030.460s][info   ][gc,heap,exit ]   region size 2048K, 127 young (260096K), 2 survivors (4096K)
[1030.460s][info   ][gc,heap,exit ]  Metaspace       used 17123K, capacity 17438K, committed 17792K, reserved 1064960K
[1030.460s][info   ][gc,heap,exit ]   class space    used 1709K, capacity 1827K, committed 1920K, reserved 1048576K
2018-10-19 14:00:13 +02:00
Norman Maurer
69545aedc4
CompositeByteBuf.decompose(...) does not correctly slice content. (#8403)
Motivation:

CompositeByteBuf.decompose(...) did not correctly slice the content and so produced an incorrect representation of the data.

Modifications:

- Rewrote implementation to fix bug and also improved it to reduce GC
- Add unit tests.

Result:

Fixes https://github.com/netty/netty/issues/8400.
2018-10-19 08:05:22 +02:00
Nick Hill
7062ceedb0 Simplify ByteBufInputStream.readLine() logic (#8380)
Motivation:

While looking at the nice optimization done in
https://github.com/netty/netty/pull/8347 I couldn't help noticing the
logic could be simplified further. Apologies if this is just my OCD and
inappropriate!

Modifications:

Reduce amount of code used for ByteBufInputStream.readLine()

Result:

Slightly smaller and simpler code
2018-10-13 06:24:40 +02:00