440 Commits

Author SHA1 Message Date
Trustin Lee
fb5583d788 Refactoring in preparation to unify I/O logic for all branches
Motivation:

While trying to merge our ChannelOutboundBuffer changes we've made last
week, I realized that we have quite a bit of conflicting changes at 4.1
and master. It was primarily because we added
ChannelOutboundBuffer.beforeAdd() and moved some logic there, such as
direct buffer conversion.

However, this is not possible with the changes we've made for 4.0. We
made ChannelOutboundBuffer final for example.

Maintaining multiple branch is already getting painful and having
different core will make it even worse, so I think we should keep the
differences between 4.0 and other branches minimal.

Modifications:

- Move ChannelOutboundBuffer.safeRelease() to ReferenceCountUtil
- Add ByteBufUtil.threadLocalBuffer()
  - Backported from ThreadLocalPooledDirectByteBuf
- Make most methods in AbstractUnsafe final
  - Add AbstractChannel.filterOutboundMessage() so that a transport can
    convert a message to another (e.g. heap -> off-heap), and also
    reject unsupported messages
  - Move all direct buffer conversions to filterOutboundMessage()
  - Move all type checks to filterOutboundMessage()
- Move AbstractChannel.checkEOF() to OioByteStreamChannel, because it's
  the only place it is used at all
- Remove ChannelOutboundBuffer.current(Object), because it's not used
  anymore
- Add protected direct buffer conversion methods to AbstractNioChannel
  and AbstractEpollChannel so that they can be used by their subtypes
- Update all transport implementations according to the changes above

Result:

- The missing extension point in 4.0 has been added.
  - AbstractChannel.filterOutboundMessage()
  - Thanks to the new extension point, we moved all transport-specific
    logic from ChannelOutboundBuffer to each transport implementation
- We can copy most of the transport implementations in 4.0 to 4.1 and
  master now, so that we have much less merge conflict when we modify
  the core.
2014-08-05 08:04:23 +02:00
Trustin Lee
07801d7b38 Remove duplicate range check in AbstractByteBuf.skipBytes() 2014-07-29 15:58:38 -07:00
Idel Pivnitskiy
01b11ca2cb Small performance improvements
Modifications:

- Added a static modifier for CompositeByteBuf.Component.
This class is an inner class, but does not use its embedded reference to the object which created it. This reference makes the instances of the class larger, and may keep the reference to the creator object alive longer than necessary.
A boxed primitive is created from a String, just to extract the unboxed primitive value.
- Removed unnecessary checks if file exists before call mkdirs() in NativeLibraryLoader and PlatformDependent.
Because the method mkdirs() has this check inside.

Conflicts:
	codec-http/src/main/java/io/netty/handler/codec/http/multipart/DiskAttribute.java
	codec-stomp/src/main/java/io/netty/handler/codec/stomp/StompSubframeAggregator.java
	codec-stomp/src/main/java/io/netty/handler/codec/stomp/StompSubframeDecoder.java
2014-07-20 09:29:33 +02:00
Norman Maurer
c4d0a87e19 [#2653] Remove unnecessary ensureAccessible() calls
Motivation:

I introduced ensureAccessible() class as part of 6c47cc97111146396d2daf1a97051135d2eaf69e in some places. Unfortunally I also added some where these are not needed and so caused a performance regression.

Modification:

Remove calls where not needed.

Result:

Fixed performance regression.
2014-07-14 21:02:07 +02:00
Norman Maurer
ccd88596f3 [#2653] Remove uncessary range checks for performance reasons
Motivation:

I introduced range checks as part of 6c47cc97111146396d2daf1a97051135d2eaf69e in some places. Unfortunally I also added some where these are not needed and so caused a performance regression.

Modification:

Remove range checks where not needed

Result:

Fixed performance regression.
2014-07-14 11:40:27 +02:00
Brendt Lucas
5061be1b03 [#2642] CompositeByteBuf.deallocate memory/GC improvement
Motivation:

CompositeByteBuf.deallocate generates unnecessary GC pressure when using the 'foreach' loop, as a 'foreach' loop creates an iterator when looping.

Modification:

Convert 'foreach' loop into regular 'for' loop.

Result:

Less GC pressure (and possibly more throughput) as the 'for' loop does not create an iterator
2014-07-08 21:08:34 +02:00
Trustin Lee
3d8d843198 Fix the build timeout when 'leak' profile is active
Motivation:

AbstractByteBufTest.testInternalBuffer() uses writeByte() operations to
populate the sample data.  Usually, this isn't a problem, but it starts
to take a lot of time when the resource leak detection level gets
higher.

In our CI machine, testInternalBuffer() takes more than 30 minutes,
causing the build timeout when the 'leak' profile is active (paranoid
level resource detection.)

Modification:

Populate the sample data using ThreadLocalRandom.nextBytes() instead of
using millions of writeByte() operations.

Result:

Test runs much faster when leak detection level is high.
2014-07-03 17:55:18 +09:00
Trustin Lee
0a8ff3b52d Fix most inspector warnings
Motivation:

It's good to minimize potentially broken windows.

Modifications:

Fix most inspector warnings from our profile
Update IntObjectHashMap

Result:

Cleaner code
2014-07-02 20:21:30 +09:00
Norman Maurer
e8f4def2a3 [maven-release-plugin] prepare for next development iteration 2014-06-30 14:31:08 +02:00
Norman Maurer
25e3c8ce3d [maven-release-plugin] prepare release netty-4.0.21.Final 2014-06-30 14:29:15 +02:00
Norman Maurer
6c47cc9711 [#2622] Correctly check reference count before try to work on the underlying memory
Motivation:

Because of how we use reference counting we need to check for the reference count before each operation that touches the underlying memory. This is especially true as we use sun.misc.Cleaner.clean() to release the memory ASAP when possible. Because of this the user may cause a SEGFAULT if an operation is called that tries to access the backing memory after it was released.

Modification:

Correctly check the reference count on all methods that access the underlying memory or expose it via a ByteBuffer.

Result:

Safer usage of ByteBuf
2014-06-30 07:10:12 +02:00
Trustin Lee
d8d0bbfc26 Optimize PoolChunk
- Using short[] for memoryMap did not improve performance. Reverting
  back to the original dual-byte[] structure in favor of simplicity.
- Optimize allocateRun() which yields small performence improvement
- Use local variable when member fields are accessed more than once
2014-06-26 17:06:29 +09:00
Trustin Lee
c41538050c Fix inspector warnings 2014-06-26 17:06:29 +09:00
Pavan Kumar
d2e36a49c7 Improve the allocation algorithm in PoolChunk
Motivation:

Depth-first search is not always efficient for buddy allocation.

Modification:

Employ a new faster search algorithm with different memoryMap layout.

Result:

With thread-local cache disabled, we see a lot of performance
improvment, especially when the size of the allocation is as small as
the page size, which had the largest search space previously.
2014-06-26 17:06:29 +09:00
Trustin Lee
4a13f66e13 Remove 'get' prefix from all HTTP/SPDY messages
Motivation:

Persuit for the consistency in method naming

Modifications:

- Remove the 'get' prefix from all HTTP/SPDY message classes
- Fix some inspector warnings

Result:

Consistency
Fixes #2594
2014-06-24 18:33:30 +09:00
Norman Maurer
22f16e52bf MessageToByteEncoder always starts with ByteBuf that use initalCapacity == 0
Motivation:

MessageToByteEncoder always starts with ByteBuf that use initalCapacity == 0 when preferDirect is used. This is really wasteful in terms of performance as every first write into the buffer will cause an expand of the buffer itself.

Modifications:

 - Change ByteBufAllocator.ioBuffer() use the same default initialCapacity as heapBuffer() and directBuffer()
 - Add new allocateBuffer method to MessageToByteEncoder that allow the user to do some smarter allocation based on the message that will be encoded.

Result:

Less expanding of buffer and more flexibilty when allocate the buffer for encoding.
2014-06-24 13:55:02 +09:00
Trustin Lee
7162d96ed5 Revert "Improve the allocation algorithm in PoolChunk"
This reverts commit 36305d7dcee60bac9d353ba12e044c260435da57, which
seems to cause an assertion failure on our CI machine.
2014-06-21 19:19:49 +09:00
Pavan Kumar
004ffbad90 Improve the allocation algorithm in PoolChunk
Motivation:

Depth-first search is not always efficient for buddy allocation.

Modification:

Employ a new faster search algorithm with different memoryMap layout.

Result:

With thread-local cache disabled, we see a lot of performance
improvment, especially when the size of the allocation is as small as
the page size, which had the largest search space previously:

-- master head --
Benchmark                (size) Mode    Score  Error Units
pooledDirectAllocAndFree  8192 thrpt  215.392  1.565 ops/ms
pooledDirectAllocAndFree 16384 thrpt  594.625  2.154 ops/ms
pooledDirectAllocAndFree 65536 thrpt 1221.520 18.965 ops/ms
pooledHeapAllocAndFree    8192 thrpt  217.175  1.653 ops/ms
pooledHeapAllocAndFree   16384 thrpt  587.250 14.827 ops/ms
pooledHeapAllocAndFree   65536 thrpt 1217.023 44.963 ops/ms

-- changes --
Benchmark                (size) Mode    Score  Error Units
pooledDirectAllocAndFree  8192 thrpt 3656.744 94.093 ops/ms
pooledDirectAllocAndFree 16384 thrpt 4087.152 22.921 ops/ms
pooledDirectAllocAndFree 65536 thrpt 4058.814 29.276 ops/ms
pooledHeapAllocAndFree    8192 thrpt 3640.355 44.418 ops/ms
pooledHeapAllocAndFree   16384 thrpt 4030.206 24.365 ops/ms
pooledHeapAllocAndFree   65536 thrpt 4103.991 70.991 ops/ms
2014-06-21 13:20:56 +09:00
Norman Maurer
19a1b603d0 Remove System.out.println(...) debug messages 2014-06-20 19:42:08 +02:00
Norman Maurer
d2b8560a76 [#2580] [#2587] Fix buffer corruption regression when ByteBuf.order(LITTLE_ENDIAN) is used
Motivation:

To improve the speed of ByteBuf with order LITTLE_ENDIAN and where the native order is also LITTLE_ENDIAN (intel) we introduces a new special SwappedByteBuf before in commit 4ad3984c8b725ef59856d174d09d1209d65933fc. Unfortunally the commit has a flaw which does not handle correctly the case when a ByteBuf expands. This was caused because the memoryAddress was cached and never changed again even if the underlying buffer expanded. This can lead to corrupt data or even to SEGFAULT the JVM if you are lucky enough.

Modification:

Always lookup the actual memoryAddress of the wrapped ByteBuf.

Result:

No more data-corruption for ByteBuf with order LITTLE_ENDIAN and no JVM crashes.
2014-06-20 18:25:54 +02:00
Trustin Lee
fb538ea532 Refactor FastThreadLocal to simplify TLV management
Motivation:

When Netty runs in a managed environment such as web application server,
Netty needs to provide an explicit way to remove the thread-local
variables it created to prevent class loader leaks.

FastThreadLocal uses different execution paths for storing a
thread-local variable depending on the type of the current thread.
It increases the complexity of thread-local removal.

Modifications:

- Moved FastThreadLocal and FastThreadLocalThread out of the internal
  package so that a user can use it.
- FastThreadLocal now keeps track of all thread local variables it has
  initialized, and calling FastThreadLocal.removeAll() will remove all
  thread-local variables of the caller thread.
- Added FastThreadLocal.size() for diagnostics and tests
- Introduce InternalThreadLocalMap which is a mixture of hard-wired
  thread local variable fields and extensible indexed variables
- FastThreadLocal now uses InternalThreadLocalMap to implement a
  thread-local variable.
- Added ThreadDeathWatcher.unwatch() so that PooledByteBufAllocator
  tells it to stop watching when its thread-local cache has been freed
  by FastThreadLocal.removeAll().
- Added FastThreadLocalTest to ensure that removeAll() works
- Added microbenchmark for FastThreadLocal and JDK ThreadLocal
- Upgraded to JMH 0.9

Result:

- A user can remove all thread-local variables Netty created, as long as
  he or she did not exit from the current thread. (Note that there's no
  way to remove a thread-local variable from outside of the thread.)
- FastThreadLocal exposes more useful operations such as isSet() because
  we always implement a thread local variable via InternalThreadLocalMap
  instead of falling back to JDK ThreadLocal.
- FastThreadLocalBenchmark shows that this change improves the
  performance of FastThreadLocal even more.
2014-06-19 21:08:16 +09:00
Norman Maurer
6fdf1138ca [#2573] UnpooledUnsafeDirectByteBuf.setBytes(int,ByteBuf,int,int) fails to use fast-path when src has array
Motivation:

UnpooledUnsafeDirectByteBuf.setBytes(int,ByteBuf,int,int) fails to use fast-path when src uses an array as backing storage. This is because the if else uses the wrong ByteBuf for its check.

Modifications:

- Use correct ByteBuf when check for array as backing storage
- Also eliminate unecessary check in UnpooledDirectByteBuf which always fails anyway

Result:

Faster setBytes(...) when src ByteBuf is backed by an array.

No more IndexOutOfBoundsException or data-corruption.
2014-06-16 11:11:54 +02:00
Norman Maurer
b737d631f1 [maven-release-plugin] prepare for next development iteration 2014-06-12 16:20:52 +02:00
Norman Maurer
1709113a1f [maven-release-plugin] prepare release netty-4.0.20.Final 2014-06-12 16:14:48 +02:00
Norman Maurer
76043bc8c8 Make use of an array to store FastThreadLocals and so allow to also use it in PooledByteBufAllocator that is instanced by users.
Motivation:
Allow to make use of our new FastThreadLocal whereever possible

Modification:
Make use of an array to store FastThreadLocals and so allow to also use it in PooledByteBufAllocator that is instanced by users.
The maximal size of the array is configurable per system property to allow to tune it if needed. As default we use 64 entries which should be good enough.

Result:
More flexible usage of FastThreadLocal
2014-06-12 15:43:20 +02:00
belliottsmith
1ac2ff8d7b Introduce FastThreadLocal which uses an EnumMap and a predefined fixed set of possible thread locals
Motivation:
Provide a faster ThreadLocal implementation

Modification:
Add a "FastThreadLocal" which uses an EnumMap and a predefined fixed set of possible thread locals (all of the static instances created by netty) that is around 10-20% faster than standard ThreadLocal in my benchmarks (and can be seen having an effect in the direct PooledByteBufAllocator benchmark that uses the DEFAULT ByteBufAllocator which uses this FastThreadLocal, as opposed to normal instantiations that do not, and in the new RecyclableArrayList benchmark);

Result:
Improved performance
2014-06-12 15:43:20 +02:00
Norman Maurer
4ad3984c8b [#2436] Unsafe*ByteBuf implementation should only invert bytes if ByteOrder differ from native ByteOrder
Motivation:
Our Unsafe*ByteBuf implementation always invert bytes when the native ByteOrder is LITTLE_ENDIAN (this is true on intel), even when the user calls order(ByteOrder.LITTLE_ENDIAN). This is not optimal for performance reasons as the user should be able to set the ByteOrder to LITTLE_ENDIAN and so write bytes without the extra inverting.

Modification:
- Introduce a new special SwappedByteBuf (called UnsafeDirectSwappedByteBuf) that is used by all the Unsafe*ByteBuf implementation and allows to write without inverting the bytes.
- Add benchmark
- Upgrade jmh to 0.8

Result:
The user is be able to get the max performance even on servers that have ByteOrder.LITTLE_ENDIAN as their native ByteOrder.
2014-06-05 11:09:58 +02:00
Trustin Lee
0928e28385 Use Java 5 foreach for arrays for brevity at no cost 2014-06-02 18:25:42 +09:00
Trustin Lee
ed6df98653 Introduce ThreadDeathWatcher
Motivation:

PooledByteBufAllocator's thread local cache and
ReferenceCountUtil.releaseLater() are in need of a way to run an
arbitrary logic when a certain thread is terminated.

Modifications:

- Add ThreadDeathWatcher, which spawns a low-priority daemon thread
  that watches a list of threads periodically (every second) and
  invokes the specified tasks when the associated threads are not alive
  anymore
  - Start-stop logic based on CAS operation proposed by @tea-dragon
- Add debug-level log messages to see if ThreadDeathWatcher works

Result:

- Fixes #2519 because we don't use GlobalEventExecutor anymore
- Cleaner code
2014-06-02 18:23:48 +09:00
Trustin Lee
350ac9787e Do not use a pseudo random for tree traversal
Motivation:

If we make allocateRun/SubpageSimple() always try the left node first and make allocateRun/Subpage() always tries the right node first,  it is more likely that allocateRun/Subpage() will find a node with ST_UNUSED sooner.

Modifications:

- Make allocateRunSimple() and allocateSubpageSimple() always try the left node first.
- Make allocateRun() and allocateSubpage() always try the right node first.
- Remove randome

Result:

We get the same performance without using random numbers.
2014-05-30 11:24:22 +09:00
Trustin Lee
3e7dbe072e Optimize PooledByteBufAllocator
Motivation:

We still have a room for improvement in PoolChunk.allocateRun() and
Subpage.allocate().

Modifications:

- Unroll the recursion in PoolChunk.allocateRun()
- Subpage.allocate() makes use of the 'nextAvail' value set by previous
  free().

Result:

- PoolChunk.allocateRun() optimization yields 10%+ improvements in
  allocation throughput for non-subpage allocations.
- Subpage.allocate() optimization makes the subpage allocations for
  tiny buffers as fast as non-tiny buffers even when the pageSize is
  huge (e.g. 1048576) because it doesn't need to perform a linear search
  in most cases.
2014-05-30 10:51:27 +09:00
Jake Luciani
795507aa7b Fix capacity check bug affecting offheap buffers 2014-05-13 07:24:56 +02:00
Norman Maurer
a597087a9f [maven-release-plugin] prepare for next development iteration 2014-04-30 15:40:54 +02:00
Norman Maurer
b562148e2d [maven-release-plugin] prepare release netty-4.0.19.Final 2014-04-30 15:40:31 +02:00
ian
fde13d96f9 Fix error that causes (up to) double memory usage
Motivation:

PoolArena's 'normalizeCapacity' function was micro-optimized some
time ago to remove a while loop. However, there was a change of
behavior in the function as a result. Capacities passed into it
that are already powers of 2 (and >= 512) are doubled in size. So
if I ask for a buffer with a capacity of 1024, I will get back one
that actually uses 2048 bytes (stored in maxLength).

Aligning to powers of two for book keeping ease is reasonable,
and if someone tries to expand a buffer, you might as well use some
of the previously wasted space. However, since this distinction
between 'easily expanded' and 'costly to expand' space is not
supported at all by the APIs, I cannot imagine this change to
doubling is desirable or intentional.

This is especially costly when using composite buffers. They
frequently allocate components with a capacity that is a power of
2, and they never attempt to expand components themselves. The end
result is that heavy use of pool-backed composite buffers wastes
almost half of the memory pool (the smaller / initial components are
<512 and so are not affected by the off-by-one bug).

Modifications:

Although I find it difficult to believe that such an optimization
is really helpful, I left it in and fixed the off-by-one issue by
decrementing the value at the start.

I also added a simple test to both attempt to verify that the
decrement fixes the issue without introducing any other change, and
to make it easy for a reviewer to test the existing behavior. PoolArena
does not seem to have much testing or testability support though so
the test is kind of a hack and will break for unrelated changes. I
suggest either removing it or factoring out the single non-static
portion of normalizeCapacity so that the fragile dummy PoolArena is
not required.

Result:

Pooled allocators will allocate less resources to the highly
inefficient and undocumented buffer section between length and
maxLength.

Composite buffers of non-trivial size that are backed by pooled
allocators will use about half as much memory.
2014-04-15 07:02:49 +02:00
Norman Maurer
9c934116f2 [#2370] Periodically check for not alive Threads and free up their ThreadPoolCache
Motivation:
At the moment we create new ThreadPoolCache whenever a Thread tries either allocate or release something on the PooledByteBufAllocator. When something is released we put it then in its ThreadPoolCache. The problem is we never check if a Thread is not alive anymore and so we may end up with memory that is never freed again if a user create many short living Threads that use the PooledByteBufAllocator.

Modifications:
Periodically check if the Thread is still alive that has a ThreadPoolCache assinged and if not free it.

Result:
Memory is freed up correctly even for short living Threads.
2014-04-09 11:44:51 +02:00
Norman Maurer
816165c96a [maven-release-plugin] prepare for next development iteration 2014-04-01 07:21:40 +02:00
Norman Maurer
1512a4dcca [maven-release-plugin] prepare release netty-4.0.18.Final 2014-04-01 07:20:16 +02:00
Norman Maurer
13fd69e871 Implement Thread caches for pooled buffers to minimize conditions. This fixes [#2264] and [#808].
Motivation:
Remove the synchronization bottleneck in PoolArena and so speed up things

Modifications:

This implementation uses kind of the same technics as outlined in the jemalloc paper and jemalloc
blogpost https://www.facebook.com/notes/facebook-engineering/scalable-memory-allocation-using-jemalloc/480222803919.

At the moment we only cache for "known" Threads (that powers EventExecutors) and not for others to keep the overhead
minimal when need to free up unused buffers in the cache and free up cached buffers once the Thread completes. Here
we use multi-level caches for tiny, small and normal allocations. Huge allocations are not cached at all to keep the
memory usage at a sane level. All the different cache configurations can be adjusted via system properties or the constructor
directly where it makes sense.

Result:
Less conditions as most allocations can be served by the cache itself
2014-03-20 09:18:04 -07:00
Jakob Buchgraber
17ba35b6d0 Bit tricks to check for and calculate power of two.
Motivation:
I was studying the code and thought this was simpler and easier to
understand.

Modifications:
Replaced the for loop and if conditions, with a simple implementation.

Result:
Code is easier to understand.
2014-03-18 15:59:58 +09:00
Bourne, Geoff
1c074eabe5 Fix limit computation of NIO ByteBuffers obtained via ReadOnlyByteBufferBuf.nioBuffer
Motivation:

When starting with a read-only NIO buffer, wrapping it in a ByteBuf,
and then later retrieving a re-wrapped NIO buffer the limit was getting
too short.

Modifications:

Changed ReadOnlyByteBufferBuf.nioBuffer(int,int) to compute the
limit in the same manner as the internalNioBuffer method.

Result:

Round-trip conversion from NIO to ByteBuf to NIO will work reliably.
2014-03-14 08:07:19 +01:00
Norman Maurer
ccd135df01 [maven-release-plugin] prepare for next development iteration 2014-02-24 15:39:26 +01:00
Norman Maurer
33587eb183 [maven-release-plugin] prepare release netty-4.0.17.Final 2014-02-24 15:37:31 +01:00
Trustin Lee
0fc66a411f The default buffer must be unpooled for backward compatibility
Mistakenly set to pooled while merging the changes from 4.1 and master.
2014-02-21 14:43:07 -08:00
Norman Maurer
66e2bb1e75 [maven-release-plugin] prepare for next development iteration 2014-02-19 03:41:24 +01:00
Norman Maurer
c466bb803d [maven-release-plugin] prepare release netty-4.0.16.Final 2014-02-19 03:36:54 +01:00
Trustin Lee
b18c8fe688 Determine the default allocator from system property
- Add ByteBufAllocator.DEFAULT
- The default allocator is 'unpooled'
2014-02-14 13:05:57 -08:00
Norman Maurer
f23d68b42f [#2187] Always do a volatile read on the refCnt 2014-02-07 09:23:16 +01:00
Norman Maurer
9bee78f91c Provide an optimized AtomicIntegerFieldUpdater, AtomicLongFieldUpdater and AtomicReferenceFieldUpdater 2014-02-06 20:08:45 +01:00
Norman Maurer
d67184b488 [maven-release-plugin] prepare for next development iteration 2014-01-21 08:18:32 +01:00