Motivation:
Provide a faster ThreadLocal implementation
Modification:
Add a "FastThreadLocal" which uses an EnumMap and a predefined fixed set of possible thread locals (all of the static instances created by netty) that is around 10-20% faster than standard ThreadLocal in my benchmarks (and can be seen having an effect in the direct PooledByteBufAllocator benchmark that uses the DEFAULT ByteBufAllocator which uses this FastThreadLocal, as opposed to normal instantiations that do not, and in the new RecyclableArrayList benchmark);
Result:
Improved performance
Motivation:
Our Unsafe*ByteBuf implementation always invert bytes when the native ByteOrder is LITTLE_ENDIAN (this is true on intel), even when the user calls order(ByteOrder.LITTLE_ENDIAN). This is not optimal for performance reasons as the user should be able to set the ByteOrder to LITTLE_ENDIAN and so write bytes without the extra inverting.
Modification:
- Introduce a new special SwappedByteBuf (called UnsafeDirectSwappedByteBuf) that is used by all the Unsafe*ByteBuf implementation and allows to write without inverting the bytes.
- Add benchmark
- Upgrade jmh to 0.8
Result:
The user is be able to get the max performance even on servers that have ByteOrder.LITTLE_ENDIAN as their native ByteOrder.
Motivation:
Allocating a single buffer and releasing it repetitively for a benchmark will not involve the realistic execution path of the allocators.
Modifications:
Keep the last 8192 allocations and release them randomly.
Result:
We are now getting the result close to what we got with caliper.
- Rename directbyDefault to preferDirect
- Add a system property 'io.netty.prederDirect' to allow a user from changing the preference on launch-time
- Merge UnpooledByteBufAllocator.DEFAULT_BY_* to DEFAULT
- Fixes#826
Unsafe.isFreed(), free(), suspend/resumeIntermediaryAllocations() are not that dangerous. internalNioBuffer() and internalNioBuffers() are dangerous but it seems like nobody is using it even inside Netty. Removing those two methods also removes the necessity to keep Unsafe interface at all.
- Add common optimization options when launching a new JVM to run a benchmark
- Fix a bug where a benchmark report is uploaded twice
- Simplify pom.xml and move the build instruction messages to DefaultBenchmark
- Print an empty line to prettify the output
This pull request introduces the new default ByteBufAllocator implementation based on jemalloc, with a some differences:
* Minimum possible buffer capacity is 16 (jemalloc: 2)
* Uses binary heap with random branching (jemalloc: red-black tree)
* No thread-local cache yet (jemalloc has thread-local cache)
* Default page size is 8 KiB (jemalloc: 4 KiB)
* Default chunk size is 16 MiB (jemalloc: 2 MiB)
* Cannot allocate a buffer bigger than the chunk size (jemalloc: possible) because we don't have control over memory layout in Java. A user can work around this issue by creating a composite buffer, but it's not always a feasible option. Although 16 MiB is a pretty big default, a user's handler might need to deal with the bounded buffers when the user wants to deal with a large message.
Also, to ensure the new allocator performs good enough, I wrote a microbenchmark for it and made it a dedicated Maven module. It uses Google's Caliper framework to run and publish the test result (example)
Miscellaneous changes:
* Made some ByteBuf implementations public so that those who implements a new allocator can make use of them.
* Added ByteBufAllocator.compositeBuffer() and its variants.
* ByteBufAllocator.ioBuffer() creates a buffer with 0 capacity.