Motivation:
While benchmarking the native transport with gathering writes I noticed that it is quite slow. This is due the fact that we need to do a lot of array copies to get the buffers into the iov array.
Modification:
Introduce a new class calles IovArray which allows to fill buffers directly in a iov array that can be passed over to JNI without any array copies. This gives a nice optimization in terms of speed when doing gathering writes.
Result:
Big performance improvement when doing gathering writes. See the included benchmark...
Before:
[nmaurer@xxx]~% wrk/wrk -H 'Host: localhost' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' -H 'Connection: keep-alive' -d 120 -c 256 -t 16 --pipeline 256 http://xxx:8080/plaintext
Running 2m test @ http://xxx:8080/plaintext
16 threads and 256 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 23.44ms 16.37ms 259.57ms 91.77%
Req/Sec 181.99k 31.69k 304.60k 78.12%
346544071 requests in 2.00m, 46.48GB read
Requests/sec: 2887885.09
Transfer/sec: 396.59MB
With this change:
[nmaurer@xxx]~% wrk/wrk -H 'Host: localhost' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' -H 'Connection: keep-alive' -d 120 -c 256 -t 16 --pipeline 256 http://xxx:8080/plaintext
Running 2m test @ http://xxx:8080/plaintext
16 threads and 256 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 21.93ms 16.33ms 305.73ms 92.34%
Req/Sec 194.56k 33.75k 309.33k 77.04%
369617503 requests in 2.00m, 49.57GB read
Requests/sec: 3080169.65
Transfer/sec: 423.00MB
Motivation:
Due some race-condition while handling canellation of TimerTasks it was possibleto corrupt the linked-list structure that is represent by HashedWheelBucket and so produce a NPE.
Modification:
Fix the problem by adding another MpscLinkedQueue which holds the cancellation tasks and process them on each tick. This allows to use no synchronization / locking at all while introduce a latency of max 1 tick before the TimerTask can be GC'ed.
Result:
No more NPE
Motivation:
In ReplayingDecoder / ByteToMessageDecoder channelInactive(...) method we try to decode a last time and fire all decoded messages throw the pipeline before call ctx.fireChannelInactive(...). To keep the correct order of events we also need to call ctx.fireChannelReadComplete() if we read anything.
Modifications:
- Channel channelInactive(...) to call ctx.fireChannelReadComplete() if something was decoded
- Move out.recycle() to finally block
Result:
Correct order of events.
Related issue: #2688
- DnsClass and DnsType
- Make DnsClass and DnsType implement Comparable
- Optimize the message generation of IllegalArgumentException,
by pre-populating the list of the expected parameters
- Move the static methods up
- Relax the validation rule of DnsClass so that a user can define a
CLASS which is not listed in the RFC 1035
- valueOf(int) does not throw IllegalArgumentException anymore as long
as the specified value is an unsigned short.
- Rename create() and forName() to valueOf() so that they look like a
real enum
- Rename type() and clazz() to intValue() so that they conform to our
naming convention
- Add missing null checks in DnsEntry
Motivation:
DNS class and type were represented as integers rather than an enum or a
similar dedicated value type. This can be a potential source of a
parameter order bug which might be difficult to track down.
Modifications:
Add DnsClass and DnsType to replace integer parameters
Result:
Type safety and less error-proneness
Motivation:
Complicated code of Bzip2 tests with some unnecessary actions.
Modifications:
- Reduce size of BYTES_LARGE array of random test data for Bzip2 tests.
- Removed unnecessary creations of EmbeddedChannel instances in Bzip2 tests.
- Simplified tests in Bzip2DecoderTest which expect exception.
- Removed unnecessary testStreamInitialization() from Bzip2EncoderTest.
Result:
Reduced time to test the 'codec' package by 7 percent, simplified code of Bzip2 tests.
Motivation:
Duplicated code of integration tests for different compression codecs.
Modifications:
- Added abstract class IntegrationTest which contains common tests for any compression codec.
- Removed common tests from Bzip2IntegrationTest and LzfIntegrationTest.
- Implemented abstract methods of IntegrationTest in Bzip2IntegrationTest, LzfIntegrationTest and SnappyIntegrationTest.
Result:
Removed duplicated code of integration tests for compression codecs and simplified an addition of integration tests for new compression codecs.
Motivation:
ChannelOutboundBuffer is basically a circular array queue of its entry
objects. Once an entry is created in the array, it is never nulled out
to reduce the allocation cost.
However, because it is a circular queue, the array almost always ends up
with as many entry instances as the size of the array, regardless of the
number of pending writes.
At worst case, a channel might have only 1 pending writes at maximum
while creating 32 entry objects, where 32 is the initial capacity of the
array.
Modifications:
- Reduce the initial capacity of the circular array queue to 4.
- Make the initial capacity of the circular array queue configurable
Result:
We spend 4 times less memory for entry objects under certain
circumstances.
Motivation:
Sometimes we have a 'build time out' error because tests for bzip2 codec take a long time.
Modifications:
Removed cycles from Bzip2EncoderTest.testCompression(byte[]) and Bzip2DecoderTest.testDecompression(byte[]).
Result:
Reduced time to test the 'codec' package by 30 percent.
Motivation:
When decoding the NAME field in a DNS Resource Record, DnsResponseDecoder
can raise a NullPointerException if the NAME field contains a loop.
Modification:
Instead of raising an NPE, raise CorruptedFrameException so that the
exception itself has meaning.
Result:
Less confusing when a malformed DNS RR is received
Motivation:
NullPointerException is raised when a DNS response conrains a resource
record whose NAME is empty, which is the case for the authority section.
Modification:
Allow an empty name for DnsEntry. Only disallow an empty name for
DnsQuestion.
Result:
Fixes#2686
- Rewrite with linear probing, no state array, compaction at cleanup
- Optimize keys() and values() to not use reflection
- Optimize hashCode() and equals() for efficient iteration
- Fixed equals() to not return true for equals(null)
- Optimize iterator to not allocate new Entry at each next()
- Added toString()
- Added some new unit tests
Motivation:
Message from FindBugs:
This method performs synchronization an object that is an instance of a class from the java.util.concurrent package (or its subclasses). Instances of these classes have their own concurrency control mechanisms that are orthogonal to the synchronization provided by the Java keyword synchronized. For example, synchronizing on an AtomicBoolean will not prevent other threads from modifying the AtomicBoolean.
Such code may be correct, but should be carefully reviewed and documented, and may confuse people who have to maintain the code at a later date.
Modification:
Use synchronized(this)
Result:
Less confusing code
Motivation:
At the moment we use Get*ArrayElement all the time in the epoll transport which may be wasteful as the JVM may do a memory copy for this. For code-path that will get executed fast (without blocking) we should better make use of GetPrimitiveArrayCritical and ReleasePrimitiveArrayCritical as this signal the JVM that we not want to do any memory copy if not really needed. It is important to only do this on non-blocking code-path as this may even suspend the GC to disallow the JVM to move the arrays around.
See also http://docs.oracle.com/javase/7/docs/technotes/guides/jni/spec/functions.html#GetPrimitiveArrayCritical
Modification:
Make use of GetPrimitiveArrayCritical / ReleasePrimitiveArrayCritical as replacement for Get*ArrayElement / Release*ArrayElement where possible.
Result:
Better performance due less memory copies.
Motivation:
In EpollSocketchannel.writeBytesMultiple(...) we loop over all buffers to see if we need to adjust the readerIndex for incomplete writes. We can skip this if we know that everything was written (a.k.a complete write).
Modification:
Use fast-path if all bytes are written and so no need to loop over buffers
Result:
Fast write path for the average use.
Motivation:
At the moment NioSocketChannelOutboundBuffer.nioBuffers() / EpollSocketChannelOutboundBuffer.memoryAddresses() returns null if something is contained in the ChannelOutboundBuffer which is not a ByteBuf. This is a problem for two reasons:
1 - In the javadocs we state that it will never return null
2 - We may do a not optimal write as there may be things that could be written via gathering writes
Modifications:
Change NioSocketChannelOutboundBuffer.nioBuffers() / EpollSocketChannelOutboundBuffer.memoryAddresses() to never return null but have it contain all ByteBuffer that were found before the non ByteBuf. This way we can do a gathering write and also conform to the javadocs.
Result:
Better speed and also correct implementation in terms of the api.
Motivation:
DNS packets that contain pointers in a loop will cause
DnsResponseDecoder.readName() to infinite loop.
Modifications:
Fixed DnsResponseDecoder.readName() to detect when packets have loops
and return null instead.
Result:
It is no longer possible to cause Netty to infinite loop by sending it malformed
DNS packets with a loop.
Motivation:
Now Netty has a few problems with null values.
Modifications:
- Check HAProxyProxiedProtocol in HAProxyMessage constructor and throw NPE if it is null.
If HAProxyProxiedProtocol is null we will set AddressFamily as null. So we will get NPE inside checkAddress(String, AddressFamily) and it won't be easy to understand why addrFamily is null.
- Check File in DiskFileUpload.toString().
If File is null we will get NPE when calling toString() method.
- Check Result<String> in MqttDecoder.decodeConnectionPayload(...).
If !mqttConnectVariableHeader.isWillFlag() || !mqttConnectVariableHeader.hasUserName() || !mqttConnectVariableHeader.hasPassword() we will get NPE when we will try to create new instance of MqttConnectPayload.
- Check Unsafe before calling unsafe.getClass() in PlatformDependent0 static block.
- Removed unnecessary null check in WebSocket08FrameEncoder.encode(...).
Because msg.content() can not return null.
- Removed unnecessary null check in DefaultStompFrame(StompCommand) constructor.
Because we have this check in the super class.
- Removed unnecessary null checks in ConcurrentHashMapV8.removeTreeNode(TreeNode<K,V>).
- Removed unnecessary null check in OioDatagramChannel.doReadMessages(List<Object>).
Because tmpPacket.getSocketAddress() always returns new SocketAddress instance.
- Removed unnecessary null check in OioServerSocketChannel.doReadMessages(List<Object>).
Because socket.accept() always returns new Socket instance.
- Pass Unpooled.buffer(0) instead of null inside CloseWebSocketFrame(boolean, int) constructor.
If we will pass null we will get NPE in super class constructor.
- Added throw new IllegalStateException in GlobalEventExecutor.awaitInactivity(long, TimeUnit) if it will be called before GlobalEventExecutor.execute(Runnable).
Because now we will get NPE. IllegalStateException will be better in this case.
- Fixed null check in OpenSslServerContext.setTicketKeys(byte[]).
Now we throw new NPE if byte[] is not null.
Result:
Added new null checks when it is necessary, removed unnecessary null checks and fixed some NPE problems.
Motivation:
Fix some typos in Netty.
Modifications:
- Fix potentially dangerous use of non-short-circuit logic in Recycler.transfer(Stack<?>).
- Removed double 'the the' in javadoc of EmbeddedChannel.
- Write to log an exception message if we can not get SOMAXCONN in the NetUtil's static block.
Motivation:
Fixed founded mistakes in compression codecs.
Modifications:
- Changed return type of ZlibUtil.inflaterException() from CompressionException to DecompressionException
- Updated @throws in javadoc of JZlibDecoder to throw DecompressionException instead of CompressionException
- Fixed JdkZlibDecoder to throw DecompressionException instead of CompressionException
- Removed unnecessary empty lines in JdkZlibEncoder and JZlibEncoder
- Removed public modifier from Snappy class
- Added MAX_UNCOMPRESSED_DATA_SIZE constant in SnappyFramedDecoder
- Used in.readableBytes() instead of (in.writerIndex() - in.readerIndex()) in SnappyFramedDecoder
- Added private modifier for enum ChunkType in SnappyFramedDecoder
- Fixed potential bug (sum overflow) in Bzip2HuffmanAllocator.first(). For more info, see http://googleresearch.blogspot.ru/2006/06/extra-extra-read-all-about-it-nearly.html
Result:
Fixed sum overflow in Bzip2HuffmanAllocator, improved exceptions in ZlibDecoder implementations, hid Snappy class
Modifications:
- Added a static modifier for CompositeByteBuf.Component.
This class is an inner class, but does not use its embedded reference to the object which created it. This reference makes the instances of the class larger, and may keep the reference to the creator object alive longer than necessary.
- Removed unnecessary boxing/unboxing operations in HttpResponseDecoder, RtspResponseDecoder, PerMessageDeflateClientExtensionHandshaker and PerMessageDeflateServerExtensionHandshaker
A boxed primitive is created from a String, just to extract the unboxed primitive value.
- Removed unnecessary 3 times calculations in DiskAttribute.addContent(...).
- Removed unnecessary checks if file exists before call mkdirs() in NativeLibraryLoader and PlatformDependent.
Because the method mkdirs() has this check inside.
- Removed unnecessary `instanceof AsciiString` check in StompSubframeAggregator.contentLength(StompHeadersSubframe) and StompSubframeDecoder.getContentLength(StompHeaders, long).
Because StompHeaders.get(CharSequence) always returns java.lang.String.
Motivation:
We create a new CompactObjectInputStream with ByteBufInputStream in ObjectDecoder.decode(...) method and don't close this InputStreams before return statement.
Modifications:
Save link to the ObjectInputStream and close it before return statement.
Result:
Close InputStreams and clean up unused resources. It will be better for GC.
Motivation:
In the previous fix for #2667 I did introduce a bit overhead by calling setEpollOut() too often.
Modification:
Only call setEpollOut() if really needed and remove unused code.
Result:
Less overhead when saturate network.
Motivation:
As a DatagramChannel supports to write to multiple remote peers we must not close the Channel once a IOException accours as this error may be only valid for one remote peer.
Modification:
Continue writing on IOException.
Result:
DatagramChannel can be used even after an IOException accours during writing.
Motivation:
We need to continue write until we hit EAGAIN to make sure we not see an starvation
Modification:
Write until EAGAIN is returned
Result:
No starvation when using native transport with ET.
Motivation:
Because of a missing return statement we may produce a NPE when try to fullfill the connect ChannelPromise when it was fullfilled before.
Modification:
Add missing return statement.
Result:
No more NPE.
Motivation:
The handling of IOV_MAX was done in JNI code base which makes stuff really complicated to maintain etc.
Modifications:
Move handling of IOV_MAX to java code to simplify stuff
Result:
Cleaner code.
Motivation:
In our nio implementation we use write-spinning for maximize throughput, but in the native implementation this is not used.
Modification:
Respect writeSpinCount in native transport.
Result:
Better throughput
Motivation:
LZF compression codec provides sending and receiving data encoded by very fast LZF algorithm.
Modifications:
- Added Compress-LZF library which implements LZF algorithm
- Implemented LzfEncoder which extends MessageToByteEncoder and provides compression of outgoing messages
- Added tests to verify the LzfEncoder and how it can compress data for the next uncompression using the original library
- Implemented LzfDecoder which extends ByteToMessageDecoder and provides uncompression of incoming messages
- Added tests to verify the LzfDecoder and how it can uncompress data after compression using the original library
- Added integration tests for LzfEncoder/Decoder
Result:
Full LZF compression codec which can compress/uncompress data using LZF algorithm.
Motivation:
When we receive an incomplete WebSocketFrame we need to make sure to wait for more data. Because we not did this we could produce a NPE.
Modification:
Make sure we not try to add null into the RecyclableArrayList
Result:
no more NPE on incomplete frames.
Motivation:
I introduced ensureAccessible() class as part of 6c47cc9711 in some places. Unfortunally I also added some where these are not needed and so caused a performance regression.
Modification:
Remove calls where not needed.
Result:
Fixed performance regression.
Motivation:
I introduced range checks as part of 6c47cc9711 in some places. Unfortunally I also added some where these are not needed and so caused a performance regression.
Modification:
Remove range checks where not needed
Result:
Fixed performance regression.
Motivations:
In our new version of HWT we used some kind of lazy cancelation of timeouts by put them back in the queue and let them pick up on the next tick. This multiple problems:
- we may corrupt the MpscLinkedQueue if the task is used as tombstone
- this sometimes lead to an uncessary delay especially when someone did executed some "heavy" logic in the TimeTask
Modifications:
Use a Lock per HashedWheelBucket for save and fast removal.
Modifications:
Cancellation of tasks can be done fast and so stuff can be GC'ed and no more infinite-loop possible
Motivation:
HTTP header validation can be expensive so we should allow to disable it like we do in HttpObjectDecoder.
Modification:
Add constructor argument to disable validation.
Result:
Performance improvement
Motivation:
HttpObjectAggregator currently creates a new FullHttpResponse / FullHttpRequest for each message it needs to aggregate. While doing so it also creates 2 DefaultHttpHeader instances (one for the headers and one for the trailing headers). This is bad for two reasons:
- More objects are created then needed and also populate the headers is not for free
- Headers may get validated even if the validation was disabled in the decoder
Modification:
- Wrap the previous created HttpResponse / HttpRequest and so reuse the original HttpHeaders
- Reuse the previous created trailing HttpHeader.
- Fix a bug where the trailing HttpHeader was incorrectly mixed in the headers.
Result:
- Less GC
- Faster HttpObjectAggregator implementation
Motivation:
It's not always the case that there is another handler in the pipeline that will intercept the exceptionCaught event because sometimes users just sub-class. In this case the exception will just hit the end of the pipeline.
Modification:
Throw the TooLongFrameException so that sub-classes can handle it in the exceptionCaught(...) method directly.
Result:
Sub-classes can correctly handle the exception,
Motivation:
Collect all bit-level read operations in one class is better. And now it's easy to use not only in Bzip2Decoder. For example, in Bzip2HuffmanStageDecoder.
Modifications:
Created a new class - Bzip2BitReader which provides bit-level reads.
Removed bit-level read operations from Bzip2Decoder.
Improved javadoc.
Result:
Bzip2BitReader allows the reading of single bit booleans, bit strings of arbitrary length (up to 24 bits), and bit aligned 32-bit integers.
Motivation:
Currently when Native.writev(...) is used it is possible to see a JVM segfault because the offset is updated to early.
Modification:
Only update the offset once it is safe to do so.
Result:
No more segfault
Motivation:
epoll transport fails on gathering write of more then 1024 buffers. As linux supports max. 1024 iov entries when calling writev(...) the epoll transport throws an exception.
Thanks again to @blucas to provide me with a reproducer and so helped me to understand what the issue is.
Modifications:
Make sure we break down the writes if to many buffers are uses for gathering writes.
Result:
Gathering writes work with any number of buffers
Motivation:
CompositeByteBuf.deallocate generates unnecessary GC pressure when using the 'foreach' loop, as a 'foreach' loop creates an iterator when looping.
Modification:
Convert 'foreach' loop into regular 'for' loop.
Result:
Less GC pressure (and possibly more throughput) as the 'for' loop does not create an iterator
Motivation:
HttpOrSpdyChooser can be simplified so the user not need to implement getProtocol(...) method.
Modification:
Add implementation for the method. The user can override it if necessary.
Result:
Easier usage of HttpOrSpdyChooser.