Related issue: #2764
Motivation:
EpollSocketChannel.writeFileRegion() does not handle the case where the
position of a FileRegion is non-zero properly.
Modifications:
- Improve SocketFileRegionTest so that it tests the cases where the file
transfer begins from the middle of the file
- Add another jlong parameter named 'base_off' so that we can take the
position of a FileRegion into account
Result:
Improved test passes. Corruption is gone.
Related issue: #2508
Motivation:
The '<exec/>' task takes unnecessarily long time due to a known issue:
- https://issues.apache.org/bugzilla/show_bug.cgi?id=54128
Modifications:
- Reduce the number of '<exec/>' tasks for faster build
- Use '<propertyregex/>' to extract the output
Result:
Slightly faster build
Related issue: #2028
Motivation:
Some copiedBuffer() methods in Unpooled allocated a direct buffer. An
allocation of a direct buffer is an expensive operation, and thus should
be avoided for unpooled buffers.
Modifications:
- Use heap buffers in all copiedBuffer() methods
Result:
Unpooled.copiedBuffers() are less expensive now.
Motivation:
The previous fix did disable the caching of ByteBuffers completely which can cause performance regressions. This fix makes sure we use nioBuffers() for all writes in NioSocketChannel and so prevent data-corruptions. This is still kind of a workaround which will be replaced by a more fundamental fix later.
Modifications:
- Revert 4059c9f354
- Use nioBuffers() for all writes to prevent data-corruption
Result:
No more data-corruption but still retain the original speed.
Motivation:
At the moment we expand the ByteBuffer[] when we have more then 1024 ByteBuffer to write and replace the stored instance in its FastThreadLocal. This is not needed and may even harm performance on linux as IOV_MAX is 1024 and so this may cause the JVM to do an array copy.
Modifications:
Just exit the nioBuffers() method if we can not fit more ByteBuffer in the array. This way we will pick them up on the next call.
Result:
Remove uncessary array copy and simplify the code.
Motivation:
We cache the ByteBuffers in ChannelOutboundBuffer.nioBuffers() for the Entries in the ChannelOutboundBuffer to reduce some overhead. The problem is this can lead to data-corruption if an incomplete write happens and next time we try to do a non-gathering write.
To fix this we should remove the caching which does not help a lot anyway and just make the code buggy.
Modifications:
Remove the caching of ByteBuffers.
Result:
No more data-corruption.
Motivation:
Currently Traffic Shaping is using 1 timer only and could lead to
"partial" wrong bandwidth computation when "short" time occurs between
adding used bytes and when the TrafficCounter updates itself and finally
when the traffic is computed.
Indeed, the TrafficCounter is updated every x delay and it is at the
same time saved into "lastXxxxBytes" and set to 0. Therefore, when one
request the counter, it first updates the TrafficCounter with the added
used bytes. If this value is set just before the TrafficCounter is
updated, then the bandwidth computation will use the TrafficCounter with
a "0" value (this value being reset once the delay occurs). Therefore,
the traffic shaping computation is wrong in rare cases.
Secondly the traffic shapping should avoid if possible the "Timeout"
effect by not stopping reading or writing more than a maxTime, this
maxTime being less than the TimeOut limit.
Thirdly the traffic shapping in read had an issue since the readOp was
not set but should, turning in no read blocking from socket point of
view. (see #2696)
Take into account setAutoRead(boolean) setting directly
by the user in the program external to this handler.
Modifications:
The TrafficCounter has 2 new methods that compute the time to wait
according to read or write) using in priority the currentXxxxBytes (as
before), but could used (if current is at 0) the lastXxxxxBytes, and
therefore having more chance to take into account the real traffic.
Moreover the Handler could change the default "max time to wait", which
is by default set to half of "standard" Time Out (30s:2 = 15s).
Finally we add the setAutoRead(boolean) accordingly to the situation, as
proposed in #2696 (the original pull request is in error for unknown
reason so this merge).
Result:
The Traffic Shaping is better take into account (no 0 value when it
shouldn't) and it tries to not block traffic more than Time Out event.
Moreover the read is really stopped from socket point of view.
This version is similar to #2388 and #2450.
This version is for V4.0, and includes the #2696 pull request to ease
the merge process.
The test minimizes time check by reducing to 66ms steps (50s total).
Motivation:
Sometimes ChannelHandler need to queue writes to some point and then process these. We currently have no datastructure for this so the user will use an Queue or something like this. The problem is with this Channel.isWritable() will not work as expected and so the user risk to write to fast. That's exactly what happened in our SslHandler. For this purpose we need to add a special datastructure which will also take care of update the Channel and so be sure that Channel.isWritable() works as expected.
Modifications:
- Add PendingWriteQueue which can be used for this purpose
- Make use of PendingWriteQueue in SslHandler
Result:
It is now possible to queue writes in a ChannelHandler and still have Channel.isWritable() working as expected. This also fixes#2752.
In Netty 3, downstream writes of SPDY data frames and upstream reads of
SPDY window udpate frames occur on different threads.
When receiving a window update frame, we synchronize on a java object
(SpdySessionHandler::flowControlLock) while sending any pending writes
that are now able to complete.
When writing a data frame, we check the send window size to see if we
are allowed to write it to the socket, or if we have to enqueue it as a
pending write. To prevent races with the window update frame, this is
also synchronized on the same SpdySessionHandler::flowControlLock.
In Netty 4, upstream and downstream operations on any given channel now
occur on the same thread. Since java locks are re-entrant, this now
allows downstream writes to occur while processing window update frames.
In particular, when we receive a window update frame that unblocks a
pending write, this write completes which triggers an event notification
on the response, which in turn triggers a write of a data frame. Since
this is on the same thread it re-enters the lock and modifies the send
window. When the write completes, we continue processing pending writes
without knowledge that the window size has been decremented.
Motivation:
The calculation of the max wait time for HashedWheelTimerTest.testExecutionOnTime() was wrong and so the test sometimes failed.
Modifications:
Fix the max wait time.
Result:
No more test-failures
Related issue: #2743
Motivation:
When there are more than one stream with the same priority, the set
returned by SpdySession.getActiveStream() will not include all of them,
because it uses TreeSet and only compares the priority of streams. If
two different streams have the same priority, one of them will be
discarded by TreeSet.
Modification:
- Rename getActiveStreams() to activeStreams()
- Replace PriorityComparator with StreamComparator
Result:
Two different streams with the same priority are compared correctly.
Motivation:
We forgot to do a null check on the cause parameter of
ChannelFuture.setFailure(cause)
Modifications:
Add a null check
Result:
Fixed issue: #2728
Motivation:
If the requests contains uri parameters but not path the HttpRequestEncoder does produce an invalid uri while try to add the missing path.
Modifications:
Correctly handle the case of uri with paramaters but no path.
Result:
HttpRequestEncoder produce correct uri in all cases.
Motivation:
While trying to merge our ChannelOutboundBuffer changes we've made last
week, I realized that we have quite a bit of conflicting changes at 4.1
and master. It was primarily because we added
ChannelOutboundBuffer.beforeAdd() and moved some logic there, such as
direct buffer conversion.
However, this is not possible with the changes we've made for 4.0. We
made ChannelOutboundBuffer final for example.
Maintaining multiple branch is already getting painful and having
different core will make it even worse, so I think we should keep the
differences between 4.0 and other branches minimal.
Modifications:
- Move ChannelOutboundBuffer.safeRelease() to ReferenceCountUtil
- Add ByteBufUtil.threadLocalBuffer()
- Backported from ThreadLocalPooledDirectByteBuf
- Make most methods in AbstractUnsafe final
- Add AbstractChannel.filterOutboundMessage() so that a transport can
convert a message to another (e.g. heap -> off-heap), and also
reject unsupported messages
- Move all direct buffer conversions to filterOutboundMessage()
- Move all type checks to filterOutboundMessage()
- Move AbstractChannel.checkEOF() to OioByteStreamChannel, because it's
the only place it is used at all
- Remove ChannelOutboundBuffer.current(Object), because it's not used
anymore
- Add protected direct buffer conversion methods to AbstractNioChannel
and AbstractEpollChannel so that they can be used by their subtypes
- Update all transport implementations according to the changes above
Result:
- The missing extension point in 4.0 has been added.
- AbstractChannel.filterOutboundMessage()
- Thanks to the new extension point, we moved all transport-specific
logic from ChannelOutboundBuffer to each transport implementation
- We can copy most of the transport implementations in 4.0 to 4.1 and
master now, so that we have much less merge conflict when we modify
the core.
Related issue: #2733
Motivation:
Unlike OpenSsl, Epoll lacks a couple useful availability checker
methods:
- ensureAvailability()
- unavailabilityCause()
Modifications:
Add missing methods
Result:
More ways to check the availability and to get the cause of
unavailability programatically.
Motivation:
We expose ChannelOutboundBuffer in Channel.Unsafe but it is not possible
to create a new ChannelOutboundBuffer without an AbstractChannel. This
makes it impossible to write a Channel implementation that does not
extend AbstractChannel.
Modifications:
- Change ChannelOutboundBuffer to take a Channel as constructor argument.
- Add javadocs
Result:
ChannelOutboundBuffer can be used with a Channel implemention that does
not extend AbstractChannel.
Motivation:
Currently it is not possible to load an encrypted private key when
creating a JDK based SSL server context.
Modifications:
- Added static method to JdkSslServerContext which handles key spec generation for (encrypted) private keys and make use of it.
-Added tests for creating a SSL server context based on a (encrypted)
private key.
Result:
It is now possible to create a JDK based SSL server context with an
encrypted (password protected) private key.
Motivation:
Our ChannelOutboundBuffer implementation is not based on ArrayDeque anymore so we can remove the license notice for it.
Modifications:
Remove license of deque and entry in NOTICE.
Result:
Cleaned up licenses
Motivation:
When a ChannelOutboundBuffer contains a series of entries whose messages
are all empty buffers, EpollSocketChannel sometimes fails to remove
them. As a result, the result of the write(EmptyByteBuf) is never
notified, making the user application hang.
Modifications:
- Add ChannelOutboundBuffer.removeBytes(long) method that updates the
progress of the entries and removes them as much as the specified
number of written bytes. It also updates the reader index of
partially flushed buffer.
- Make both NioSocketChannel and EpollSocketChannel use it to reduce
code duplication
- Replace EpollSocketChannel.updateOutboundBuffer()
- Refactor EpollSocketChannel.doWrite() for simplicity
- Split doWrite() into doWriteSingle() and doWriteMultiple()
- Do not add a zero-length buffer to IovArray
- Do not perform any real I/O when the size of IovArray is 0
Result:
Another regression is gone.
Motivation:
As /proc/sys/net/core/somaxconn does not exists on non-linux platforms you see a noisy stacktrace when debug level is enabled while the static method of NetUtil is executed.
Modifications:
Check if the file exists before try to parse it.
Result:
Less noisy logging on non-linux platforms.
Related issue: #2717, #2710, #2704, #2693
Motivation:
When ChannelOutboundBuffer.nioBuffers() iterates over the linked list of
entries, it is not supposed to visit unflushed entries, but it does.
Modifications:
- Make sure ChannelOutboundBuffer.nioBuffers() stops the iteration before
it visits an unflushed entry
- Add isFlushedEntry() to reduce the chance of the similar mistakes
Result:
Another regression is gone.
Motivation:
ChannelOutboundBuffer.forEachFlushedMessage() visits even an unflushed
messages.
Modifications:
Stop the loop if the currently visiting entry is unflushedEntry.
Result:
forEachFlushedMessage() behaves correctly.
- ChannelOutboundBuffer.Entry.buffers -> bufs for consistency
- Make Native.IOV_MAX final because it's a constant
- Naming changes
- FlushedMessageProcessor -> MessageProcessor just in case we can
reuse it for unflushed messages in the future
- Add ChannelOutboundBuffer.Entry.recycle() that does not return the
next entry, and use it wherever possible
- Javadoc clean-up
Motivation:
While benchmarking the native transport, I noticed that gathering write
is not as fast as expected. It was due to the fact that we have to do a
lot of array copies to put the buffer addresses into the iovec struct
array.
Modifications:
Introduce a new class called IovArray, which allows to fill buffers
directly into an off-heap array of iovec structs, so that it can be
passed over to JNI without any extra array copies.
Result:
Big performance improvement when doing gathering writes:
Before:
[nmaurer@xxx]~% wrk/wrk -H 'Host: localhost' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' -H 'Connection: keep-alive' -d 120 -c 256 -t 16 --pipeline 256 http://xxx:8080/plaintext
Running 2m test @ http://xxx:8080/plaintext
16 threads and 256 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 23.44ms 16.37ms 259.57ms 91.77%
Req/Sec 181.99k 31.69k 304.60k 78.12%
346544071 requests in 2.00m, 46.48GB read
Requests/sec: 2887885.09
Transfer/sec: 396.59MB
After:
[nmaurer@xxx]~% wrk/wrk -H 'Host: localhost' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' -H 'Connection: keep-alive' -d 120 -c 256 -t 16 --pipeline 256 http://xxx:8080/plaintext
Running 2m test @ http://xxx:8080/plaintext
16 threads and 256 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 21.93ms 16.33ms 305.73ms 92.34%
Req/Sec 194.56k 33.75k 309.33k 77.04%
369617503 requests in 2.00m, 49.57GB read
Requests/sec: 3080169.65
Transfer/sec: 423.00MB
Motivation:
73dfd7c01b introduced various test
failures because:
- EpollSocketChannel.doWrite() raised a NullPointerException when
notifying the write progress.
- ChannelOutboundBuffer.nioBuffers() did not expand the internal array
when the pending entries contained more than 1024 buffers, dropping
the remainder.
Modifications:
- Fix the NPE in EpollSocketChannel by removing an unnecessary progress
update
- Expand the thread-local buffer array if there is not enough room,
which was the original behavior dropped by the offending commit
Result:
Regression is gone.
Motiviation:
ChannelOuboundBuffer uses often too much memory. This is especially a problem if you want to serve a lot of connections. This is due the fact that it uses 2 arrays internally. One if used as a circular buffer and store the Entries that are never released (ChannelOutboundBuffer is pooled) and one is used to hold the ByteBuffers that are used for gathering writes.
Modifications:
Rewrite ChannelOutboundBuffer to remove these two arrays by:
- Make Entry recyclable and use it as linked Node
- Remove the circular buffer which was used for the Entries as we use a Linked-List like structure now
- Remove the array that did hold the ByteBuffers and replace it by an ByteBuffer array that is hold by a FastThreadLocal. We use a fixed capacity of 1024 here which is fine as we share these anyway.
- ChannelOuboundBuffer is not recyclable anymore as it is now a "light-weight" object. We recycle the internally used Entries instead.
Result:
Less memory footprint and resource usage. Performance seems to be a bit better but most likely as we not need to expand any arrays anymore.
Benchmark before change:
[nmaurer@xxx]~% wrk/wrk -H 'Host: localhost' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' -H 'Connection: keep-alive' -d 120 -c 256 -t 16 --pipeline 256 http://xxx:8080/plaintext
Running 2m test @ http://xxx:8080/plaintext
16 threads and 256 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 26.88ms 67.47ms 1.26s 97.97%
Req/Sec 191.81k 28.22k 255.63k 83.86%
364806639 requests in 2.00m, 48.92GB read
Requests/sec: 3040101.23
Transfer/sec: 417.49MB
Benchmark after change:
[nmaurer@xxx]~% wrk/wrk -H 'Host: localhost' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' -H 'Connection: keep-alive' -d 120 -c 256 -t 16 --pipeline 256 http://xxx:8080/plaintext
Running 2m test @ http://xxx:8080/plaintext
16 threads and 256 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 22.22ms 17.22ms 301.77ms 90.13%
Req/Sec 194.98k 41.98k 328.38k 70.50%
371816023 requests in 2.00m, 49.86GB read
Requests/sec: 3098461.44
Transfer/sec: 425.51MB
Motivation:
We sometimes not use the correct exception message when throw it from the native code.
Modifications:
Fixed the message.
Result:
Correct message in exception
Motivation:
We have some inconsistency when handling writes. Sometimes we call ChannelOutboundBuffer.progress(...) also for complete writes and sometimes not. We should call it always.
Modifications:
Correctly call ChannelOuboundBuffer.progress(...) for complete and incomplete writes.
Result:
Consistent behavior
Motivation:
Due some race-condition while handling canellation of TimerTasks it was possibleto corrupt the linked-list structure that is represent by HashedWheelBucket and so produce a NPE.
Modification:
Fix the problem by adding another MpscLinkedQueue which holds the cancellation tasks and process them on each tick. This allows to use no synchronization / locking at all while introduce a latency of max 1 tick before the TimerTask can be GC'ed.
Result:
No more NPE
Motivation:
In ReplayingDecoder / ByteToMessageDecoder channelInactive(...) method we try to decode a last time and fire all decoded messages throw the pipeline before call ctx.fireChannelInactive(...). To keep the correct order of events we also need to call ctx.fireChannelReadComplete() if we read anything.
Modifications:
- Channel channelInactive(...) to call ctx.fireChannelReadComplete() if something was decoded
- Move out.recycle() to finally block
Result:
Correct order of events.
Motivation:
ChannelOutboundBuffer is basically a circular array queue of its entry
objects. Once an entry is created in the array, it is never nulled out
to reduce the allocation cost.
However, because it is a circular queue, the array almost always ends up
with as many entry instances as the size of the array, regardless of the
number of pending writes.
At worst case, a channel might have only 1 pending writes at maximum
while creating 32 entry objects, where 32 is the initial capacity of the
array.
Modifications:
- Reduce the initial capacity of the circular array queue to 4.
- Make the initial capacity of the circular array queue configurable
Result:
We spend 4 times less memory for entry objects under certain
circumstances.
- Rewrite with linear probing, no state array, compaction at cleanup
- Optimize keys() and values() to not use reflection
- Optimize hashCode() and equals() for efficient iteration
- Fixed equals() to not return true for equals(null)
- Optimize iterator to not allocate new Entry at each next()
- Added toString()
- Added some new unit tests
Motivation:
Message from FindBugs:
This method performs synchronization an object that is an instance of a class from the java.util.concurrent package (or its subclasses). Instances of these classes have their own concurrency control mechanisms that are orthogonal to the synchronization provided by the Java keyword synchronized. For example, synchronizing on an AtomicBoolean will not prevent other threads from modifying the AtomicBoolean.
Such code may be correct, but should be carefully reviewed and documented, and may confuse people who have to maintain the code at a later date.
Modification:
Use synchronized(this)
Result:
Less confusing code
Motivation:
At the moment we use Get*ArrayElement all the time in the epoll transport which may be wasteful as the JVM may do a memory copy for this. For code-path that will get executed fast (without blocking) we should better make use of GetPrimitiveArrayCritical and ReleasePrimitiveArrayCritical as this signal the JVM that we not want to do any memory copy if not really needed. It is important to only do this on non-blocking code-path as this may even suspend the GC to disallow the JVM to move the arrays around.
See also http://docs.oracle.com/javase/7/docs/technotes/guides/jni/spec/functions.html#GetPrimitiveArrayCritical
Modification:
Make use of GetPrimitiveArrayCritical / ReleasePrimitiveArrayCritical as replacement for Get*ArrayElement / Release*ArrayElement where possible.
Result:
Better performance due less memory copies.
Motivation:
In EpollSocketchannel.writeBytesMultiple(...) we loop over all buffers to see if we need to adjust the readerIndex for incomplete writes. We can skip this if we know that everything was written (a.k.a complete write).
Modification:
Use fast-path if all bytes are written and so no need to loop over buffers
Result:
Fast write path for the average use.
Motivation:
At the moment ChannelOutboundBuffer.nioBuffers() returns null if something is contained in the ChannelOutboundBuffer which is not a ByteBuf. This is a problem for two reasons:
1 - In the javadocs we state that it will never return null
2 - We may do a not optimal write as there may be things that could be written via gathering writes
Modifications:
Change ChannelOutboundBuffer.nioBuffers() to never return null but have it contain all ByteBuffer that were found before the non ByteBuf. This way we can do a gathering write and also conform to the javadocs.
Result:
Better speed and also correct implementation in terms of the api.