Motivation:
Before struct's were passed per value and not pointer. This did enforce a memory copy which is not needed.
Modifications:
- Use "const struct....*" as replacement
Result:
No more unnecessary memory copies
Motivation:
When create address from filedescriptor we may use incorrect byte order and so end up with an incorrect InetAddress.
Modification:
Not manually shift bytes
Result:
Correct address in all cases.
Motivation:
There is a small race in the native transport where an accept(...) may success but a later try to obtain the remote address from the fd may fail is the fd is already closed.
Modifications:
Let accept(...) directly set the remote address.
Result:
No more race possible.
Motivation:
In the native transport we should throw a pre-instanced IOException on connection reset while reading.
Modifications:
Correctly throw pre-instanced IOException when ECONNRESET is received
Result:
Less overhead on connection reset
Motivation:
As we plan to have other native transports soon (like a kqueue transport) we should move unix classes/interfaces out of the epoll package so we
introduce other implementations without breaking stuff before the next stable release.
Modifications:
Create a new io.netty.channel.unix package and move stuff over there.
Result:
Possible to introduce other native impls beside epoll.
Motivation:
Sometimes it's useful to be able to create a Epoll*Channel from an existing file descriptor. This is especially helpful if you integrade some c/jni code.
Modifications:
- Add extra constructor to Epoll*Channel implementations that take a FileDescriptor as an argument
- Make Rename EpollFileDescriptor to NativeFileDescriptor and make it public
- Also ensure we obtain the correct remote/local address when create a Channel from a FileDescriptor
Result:
It's now possible to create a FileDescriptor and instance a Epoll*Channel via it.
Motivation:
Some of the methods are frequently called and so should be inlined if possible.
Modifications:
Give the compiler a hint that we want to inline these methods.
Result:
Better performance if inlined.
Motivation:
Older linux kernels have problems handling a large value for epoll_wait(...) and so wait for ever.
Modifications:
Adjust timeout on the fly if a too big value is passed in.
Result:
Correctly works also on older kernels.
Motivation:
Before we used a long[] to store the ready events, this had a few problems and limitations:
- An extra loop was needed to translate between epoll_event and our long
- JNI may need to do extra memory copy if the JVM not supports pinning
- More branches
Modifications:
- Introduce a EpollEventArray which allows to directly write in a struct epoll_event* and pass it to epoll_wait.
Result:
Better speed when using native transport, as shown in the benchmark.
Before:
[xxx@xxx wrk]$ ./wrk -H 'Connection: keep-alive' -d 120 -c 256 -t 16 -s scripts/pipeline-many.lua http://xxx:8080/plaintext
Running 2m test @ http://xxx:8080/plaintext
16 threads and 256 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 14.56ms 8.64ms 117.15ms 80.58%
Req/Sec 286.17k 38.71k 421.48k 68.17%
546324329 requests in 2.00m, 73.78GB read
Requests/sec: 4553438.39
Transfer/sec: 629.66MB
After:
[xxx@xxx wrk]$ ./wrk -H 'Connection: keep-alive' -d 120 -c 256 -t 16 -s scripts/pipeline-many.lua http://xxx:8080/plaintext
Running 2m test @ http://xxx:8080/plaintext
16 threads and 256 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 14.12ms 8.69ms 100.40ms 83.08%
Req/Sec 294.79k 40.23k 472.70k 66.75%
555997226 requests in 2.00m, 75.08GB read
Requests/sec: 4634343.40
Transfer/sec: 640.85MB
Motivation:
Netty uses edge-triggered epoll by default for performance reasons. The downside here is that a messagesPerRead limit can not be enforced correctly, as we need to consume everything from the channel when notified.
Modification:
- Allow to switch epoll modes before channel is registered
- Some refactoring to share more code
Result:
It's now possible to switch epoll mode.
Motiviation:
When using domain sockets on linux it is supported to recv and send file descriptors. This can be used to pass around for example sockets.
Modifications:
- Add support for recv and send file descriptors when using EpollDomainSocketChannel.
- Allow to obtain the file descriptor for an Epoll*Channel so it can be send via domain sockets.
Result:
recv and send of file descriptors is supported now.
Motivation:
Using Unix Domain Sockets can be very useful when communication should take place on the same host and has less overhead then using loopback. We should support this with the native epoll transport.
Modifications:
- Add support for Unix Domain Sockets.
- Adjust testsuite to be able to reuse tests.
Result:
Unix Domain Sockets are now support when using native epoll transport.
Motivation:
On Linux, you can gather various metrics using getsockopt(..., TCP_INFO,
...).
Modifications:
Add EpollSocketChannel.tcpInfo() which returns EpollTcpInfo that exposes
all metrics exposed via getsockopt(..., TCP_INFO, ...)
Result:
TCP_INFO support implemented
Rebased and cleaned-up based on the work by @normanmaurer
Motivation:
Currently, IOExceptions and ClosedChannelExceptions are thrown from
inside the JNI methods. Instantiation of Java objects inside JNI code is
an expensive operation, needless to say about filling stack trace for
every instantiation of an exception.
Modifications:
Change most JNI methods to return a negative value on failure so that
the exceptions are instantiated outside the native code.
Also, pre-instantiate some commonly-thrown exceptions for better
performance.
Result:
Performance gain
Motivation:
Everytime a new connection is accepted via EpollSocketServerChannel it will create a new EpollSocketChannel that needs to get the remote and local addresses in the constructor. The current implementation uses new InetSocketAddress(String, int) to create these. This is quite slow due the implementation in oracle and openjdk.
Modifications:
Encode all needed informations into a byte array before return from jni layer and then use new InetSocketAddress(InetAddress, int) to create the socket addresses. This allows to create the InetAddress via a byte[] and so reduce the overhead, this is done either by using InetAddress.getByteAddress(byte[]) or by Inet6Address.getByteAddress(String, byte[], int).
Result:
Reduce performance overhead while accept new connections with native transport
Motivation:
We only provided a constructor in DefaultFileRegion that takes a FileChannel which means the File itself needs to get opened on construction. This has the problem that if you want to write a lot of Files very fast you may end up with may open FD's even if they are not needed yet. This can lead to hit the open FD limit of the OS.
Modifications:
Add a new constructor to DefaultFileRegion which allows to construct it from a File. The FileChannel will only be obtained when transferTo(...) is called or the DefaultFileRegion is explicit open'ed via open() (this is needed for the native epoll transport)
Result:
Less resource usage when writing a lot of DefaultFileRegion.
Motivation:
We use malloc(1) in the on JNI_OnLoad method but never free the allocated memory. This means we have a tiny memory leak of 1 byte.
Modifications:
Call free(...) on previous allocated memory.
Result:
Fix memory leak
Motiviation:
If sendmmsg is already defined then the native epoll module failed to build because of conflicting definitions.
The mmsghdr type was also redefined on systems that already supported this structure.
Modifications:
Provide a way so that systems which already define sendmmsg and mmsghdr can build
Provide a way so that systems which don't define sendmmsg and mmsghdr can build
Result:
The native EPOLL module can build in more environments
Motivation:
On linux with glibc >= 2.14 it is possible to send multiple DatagramPackets with one syscall. This can be a huge performance win and so we should support it in our native transport.
Modification:
- Add support for sendmmsg by reuse IovArray
- Factor out ThreadLocal support of IovArray to IovArrayThreadLocal for better separation as we use IovArray also without ThreadLocal in NativeDatagramPacketArray now
- Introduce NativeDatagramPacketArray which is used for sendmmsg(...)
- Implement sendmmsg(...) via jni
- Expand DatagramUnicastTest to test also sendmmsg(...)
Result:
Netty now automatically use sendmmsg(...) if it is supported and we have more then 1 DatagramPacket in the ChannelOutboundBuffer and flush() is called.
Motivation:
On linux it is possible to use the sendMsg(...) system call to write multiple buffers with one system call when using datagram/udp.
Modifications:
- Implement the needed changes and make use of sendMsg(...) if possible for max performance
- Add tests that test sending datagram packets with all kind of different ByteBuf implementations.
Result:
Performance improvement when using CompoisteByteBuf and EpollDatagramChannel.
Motivation:
InetAddress.getByName(...) uses exceptions for control flow when try to parse IPv4-mapped-on-IPv6 addresses. This is quite expensive.
Modifications:
Detect IPv4-mapped-on-IPv6 addresses in the JNI level and convert to IPv4 addresses before pass to InetAddress.getByName(...) (via InetSocketAddress constructor).
Result:
Eliminate performance problem causes by exception creation when parsing IPv4-mapped-on-IPv6 addresses.
Related issue: #2764
Motivation:
EpollSocketChannel.writeFileRegion() does not handle the case where the
position of a FileRegion is non-zero properly.
Modifications:
- Improve SocketFileRegionTest so that it tests the cases where the file
transfer begins from the middle of the file
- Add another jlong parameter named 'base_off' so that we can take the
position of a FileRegion into account
Result:
Improved test passes. Corruption is gone.
Motivation:
While benchmarking the native transport, I noticed that gathering write
is not as fast as expected. It was due to the fact that we have to do a
lot of array copies to put the buffer addresses into the iovec struct
array.
Modifications:
Introduce a new class called IovArray, which allows to fill buffers
directly into an off-heap array of iovec structs, so that it can be
passed over to JNI without any extra array copies.
Result:
Big performance improvement when doing gathering writes:
Before:
[nmaurer@xxx]~% wrk/wrk -H 'Host: localhost' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' -H 'Connection: keep-alive' -d 120 -c 256 -t 16 --pipeline 256 http://xxx:8080/plaintext
Running 2m test @ http://xxx:8080/plaintext
16 threads and 256 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 23.44ms 16.37ms 259.57ms 91.77%
Req/Sec 181.99k 31.69k 304.60k 78.12%
346544071 requests in 2.00m, 46.48GB read
Requests/sec: 2887885.09
Transfer/sec: 396.59MB
After:
[nmaurer@xxx]~% wrk/wrk -H 'Host: localhost' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' -H 'Connection: keep-alive' -d 120 -c 256 -t 16 --pipeline 256 http://xxx:8080/plaintext
Running 2m test @ http://xxx:8080/plaintext
16 threads and 256 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 21.93ms 16.33ms 305.73ms 92.34%
Req/Sec 194.56k 33.75k 309.33k 77.04%
369617503 requests in 2.00m, 49.57GB read
Requests/sec: 3080169.65
Transfer/sec: 423.00MB
Motivation:
We sometimes not use the correct exception message when throw it from the native code.
Modifications:
Fixed the message.
Result:
Correct message in exception
Motivation:
At the moment we use Get*ArrayElement all the time in the epoll transport which may be wasteful as the JVM may do a memory copy for this. For code-path that will get executed fast (without blocking) we should better make use of GetPrimitiveArrayCritical and ReleasePrimitiveArrayCritical as this signal the JVM that we not want to do any memory copy if not really needed. It is important to only do this on non-blocking code-path as this may even suspend the GC to disallow the JVM to move the arrays around.
See also http://docs.oracle.com/javase/7/docs/technotes/guides/jni/spec/functions.html#GetPrimitiveArrayCritical
Modification:
Make use of GetPrimitiveArrayCritical / ReleasePrimitiveArrayCritical as replacement for Get*ArrayElement / Release*ArrayElement where possible.
Result:
Better performance due less memory copies.
Motivation:
We need to continue write until we hit EAGAIN to make sure we not see an starvation
Modification:
Write until EAGAIN is returned
Result:
No starvation when using native transport with ET.
Motivation:
The handling of IOV_MAX was done in JNI code base which makes stuff really complicated to maintain etc.
Modifications:
Move handling of IOV_MAX to java code to simplify stuff
Result:
Cleaner code.
Motivation:
epoll transport fails on gathering write of more then 1024 buffers. As linux supports max. 1024 iov entries when calling writev(...) the epoll transport throws an exception.
Thanks again to @blucas to provide me with a reproducer and so helped me to understand what the issue is.
Modifications:
Make sure we break down the writes if to many buffers are uses for gathering writes.
Result:
Gathering writes work with any number of buffers
Motivation:
Currently it is impossible to build netty on linux system that not define SO_REUSEPORT even if it is supported.
Modification:
Define SO_REUSEPORT if not defined.
Result:
Possible to build on more linux dists.
Motivation:
When we do a (env*)->GetObjectArrayElement(...) call we may created many local references which will only be cleaned up once we exist the native method. Thus a lot of memory can be used and so a StackOverFlow may be triggered. Beside this the JNI specification only say that an implementation must cope with 16 local references.
Modification:
Call (env*)->ReleaseLocalRef(...) to release the resource once not needed anymore.
Result:
Less memory usage and guard against StackOverflow
Motivation:
Allow to set TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT in native transport to offer the user with more flexibility.
Modifications:
Expose methods to set these options and write the JNI implementation.
Result:
User can now use TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT.
Motivation:
There is currently no epoll based DatagramChannel. We should add one to make the set of provided channels complete and also to be able to offer better performance compared to the NioDatagramChannel once SO_REUSEPORT is implemented.
Modifications:
Add implementation of DatagramChannel which uses epoll. This implementation does currently not support multicast yet which will me implemented later on. As most users will not use multicast anyway I think it is fair to just add the EpollDatagramChannel without the support for now. We shipped NioDatagramChannel without support earlier too ...
Result:
Be able to use EpollDatagramChannel for max. performance on linux
Motivation:
In linux kernel 3.9 a new featured named SO_REUSEPORT was introduced which allows to have multiple sockets bind to the same port and so handle the accept() of new connections with multiple threads. This can greatly improve the performance when you not to accept a lot of connections.
Modifications:
Implement SO_REUSEPORT via JNI
Result:
Be able to use the SO_REUSEPORT feature when using the EpollServerSocketChannel
Motivation:
We sometimes see data corruption when writing to the EpollSocketChannel.
Modifications:
Correctly update the position of the ByteBuffer after something was written.
Result:
Fix data-corruption which could happen on partial writes
Motivation:
Native.epollCreate(...) fails on systems using a kernel < 2.6.27 / glibc < 2.9 because it uses epoll_create1(...) without checking if it is present
Modifications:
Check if epoll_create1(...) exists abd if not fall back to use epoll_create(...)
Result:
Works even on systems with kernel < 2.6.27 / glibc < 2.9
This transport use JNI (C) to directly make use of epoll in Edge-Triggered mode for maximal performance on Linux. Beside this it also support using TCP_CORK and produce less GC then the NIO transport using JDK NIO.
It only builds on linux and skip the build if linux is not used. The transport produce a jar which contains all needed .so files for 32bit and 64 bit. The user only need to include the jar as dependency as usually
to make use of it and use the correct classes.
This includes also some cleanup of @trustin