Motivation:
Cursors are better than iterators in that they only need to check boundary conditions once per iteration, when processed in a loop.
This should make them easier for the compiler to optimise.
Modification:
Change the ByteIterator to a ByteCursor. The API is almost the same, but with a few subtle differences in semantics.
The primary difference is that the cursor movement and boundary condition checking and position movement happen at the same time, and do not need to occur when the values are fetched out of the cursor.
An iterator, on the other hand, needs to throw an exception if "next" is called too many times.
Result:
Simpler code, and hopefully faster code as well.
Motivation:
We were seeing rare test failures where a cleaner had raced to close a memory segment we were using or closing.
The cause was that a single MemorySegment ended up used in multiple Buf instances.
When the SizeClassedMemoryPool was closed, the memory segments could be disposed without closing the gate in the NativeMemoryCleanerDrop.
The gate is important because it prevents double-frees of the memory segment.
Modification:
The fix is to change how the SizeClassedMemoryPool is closed, such that it always releases memory by calling `close()` on its buffers, which in turn will close the gate. The program will then proceed through the SizeClassedMemoryPool.drop implementation, which in turn will observe that the allocator is closed, and *then* dispose of the memory.
Result:
We should hopefully not see any more random test failures, but if we do, they would at least indicate a different bug.
This particular one was mostly showing up inside the cleaner threads, which were ignoring the exception, but occasionally I think the race went the other way, causing a test failure.
Motivation:
This will likely be a somewhat common operation, as buffers move between eventloop and worker threads, so it's important to have an understanding of how it performs.
Modification:
Add a benchmark that specifically targets the send() operation on buffers.
Result:
We got benchmark numbers that clearly show the cost of confinement transfer
Motivation:
Composite buffers are uniquely positioned to be able to extend their underlying storage relatively cheaply.
This fact is relied upon in a couple of buffer use cases within Netty, that we wish to support.
Modification:
Add a static `extend` method to Allocator, so that the CompositeBuf class can remain internal.
The `extend` method inserts the extension buffer at the end of the composite buffer as if it had been included from the start.
This involves checking offsets and byte order invariants.
We also require that the composite buffer be in an owned state.
Result:
It's now possible to extend a composite buffer with a specific buffer, after the composite buffer has been created.
Motivation:
Capture the performance characteristics of this primitive for various buffer implementations.
Modification:
Add a benchmark that iterate 4KiB buffers forwards, and backwards, on various buffer implementations.
Result:
Another aspect of the implementation covered by benchmarks.
Turns out the composite iterators a somewhat slow.
Motivation:
When a build fails, it's desirable to have the build artifacts available
so the build failure can be inspected, investigated and hopefully fixed.
Modification:
Change the Makefile and CI workflow such that the build artifacts are
captured and uploaded when the build fails.
Result:
A "target.zip" file is now available for download on failed GitHub
Actions builds.
Motivation:
Pooled buffers are a very important use case, and they change the cost dynamics around shared memory segments, so it's worth looking into in detail.
Modification:
Add another explicit close of pooled direct buffers to MemorySegmentClosedByCleanerBenchmark
Result:
Explicitly closing of pooled buffers is even out-performing cleaner close on the "heavy" workload, so this is currently the fastest way to run that workload:
Benchmark (workload) Mode Cnt Score Error Units
MemorySegmentClosedByCleanerBenchmark.cleanerClose heavy avgt 150 14,194 ± 0,558 us/op
MemorySegmentClosedByCleanerBenchmark.explicitClose heavy avgt 150 40,496 ± 0,414 us/op
MemorySegmentClosedByCleanerBenchmark.explicitPooledClose heavy avgt 150 12,723 ± 0,134 us/op
When send() a confined buffer, we had to first turn it into a shard buffer, so that it could be claimed by an arbitrary recipient thread.
As we've learned, however, closing shared segments is expensive.
We can speed up the send() call by simply leaving the segment shared.
This weakens the confinement of the received segment, though.
Currently no tests fails on that, but in the future we should re-implement confinement checking inside the Buf implementations themselves any, because pooled buffers also violate the confinement restriction, and we have a guiding principle that all buffers, regardless of implementation, should always behave the same.
The results of this change can be observed in the MemorySegmentClosedByCleanerBenchmark, with the heavy workload.
Explicitly closed segments now run the workload twice as fast, and the cleaner based closing is now 3 times faster.
Before:
Benchmark (workload) Mode Cnt Score Error Units
MemorySegmentClosedByCleanerBenchmark.cleanerClose heavy avgt 150 42,221 ± 0,943 us/op
MemorySegmentClosedByCleanerBenchmark.explicitClose heavy avgt 150 65,215 ± 0,761 us/op
After:
Benchmark (workload) Mode Cnt Score Error Units
MemorySegmentClosedByCleanerBenchmark.cleanerClose heavy avgt 150 13,871 ± 0,544 us/op
MemorySegmentClosedByCleanerBenchmark.explicitClose heavy avgt 150 37,516 ± 0,426 us/op
Also schedule a build to run at 06:30 in the morning, every Monday.
This way, the JDK and Netty 5 snapshots will be updated in the build cache every week.
Motivation:
Buffers should always behave the same, regardless of their underlying implementation and how they are allocated.
Modification:
The SizeClassedMemoryPool did not properly reset the internal buffer state prior to reusing them.
The offsets, byte order, and contents are now cleared before a buffer is reused.
Result:
There is no way to observe externally whether a buffer was reused or not.
Motivation:
Because of the current dependency on snapshot versions of the Panama Foreign version of OpenJDK 16, this project is fairly involved to build.
Modification:
To make it easier for newcomers to build the binaries for this project, a docker-based build is added.
The docker image is constructed such that it contains a fresh snapshot build of the right fork of Java.
A make file has also been added, which encapsulates the common commands one would use for working with the docker build.
Result:
It is now easy for newcomers to make builds, and run tests, of this project, as long as they have a working docker installation.
Motivation:
Resource lifetime was not correctly handled.
Modification:
We cannot call drop(buf) on a pooled buffer in order to release its memory from within ensureWritable, because that will send() the buffer back to the pool, which implies closing the buffer instance.
Instead, ensureWritable has to always work with untethered memory, so new APIs are added to AllocatorControl for releasing untethered memory.
The implementation already existed, because it was used by NativeMemoryCleanerDrop.
Result:
Buf.ensureWritable no longer closes pooled buffers.
Motivation:
Having buffers that are able to expand to accommodate more data on demand is a great convenience.
Modification:
Composite and MemSeg buffers are now able to mutate their backing storage, to increase their capacity.
This required some tricky integration with allocators via AllocatorControl.
Basically, it's now possible to allocate memory that is NOT bound by any life time, so that it can be attached to the life time that already exists for the buffer being expanded.
Result:
Buffers can now be expanded via Buf.ensureWritable.
Motivation:
Slices should behave identical to normal and composite buffers in all but a very select few aspects related to ownership.
Modification:
Extend the test generation to also produce slice-versions of nearly all test cases. Both slices that cover the entire buffer, and slices that only cover a part of their parent buffer.
Also fix a handful of bugs that this uncovered.
Result:
Buffer slices are now tested much more thoroughly, and a few bugs were fixed.