News in uvco
How time passes! This is already the third article discussing my uvco library for asynchronous programming in C++ - the uvco tag shows all past and future articles.
The reason for this installment is a set of fundamental improvements I’ve implemented in recent weeks, turning the fledgling, sometimes unstable and hard-to-use uvco library discussed in those previous articles into a more robust and scalable framework for almost-serious asynchronous programming in C++. In addition, several benchmarks have been implemented to better quantify the throughput and latency numbers currently achievable with uvco.
Before getting into the details, I want to highlight the use of libuv, which is where uvco gets its name from, and which has turned out to be a (reasonably) pleasant and robust (when used correctly) foundation. The libuv developers have done amazing work, and I look forward to future releases; what uvco gets from libuv for free is unified cross-platform I/O, using whichever asynchronous OS facility works best, including the new io_uring capability present on recent Linux kernels - libuv is able to use this for truly asynchronous file I/O.
Overview
While I recommend the other uvco articles for a more general look at the implementation, the library has changed enough that looking at those outdated articles may actually not be the best idea. So, what is uvco?
uvco is Python asyncio/Node.js/Rust .await-style asynchronous programming for C++, leveraging the coroutine
mechanism introduced in C++20. It allows writing fully asynchronous code in a way that looks synchronous - that means:
no callbacks, no strange control flow, etc. Instead, any asynchronous operation is co_awaited, suspending a coroutine
and letting the runtime scheduler yield back control to it later. There is nothing new in uvco, except that it’s a
fairly small implementation with few dependencies; other implementations like kj and boost::asio often bring their
own class library or have a very distinct style and complex details. uvco, in contrast, opts for doing things in the
simplest way possible, imposing lower demands on application code that uses it.
There are a few important classes in uvco (all of them documented), but if I had to
pick just one, it would be the class template Promise<T>. If you’ve worked with other asynchronous languages or
frameworks, you already know what’s coming: The usual phrasing is something like “A Promise<T> represents a handle to
a value of type T that is not yet available; at some point in the future it may resolve, returning the value.” As I
haven’t done any significant innovation in writing uvco, that holds true all the same. T can be void, by the way,
representing an ongoing operation that may fail, but for which we are not interested in a specific result.
A Promise<T> is always backed by a coroutine. In fact, there is no other way to create a Promise<T> than to call a
coroutine! A coroutine in uvco, and in C++ more generally, is a function using any of the
co_await/co_yield/co_return keywords. In uvco, such a function typically returns a Promise<T>, although there
is one other promise-like type, which we can discuss later.
All the prose in the world can’t explain what a piece of code can demonstrate to an experienced programmer, the “feel”
of a library. So let’s do that - the following coroutine is taken from uvco/examples/http_server.cc, which implements a working (but very simple) HTTP server. This coroutine is responsible for reading the header section off an uvco::TcpStream and returning it as one string:
#include <uvco/examples/http_server.h>

using namespace uvco;

// ...

Promise<std::optional<std::string>> readHeaders(TcpStream &stream) {
  std::string requestBuffer;
  char buffer[4096];
  size_t headersEndPos = std::string::npos;

  while (headersEndPos == std::string::npos) {
    if (requestBuffer.length() > MAX_HEADER_SIZE) {
      co_return std::nullopt;
    }
    auto bytesRead = co_await stream.read(std::span(buffer));
    if (bytesRead == 0) {
      co_return std::nullopt;
    }
    requestBuffer.append(buffer, bytesRead);
    headersEndPos = requestBuffer.find("\r\n\r\n");
  }

  co_return requestBuffer.substr(0, headersEndPos + 4);
}
As promised (pun possibly intended), this code almost reads as if it were reading the header section of a normal
synchronous socket; the only difference is the use of uvco’s classes, and co_await, which will let uvco suspend and
later resume the running coroutine whenever data has arrived on the stream.
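To give a sense of how such a coroutine is consumed, here is a small sketch of a caller; handleOneClient is made up for illustration and is not the example server’s actual connection handler:

// Hypothetical caller, for illustration only.
Promise<void> handleOneClient(TcpStream stream) {
  // Suspends here until the full header section (or an error condition)
  // has been read.
  std::optional<std::string> headers = co_await readHeaders(stream);
  if (!headers) {
    // Oversized headers or a closed connection: just drop the stream.
    co_return;
  }
  // ... parse *headers, write a response to stream, etc. ...
}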
High-level Improvements
Ownership and Cancellation
In the early days of uvco, Promise<T> was a reference-counted class with a “weak feel” to it. The reasons were three-fold:
- Promises could be copied; more than one Promise referring to the same coroutine was allowed and used.
- Even after the last promise referring to a coroutine was dropped, the coroutine would keep running.
- Promises could be value-constructed, representing an already present value.
This meant that control over execution in an uvco program was not very precise, and cancelling ongoing coroutines was not possible at all. Even trivial behaviors, common in asynchronous programs, like cancelling a backend request if a client connection breaks, were impossible to implement. This made uvco a nice toy, but not suitable for implementing actual software.
This had irked me for a long time, but I hadn’t managed to find a good way of addressing it without rewriting the whole library. About two months ago I got started on this improvement, and eventually succeeded.
Now, Promise<T> fulfills the following characteristics:
- The only way to obtain a Promise<T> is to implement and call a coroutine co_returning T.
- A Promise<T> cannot be default- or copy-constructed; it can only be moved.
- There is only one Promise<T> for a given coroutine.
- Dropping a Promise<T> will cancel the underlying coroutine, including correct cleanup of any pending I/O operations, timers, etc. (with one known exception, described below).
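A minimal sketch of what this means for application code (echoClient is a made-up coroutine, not part of uvco):

// Hypothetical coroutine, for illustration only.
Promise<void> echoClient(TcpStream stream);

Promise<void> demo(TcpStream stream) {
  // Calling the coroutine is the only way to obtain its Promise; the promise
  // cannot be copied, only moved.
  Promise<void> task = echoClient(std::move(stream));

  // ... await other work here ...

  co_return;
  // task is dropped as demo() finishes: echoClient is cancelled, and any
  // pending I/O it holds is cleaned up by its awaitables' destructors.
}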
This was a bit more difficult to implement, but turned out to be much easier to maintain, and has simplified application code by a lot. The main reason library code became harder to write is that every coroutine (especially the “root” coroutines, which connect uvco’s Promises with libuv primitives) now has multiple exit points: every suspension point (co_await or co_yield) could now be a coroutine’s last executed code. A suspended coroutine must be prepared to be destroyed and never resumed. However, using RAII and implementing meaningful destructors for the awaitables at the lowest level (directly above libuv) has made this behavior robust and dependable.
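The principle can be sketched generically like so; EventSource stands in for a libuv handle or request, and none of this is uvco’s actual code:

#include <coroutine>
#include <functional>
#include <utility>

// A stand-in for a libuv handle/request that can have one pending waiter.
struct EventSource {
  std::function<void()> pending;
  void arm(std::function<void()> cb) { pending = std::move(cb); }
  void disarm() { pending = nullptr; }
  void fire() {
    if (pending) {
      auto cb = std::move(pending);
      pending = nullptr;
      cb();
    }
  }
};

// A low-level awaiter: registers interest on suspension, and deregisters in
// its destructor if the coroutine is destroyed before being resumed.
struct EventAwaiter {
  EventSource &source;
  bool armed = false;

  bool await_ready() const noexcept { return false; }
  void await_suspend(std::coroutine_handle<> handle) {
    armed = true;
    source.arm([this, handle]() {
      armed = false;
      handle.resume();
    });
  }
  void await_resume() const noexcept {}

  // Runs either after a normal resumption or when the suspended coroutine is
  // destroyed (cancelled); in the latter case it undoes the registration so
  // nothing dangles.
  ~EventAwaiter() {
    if (armed) {
      source.disarm();
    }
  }
};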
Tests have been added covering cancellation, and cancellation works recursively; meaning, even a nested coroutine call will be cancelled correctly and clean up all held resources if the “root” promise is dropped. The one exception is fs operations: on many platforms, libuv uses a thread pool for file operations, and once a file operation has already started, cancellation is no longer possible. On some platforms - actually, just GitHub Actions Ubuntu 22.04 runners - this fails in rare cases.
Synchronous Close
Related, and not any less irksome, was the requirement of previous uvco versions to call
co_await handle.close();
for almost all handles before dropping them. This was the case because libuv considers closing a handle an asynchronous operation, even though it almost always succeeds instantaneously. The requirement also blocked automatic cancellation, as there was no clear way around it - specifically, around the problem that a handle’s memory may only be freed after the close operation has finished, which typically happens only on the next turn of the event loop.
However, I’ve managed to find a way around this: it turns out that the libuv loop keeps running as long as there are active handles on it, and since closing handles happens quite quickly, we can exploit this to pretend (inside uvco) that close() is actually synchronous. The other part of the equation is memory deallocation: how do we free libuv handle memory only after closing has completed, at a time when our uvco handle has already been destroyed? (For reference, a “libuv handle” is an instance of, for example, uv_tcp_t.) To make this work, libuv handles had to move behind unique_ptrs in almost all cases; now, when uvco::closeHandle() is called internally, closeHandle() can detect whether it was used in a synchronous manner (i.e., was not awaited). In that case it knows that the libuv handle memory still has to be freed, and does so.
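Internally, this boils down to the standard libuv idiom of handing the handle’s memory over to the close callback. The following is a minimal sketch of that pattern, not uvco’s actual closeHandle():

#include <uv.h>

#include <memory>

// Sketch of "close now, free later": ownership of the libuv handle moves to
// the close callback, so the C++ wrapper can be destroyed immediately while
// libuv finishes closing on a later loop turn.
void closeAndFreeLater(std::unique_ptr<uv_tcp_t> tcp) {
  auto *handle = reinterpret_cast<uv_handle_t *>(tcp.release());
  uv_close(handle, [](uv_handle_t *closed) {
    // Only now, after libuv has fully closed the handle, may its memory go.
    delete reinterpret_cast<uv_tcp_t *>(closed);
  });
  // The loop keeps running while the handle is closing, so returning here
  // right away is fine.
}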
Just like that, application code can now close handles implicitly, simply by relying on their destructors - as we would expect from a C++ library:
// from test/tcp_test.cc

Promise<void> serverLoop(MultiPromise<TcpStream> clients) {
  while (true) {
    std::optional<TcpStream> maybeClient = co_await clients;
    if (!maybeClient) {
      co_return;
    }
    TcpStream &client{maybeClient.value()};
    std::optional<std::string> chunk = co_await client.read();
    co_await client.writeBorrowed(*chunk);
    // Nothing needs to be closed here explicitly! Asan is still happy.
  }
}

Promise<void> sendReceivePing(const Loop &loop, AddressHandle addr) {
  TcpClient client{loop, addr};
  TcpStream stream = co_await client.connect();
  co_await stream.write("Ping");
  std::optional<std::string> response = co_await stream.read();
  EXPECT_EQ(response, "Ping");
  // Neither here!
}

TEST(TcpTest, repeatedConnectSingleServerCancel1) {
  auto setup = [&](const Loop &loop) -> Promise<void> {
    AddressHandle addr{"127.0.0.1", 0};
    TcpServer server{loop, addr};
    const AddressHandle actual = server.getSockname();
    Promise<void> serverHandler = serverLoop(server.listen());
    co_await sendReceivePing(loop, actual);
    co_await sendReceivePing(loop, actual);
    co_await sendReceivePing(loop, actual);
  };
  run_loop(setup);
}
Fewer Allocations
PromiseCore<T>, a class used to instantiate objects shared by the coroutine and its promise(s), used to always be
allocated dynamically, and was reference-counted to enable the behavior described in the first section above. There was
nothing the compiler could have done to optimize this. This meant two allocations for every coroutine call: one to
allocate the coroutine frame (which some compilers could elide, some of the time); and one allocation for the promise
core, pointed to by both the coroutine frame and one or more promise objects.
As part of the “renovation”, the PromiseCore<T> class still exists all the same. However, because there is now only one Promise<T> object for a given coroutine, the ownership structure is clearer: the PromiseCore<T> object lives as a field inline within the coroutine frame and is pointed at by the Promise<T>. A coroutine call now results in at most one allocation, and whenever possible a compiler may choose to elide even that.
This did leave me with one issue: how to ensure that the promise core outlives the Promise pointing at it? For this, uvco now makes use of the final_suspend() hook provided by the Coroutine<T> class (which is Promise<T>’s promise_type, in C++ parlance). This hook now tells the runtime to leave the coroutine frame untouched after the coroutine itself exits. The responsibility for cleaning up the coroutine now falls to the Promise<T>::~Promise destructor; this is enabled by the fact that 1) there is only one Promise per coroutine, and 2) dropping a Promise<T> cancels the associated coroutine, leaving no window in time where the coroutine still runs without a promise pointing to it.
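Stripped of all details, the scheme looks roughly like the following sketch. The names are deliberately different from uvco’s real Promise<T>, Coroutine<T> and PromiseCore<T>, which carry much more machinery:

#include <coroutine>
#include <utility>

// Shared state, stored inline in the coroutine frame: no separate allocation.
template <typename T> struct SketchCore {
  T result{}; // plus the waiting coroutine, exception slot, etc. in reality
};

template <typename T> struct SketchPromise;

// The promise_type: owns the core, and keeps the frame alive at the end.
template <typename T> struct SketchCoroutine {
  SketchCore<T> core;

  SketchPromise<T> get_return_object() {
    return SketchPromise<T>{
        std::coroutine_handle<SketchCoroutine>::from_promise(*this)};
  }
  std::suspend_never initial_suspend() noexcept { return {}; }
  // suspend_always: the frame is NOT freed when the coroutine finishes; it
  // stays around until the single SketchPromise destroys it.
  std::suspend_always final_suspend() noexcept { return {}; }
  void return_value(T value) { core.result = std::move(value); }
  void unhandled_exception() noexcept { /* store exception in the core */ }
};

// The single, move-only handle pointing at the frame (and thus at the core).
template <typename T> struct SketchPromise {
  using promise_type = SketchCoroutine<T>;

  explicit SketchPromise(std::coroutine_handle<promise_type> h) : frame{h} {}
  SketchPromise(SketchPromise &&other) noexcept
      : frame{std::exchange(other.frame, nullptr)} {}
  SketchPromise(const SketchPromise &) = delete;

  ~SketchPromise() {
    if (frame) {
      // Destroys the frame: cancels a still-suspended coroutine (running its
      // awaiters' destructors) or frees an already-finished one, together
      // with the core stored inside it.
      frame.destroy();
    }
  }

  // awaiting machinery omitted
  std::coroutine_handle<promise_type> frame;
};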
Combinators
Enabled by the new Promise<T> semantics, a new small combinators library has been added to uvco. It provides
meta-coroutines, which allow combining existing promises from other sources into new ones with a specific behavior. What
does this mean in practice?
// Resolve when the first promise resolves, cancelling all others.
// Returns results for the first finished coroutines
// (may be more than one).
template <typename... PromiseTypes>
Promise<std::vector<std::variant<PromiseTypes...>>>
race(Promise<PromiseTypes>... promises);
// Same as race() but ignore the results.
template <typename... PromiseTypes>
Promise<void> raceIgnore(Promise<PromiseTypes>... promises);
// Nice version of race(), not cancelling any coroutines.
template <typename... PromiseTypes>
Promise<std::vector<std::variant<PromiseTypes...>>>
waitAny(Promise<PromiseTypes> &...promises);
// Wait for all promises to finish and return all results.
template <typename... PromiseTypes>
Promise<std::tuple<typename detail::ReplaceVoid<PromiseTypes>::type...>>
waitAll(Promise<PromiseTypes>... promises);
// Let coroutines run in the background, cleaning up after any that finish;
// allows owning a dynamic number of coroutines, where otherwise a
// vector<Promise<void>> or similar would be necessary.
// Also logs any failures unless a handler is supplied.
class TaskSet;
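As a quick illustration of how these compose - fetchFromBackend and readFromClient are made-up coroutines, not uvco API:

// Hypothetical coroutines, for illustration only.
Promise<std::string> fetchFromBackend();
Promise<std::string> readFromClient();

Promise<void> combineExample() {
  // Wait for both to finish and get both results back as a tuple.
  auto [backendReply, clientData] =
      co_await waitAll(fetchFromBackend(), readFromClient());

  // Or: resolve as soon as the first one finishes, cancelling the other.
  // auto first = co_await race(fetchFromBackend(), readFromClient());

  // ... use backendReply and clientData ...
  co_return;
}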
A pattern common in any server-type application is having one coroutine per client, running in the background until the
client disconnects or has finished. For this, we need a way to own coroutines (otherwise they’d be cancelled), but we
want to clean up whenever one finishes in order to not leak resources, as the server’s main loop may run for an
indefinite amount of time. A TaskSet is the right choice here:
// from http_server.cc

Promise<void> httpServer(TcpServer &server, Router router) {
  auto streamPromise = server.listen();
  auto handlers = TaskSet::create();
  while (auto stream = co_await streamPromise.next()) {
    handlers->add(connectionHandler(std::move(*stream), router));
  }
}
In combination, we almost have an “algebra of coroutines”, allowing us to compose coroutines in almost arbitrary ways. The existing combinators certainly don’t go all the way yet, and I’m looking forward to adding more later on.
That’s mostly it for the user-facing changes. I found the new uvco to be much more robust against misuse and less prone to crashing due to a minor oversight in application code. There are examples maintained in the test/ directory, such as tcp-broadcaster.exe.cc and memcached-impl.exe.cc, as well as http-server in the benchmarks/ directory. They serve both to put a number on uvco’s performance (and help with improvements) and to demonstrate and check that uvco is suitable for real-world network programming.
Benchmarks
At some point or another, people will ask about an async library’s performance. It’s not enough to make programmers' lives easier - no! We also have to provide sufficient throughput to solve interesting problems. Especially in uvco’s case, where we consciously limit operations to a single thread in order to eschew locking and race condition concerns, the overhead of the library matters quite a bit.
Realistic Benchmarks
The first part consists of realistic benchmarks. One example is memcached-impl.exe.cc, which is compiled into the test-memcached-impl executable. Using multiple instances of the supplied memcached_client.py script, we get the following throughput numbers on my AMD Ryzen 7 PRO 7840U CPU, using the standard ondemand governor:
- 1 client, 100k x 2 (get/set) requests: 40k req/s (25 µs/req)
- 2 clients, 100k x 2 (get/set) x 2 requests: 69k req/s
- 3 clients, 100k x 2 x 3 requests: 104k req/s (10 µs/req)
- 4 clients, 100k x 2 x 4 requests: 114k req/s (8.8 µs/req)
These numbers are measured on an executable built in Release mode with LTO enabled. The following figure shows the
flamegraph generated by perf record -g -F 13999, using the cycles:P counter.

It’s clear from this profile that most of the time is really spent in the kernel (blue), doing network I/O. Some time is spent in handleClient() in user space, which is mostly request parsing and response formatting. Overhead from uvco is fairly minimal; only about 10% of the time is spent in userspace outside of libuv or the kernel.
A second realistic benchmark is the http-server executable. It links against the http_server from uvco/examples/, exporting just two handlers over an HTTP/1.1-ish interface (keep-alive is supported, but not much more). Using Apache Bench (ab) with the following invocation, the simple server handles almost 150k requests per second, with a peak latency of 2 ms. Obviously, serving a response of 12 bytes is not a realistic benchmark, but the handling of concurrent text-based requests over TCP is :)
$ ab -n 500000 -c25 -k http://localhost:8000/
Document Path: /
Document Length: 12 bytes
Concurrency Level: 25
Time taken for tests: 3.420 seconds
Complete requests: 500000
Failed requests: 0
Keep-Alive requests: 500000
Total transferred: 37500000 bytes
HTML transferred: 6000000 bytes
Requests per second: 146205.63 [#/sec] (mean)
Time per request: 0.171 [ms] (mean)
Time per request: 0.007 [ms] (mean, across all concurrent requests)
Transfer rate: 10708.42 [Kbytes/sec] received
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       1
Processing:     0    0   0.1      0       2
Waiting:        0    0   0.1      0       2
Total:          0    0   0.1      0       2
Percentage of the requests served within a certain time (ms)
50% 0
66% 0
75% 0
80% 0
90% 0
95% 0
98% 0
99% 0
100% 2 (longest request)
I also profiled this execution, and the flame graph actually looks quite similar to the memcached example above - unsurprising, given the similar workload:

Microbenchmarks
There is also a suite of microbenchmarks implemented using Google’s benchmark library. They are built by default if
that library is found on the build host. From the promise-benchmark suite, we find numbers on the CPU described above
in the following ballparks (LTO enabled, Release build - these benchmarks spend no time in the kernel, as everything
is scheduled within the uvco library; however, memory allocations for coroutine frames make up much of the overhead):
1. 18ns to call, await, and resolve the trivial yield() coroutine (which ensures that control flow goes through the event loop).
2. 34ns to call, await, and resolve a nested yield(); i.e., calling a coroutine that itself awaits yield(). Combined with (1), this puts the overhead per coroutine invocation at ~18ns.
3. 11ns to yield a trivial value (int) from a MultiPromise<int> generator coroutine. Generators haven’t been described in this article so far; they can efficiently yield multiple values as a generalization of Promise<T>. This benchmark shows that they can save about 30% of the time compared to constructing a new coroutine for every new value.
Doing these benchmarks, I also managed to confirm some suspicions about inlining and LTO: at first, Promise<int> was
faster than Promise<void> by a significant margin. You can still see this by disabling LTO: for point (1) above, the
time increases to 27ns for Promise<void>, and to 20ns for Promise<int>. The main difference between the two is that
Promise<void> is defined in a .cc file, whereas the generic template implementations are located in a header file.
This means that the compiler has a much better inside view of Promise<int> than of Promise<void>.
However, once LTO is enabled, the difference between the two vanishes completely.
As a second data point, the stream-benchmark puts a number on the time required for a minimal socket read/write: 2.7 µs is what it takes to first write, then read 256 bytes to/from a FIFO created by pipe(2) (or rather, by uvco’s uvco::pipe(const Loop&) wrapper). This is astonishingly short, given the usual syscall overhead of > 5 µs. For this
benchmark (bmPipeReadWriteBuffer), 50% of time is spent in userspace, and the remainder in the kernel. The following
figure shows the corresponding flamegraph:

Symmetric Hand-off
There is a specific feature in the C++20 coroutine implementation that some call “symmetric hand-off”. When a coroutine
is suspended on a co_await or a co_yield, the await_suspend() hook is invoked on the awaitable to define the
details of the suspension (and later resumption). Usually, this method just returns a bool, which is typically true,
meaning that the coroutine indeed should be suspended. It could also be false, implying that the coroutine should
simply continue at this point. However, there’s a third option that I’ve recently implemented in uvco: the method can return another std::coroutine_handle<>, which is then resumed immediately once the first coroutine has been suspended.
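In code, the difference between the variants looks like this generic sketch (not uvco’s scheduler):

#include <coroutine>

struct BoolSuspend {
  bool await_ready() const noexcept { return false; }
  // true: suspend and fall back to whoever resumed us (e.g. the scheduler);
  // false: don't suspend, keep running the current coroutine.
  bool await_suspend(std::coroutine_handle<>) noexcept { return true; }
  void await_resume() const noexcept {}
};

struct SymmetricHandoff {
  std::coroutine_handle<> next; // the coroutine to run instead of us

  bool await_ready() const noexcept { return false; }
  // Returning a handle: the current coroutine is suspended and `next` is
  // resumed immediately, without a round trip through the scheduler.
  std::coroutine_handle<> await_suspend(std::coroutine_handle<>) noexcept {
    return next;
  }
  void await_resume() const noexcept {}
};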
The idea is that directly handing off from one coroutine to the next will improve efficiency or latency. For uvco, the
use of this behavior can be controlled by the useSymmetricHandoff constant in scheduler.cc. However, I found that
activating this feature does not significantly improve throughput, and instead sometimes even worsens it. The standard
alternative to using symmetric hand-off is to simply return to the scheduler that had resumed a coroutine, and let it schedule the next one. Another downside is debuggability: symmetric hand-off results in nested and “ripped apart”
callstacks, making profiling almost impossible, such as in this example of running the http-server executable with
concurrent connections and symmetric hand-off enabled.

Next Steps
I’m always looking forward to feedback and adopters of uvco, although it is obviously at an early stage with few useful features. For any serious application, even I would likely consider Rust/Tokio or Python/asyncio, depending on the performance needs. However, if C++ is prescribed, and the only alternative is Boost.Asio - you know where to find uvco ;-)
I would like to further expand uvco and especially build out higher-level features. For example: improve on the HTTP server and add websocket support; expand the supported feature set of the CURL and pqxx integrations; build a small RPC protocol with integrated (de)serialization; consider how to make uvco multi-threaded by default by providing a more flexible executor interface; etc.