Internals
The target audience for this document is C++ programmers who want to delve into the project’s code, not lua users. Native plug-in authors should also read this page.
The intent of this page is not to detail every internal of the project, but just to give an overview of the architecture. Details change quickly and documentation would lag behind, so they’re avoided.
Once you read it, you should be familiar with the assumptions made throughout the project, and how to interact with the native code.
We assume that you already have some familiarity with the lua C API and Boost.Asio.
Multiple lua VMs
The project allows multiple OS threads to call asio::io_context::run(), so lua VMs can jump from one thread to another freely, but they will always refer to the same asio::io_context and each will be protected by its own ASIO strand.
-- Instantiates a new lua VM that shares
-- the caller's `asio::io_context`
spawn_vm(module)
-- Instantiates a new lua VM in a new
-- thread with its own `asio::io_context`
spawn_vm{ module=module, inherit_context=false }
You must specify a lua module name to run in the new VM, not a function. The module will be loaded and run in the new VM.
The only way for two different lua VMs to communicate is message passing. The channels are given when you instantiate the extra VMs. The channels accept a range of different values and will deep-copy them. You can also send references to IO objects, but the original references will be rendered unusable (their metatables are unset). Take care not to send objects that have pending operations (EBUSY, but do create an error code just for that).
Neither synchronization primitives (such as mutex) nor fiber handles can be sent over the channels and, by implication, they can’t be used to synchronize (or send cancellation requests to) fibers running in different lua VMs.
You can also send a channel over a channel. This will only send the channel “address” over and will allow complex routing among the lua VMs. If you send a channel’s rx-end, the other side will receive a tx-channel anyway. On the C++ side, we need to implement an MPSC strand-based channel.
These characteristics should be enough to implement actor patterns. And it is not the job of emilua to enforce good patterns on applications. The patterns can be configured purely in the lua side of coding.
-- Spawn extra threads to the
-- caller's `asio::io_context`
spawn_context_threads(count)
Leaving the actor model aside for a moment, it’s now easy to have threads with work-stealing (e.g. 8 lua VMs sharing the same asio::io_context running on 4 threads) so you don’t have to worry about load-balancing.
Inside a single lua VM
When you issue some IO operation (including chan:receive()), the calling fiber will suspend, but other fibers from the same lua VM are allowed to kick in (cooperative multitasking). Fibers can share state with each other safely (and free from contention problems) as if the program were single-threaded.
-- Spawn a new fiber on this lua VM
spawn(fn)
You can use the fiber handle just like you’d use a thread handle. There are join(), detach() and interrupt().
All sync primitives obey some characteristics thanks to the restrictions we’ve laid out:
- They always live in the same strand. They never migrate strands.
- They don’t synchronize with fibers from other strands (except for channels, but that’s another story).
Given these conditions, it’s now easier to implement and reason about the C++ code.
Only the C++ code that suspended the fiber can resume it back. If the operation should be cancellable, the async op should set an interrupter before suspending the fiber. No other code from the runtime will wake this fiber up. Once the interrupter is called, it’ll be cleared automatically to prevent further complications on the async op implementation. The completion handler should also clear the interrupter to make sure it won’t be (wrongly) reused for other operations.
Exploiting these properties achieves a good level of serialization and simplifies the implementation a lot. For one, you know no other code will wake the fiber up, so you can just call io_obj.cancel() from the interrupter and map asio::error::operation_aborted to errc::interrupted in the completion handler. A single handler (and no other) will take care of waking the fiber. There is no race to deal with here or anything of the kind.
A lot of the boilerplate is already handled by the prologue/epilogue functions from vm_context.
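The interrupter rules above can be sketched with the standard library alone. This is only a model under stated assumptions: fiber_sketch, io_object_sketch, async_op() and interrupt() are hypothetical names, std::errc::operation_canceled stands in for asio::error::operation_aborted, and no real strand or lua_State is involved.

```cpp
#include <cassert>
#include <functional>
#include <system_error>
#include <utility>

// Hypothetical per-fiber bookkeeping; not the project's real data structure.
struct fiber_sketch {
    std::function<void()> interrupter; // set by the async op before suspending
    bool suspended = false;
    std::error_code result;            // what the fiber observes once resumed
};

// Hypothetical IO object. cancel() completes the pending operation with
// operation_canceled, standing in for asio::error::operation_aborted.
struct io_object_sketch {
    std::function<void(std::error_code)> pending;

    void finish(std::error_code ec) {
        if (!pending) return;
        auto handler = std::move(pending);
        pending = nullptr;
        handler(ec);
    }
    void cancel() { finish(std::make_error_code(std::errc::operation_canceled)); }
};

// The async op: sets the interrupter, suspends the fiber and installs the
// single completion handler that is allowed to wake the fiber up.
inline void async_op(fiber_sketch& fiber, io_object_sketch& io) {
    assert(!io.pending && "this sketch supports one pending operation");
    fiber.interrupter = [&io] { io.cancel(); };
    fiber.suspended = true;
    io.pending = [&fiber](std::error_code ec) {
        fiber.interrupter = nullptr; // never reused by a later operation
        if (ec == std::errc::operation_canceled)
            ec = std::make_error_code(std::errc::interrupted);
        fiber.result = ec;
        fiber.suspended = false;
    };
}

// What the runtime does on interruption: the interrupter is cleared before
// it runs, so it fires at most once.
inline void interrupt(fiber_sketch& fiber) {
    if (!fiber.interrupter) return;
    auto fn = std::move(fiber.interrupter);
    fiber.interrupter = nullptr;
    fn();
}
```

Calling interrupt() on a suspended fiber ends with result equal to std::errc::interrupted, while io.finish({}) models a normal completion. In both paths the completion handler is the only code that wakes the fiber.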
Userdata practices
Besides the common practices to create custom objects through userdata, Emilua (IO) objects will also:
- Hide the metatable. By doing that, user code is prevented from changing the metatable (the metatable is just a usual table after all) that native code relies on.
- Assume lua_setmetatable() is an indivisible operation for userdata (i.e. if it fails, it doesn’t set a metatable nor any __gc metamethod). This assumption is important to simplify object management by doing away with all the pre-initialization tricks taught in Roberto’s manuals and their associated complexities.
- Assume lua_setmetatable() reports errors through exceptions (i.e. it always returns 1). This is a superset of the previous point and it is guaranteed by the VM[1]. We don’t really care as much about this point, but as it is guaranteed, the assumption described in the previous point (which we do care about) is covered as well.
C++ async operations
Let’s begin with require(). require()'ing a module is also an async operation which will suspend the caller fiber. Every module has its own isolated environment (i.e. a new lua thread is created for every module and that thread’s environment is configured to use a separate lua table) sharing the same lua VM. The module’s entry point is user-provided source code evaluated to prepare the environment with the names that should be exported to the caller fiber. But this preparatory step may not be immediately ready and may need to call other async operations. The rule we define is that a module is marked as loaded and ready when its main fiber finishes (synchronization code similar to fiber:join()).
To further enforce a more manageable project layout, it is only allowed to import new modules from the main fiber. This may introduce a “slow” startup in some project layouts, but:
- It is simpler to reason about the relationship of exported/imported names if we restrict them to the same main fiber. One use we make of this feature is detecting whether the inbox module was loaded, and closing it if not.
- We are explicitly not aiming for remote modules (e.g. JS running on a web browser), so we don’t need to care about slow startup happening in this event.
- In the cases where some module startup is indeed slow, the module’s programmer can adopt lazy loading techniques within the module’s functions to have a quick startup with respect to the rest of the application.
Modules evaluate only once and are cached. We never unload them. We keep a reference to their lua thread for as long as the lua VM is active.
Loading a module forms a loader-loaded relationship. This relationship builds a chain that must be checked when a new module is require()d (so we can for instance prevent cyclic imports). But each module will have its own environment. This means the C++ function that implements require() needs to check lua-hidden state associated with the caller lua function (not a global one). That’s the module system state per-module.
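The chain check can be pictured as a walk over loader links. The sketch below is a minimal model, assuming a hypothetical module_state_sketch record with a name and a pointer to its loader; how the project actually stores and keys this state is not shown here.

```cpp
#include <cassert>
#include <string>

// Hypothetical per-module state; the real one is kept in lua-hidden state.
struct module_state_sketch {
    std::string name;
    const module_state_sketch* loader; // nullptr for the root module
};

// Walks the loader chain, starting at the module that called require(),
// and reports whether loading `requested` would close a cycle.
inline bool would_form_cycle(const module_state_sketch& caller,
                             const std::string& requested)
{
    assert(!requested.empty());
    for (const module_state_sketch* m = &caller; m; m = m->loader) {
        if (m->name == requested)
            return true;
    }
    return false;
}
```

Because the walk starts from the caller’s own state, two unrelated branches of the import graph never see each other, which is why a single global "currently loading" list would not be enough.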
Rule: The per-module state is stored by using the module’s main thread as a key in the fibers table. The fibers table is strong, but this isn’t a problem because the module shall never be unloaded anyway. Code that unrefs fiber coroutines shall check whether the lua thread represents a module and skip removing it from the fibers table if so.
We can’t store the module system data directly in the thread environment because lua code can change the thread environment by calling setfenv(0, table).
We’ve already gone through the trickiest parts and added the most important restrictions to the table (no lua-related pun intended), so the remaining rules should be quick’n’easy to catch.
When you initiate an async operation, the C++ side will copy the lua_State* to handle the completion (or cancellation) later. However, any LUA_ERRMEM will trigger an emilua-call to lua_close() and L may then be invalid when we later try to resume it. So the completion handler needs to check whether the VM is still valid before accessing it, and this is the purpose of the vm_context structure (also protected by the same strand as the VM).
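A minimal sketch of that validity check follows. It assumes the completion handler holds a weak reference to a hypothetical vm_handle_sketch object; the valid flag and the resumptions counter are stand-ins for the real lua_State* and lua_resume() call, and strand handling is left out.

```cpp
#include <cassert>
#include <functional>
#include <memory>

// Hypothetical stand-in for vm_context.
struct vm_handle_sketch {
    bool valid = true;   // becomes false once lua_close() has been called
    int resumptions = 0; // counts what would be lua_resume() calls
    void close() { valid = false; }
};

// Builds a completion handler that only touches the VM if it still exists
// and has not been closed in the meantime.
inline std::function<void()>
make_completion_handler(const std::shared_ptr<vm_handle_sketch>& vm)
{
    std::weak_ptr<vm_handle_sketch> weak = vm;
    return [weak] {
        auto locked = weak.lock();
        if (!locked || !locked->valid)
            return; // the lua_State* is gone: do not touch it
        ++locked->resumptions;
    };
}
```

A handler created this way resumes the fiber while the VM is alive and silently does nothing after close() or after the last strong reference is dropped.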
this_fiber
As long as lua code is executing, there is a current fiber and this property stays unchanged for as long as control doesn’t return to host.
- transparent, adj.: Being or pertaining to an existing, nontangible object. “It’s there, but you can’t see it” (IBM System/360 announcement, 1964)
- virtual, adj.: Being or pertaining to a tangible, nonexistent object. “I can see it, but it’s not there.” (Lady Macbeth)
This property is mostly transparent to lua code. Which is to say that the programmer is aware of this property, but there isn’t a tangible object that it can track back to this_fiber. This is mostly true, but there is a quite tangible this_fiber lua global object that the user can inspect, exposed at the beginning of the first thread execution.
However, this_fiber being a global is shared among all the fibers, so it can’t point to a single fiber. Instead, it will query which fiber is current and do operations on it.
C++ async ops will always store which fiber is current to know how to resume it later. And before a fiber is resumed, this info is stored at a known lua registry index so future async ops will get to know about it too. The reason why the current fiber needs to be remembered, and why we can’t rely on the L argument passed to C functions registered at the VM, is that L will point to the wrong lua thread as soon as the user wraps some function in a coroutine.
This design works well because we don’t mix responsibilities of the scheduler with user code (as is the case for Fiber#resume in Ruby, which would be better served by a Fiber#spawn() that accepts post/dispatch execution policies and would avoid the unsound (un-)parking ideas altogether).
Asynchronous event notification
Some events are intrusive and will be generated even when no thread/fiber asked for them. The classic example is UNIX signals. A sighandler must be registered to handle them, but that raises the question: from which thread are these functions called? In the C world there are multiple answers:
- SIGEV_SIGNAL: The handler will be called asynchronously from any thread. That means a lot of restrictions on what a sighandler can do.
- SIGEV_THREAD: The handler will be called from an unspecified thread. Now we have far fewer restrictions, but some still exist (e.g. unsafe thread-local variables and thread cancelability state).
- SIGEV_KEVENT: The gold standard for event multiplexing in the C world.
Generally the need for asynchronous events stems from bad design and should be avoided. However, when integrating lua code with existing libraries we must deal with asynchronous events now and then. Emilua reserves a lua coroutine/thread in which no suspension is ever allowed and that gives the lua user a mix of SIGEV_SIGNAL and SIGEV_THREAD restrictions. From the handler the user can notify a condition variable to achieve friction-less handling from a different fiber, similar to what SIGEV_KEVENT enables.
From the C++ side, one just needs to get the asynchronous event (lua) thread and rely on lua_pcall() (no need for complex lua_resume() handling, nor fiber APIs).
LUA_ERRMEM
Lua code cannot recover from allocation failures. As an example (and single-VM only):
my_mutex:lock()
scope_cleanup_push(function() my_mutex:unlock() end)
If the VM fails to allocate the closure passed to scope_cleanup_push(), my_mutex will be kept locked and the lua code inside that VM will be in an unrecoverable state. There’s no pattern or ordering to make resource management work here as allocation failures can happen almost anywhere and we then inherit some constraints and reasoning from preemptive scheduling. The only option (and this applies to any allocation failure reported by the lua VM when running arbitrary user code) is to terminate the VM from the C++ side.
When lua_close() is called, there is no guarantee pending operations will be canceled as they might hold strong references to the underlying IO object, preventing its destructor from getting called. Therefore, the vm_context structure also holds an intrusive container of polymorphic elements which are destroyed after lua_close() is called and can be used to register cleanup code to avoid such leaks. If the operation finishes, the IO object is free to reclaim its own objects from this container and use them for other purposes.
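The register/reclaim/destroy cycle described above can be modelled as follows. This is a sketch under assumptions: a std::list of std::unique_ptr replaces the intrusive container, and vm_cleanup_sketch, pending_operation_sketch and cancel_on_destroy are hypothetical names.

```cpp
#include <cassert>
#include <list>
#include <memory>
#include <utility>

// Polymorphic base for cleanup elements (hypothetical name).
struct pending_operation_sketch {
    virtual ~pending_operation_sketch() = default;
};

class vm_cleanup_sketch {
public:
    using container = std::list<std::unique_ptr<pending_operation_sketch>>;
    using handle = container::iterator;

    // Registers cleanup code for an operation that is about to start.
    handle add(std::unique_ptr<pending_operation_sketch> op) {
        assert(op);
        return pending_.insert(pending_.end(), std::move(op));
    }

    // Called by the IO object when its operation finished normally.
    std::unique_ptr<pending_operation_sketch> reclaim(handle h) {
        auto op = std::move(*h);
        pending_.erase(h);
        return op;
    }

    void close() {
        // lua_close(L) would be called here in the real runtime.
        pending_.clear(); // destructors run the registered cleanup code
    }

private:
    container pending_;
};

// Example element: its destructor cancels a (pretend) IO operation.
struct cancel_on_destroy : pending_operation_sketch {
    explicit cancel_on_destroy(bool& flag) : cancelled(flag) {}
    ~cancel_on_destroy() override { cancelled = true; }
    bool& cancelled;
};
```

Elements still registered when close() runs get destroyed and thereby cancel their operations. Elements reclaimed earlier are untouched and can be reused by the IO object.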
lua_CFunction objects should never call lua_close(). If they detect LUA_ERRMEM, all they have to do is mark the flags field from vm_context and suspend the fiber. The host will take care of closing lua_State* and extra cleanup when it recovers control of the thread.
The other side of the coin is detecting LUA_ERRMEM. All interactions with the VM from the C API happen through the virtual stack, so naturally that’s the first concern. You must not push anything on the stack if there’s no free stack slot available. To check for such slot space, there’s lua_checkstack().
The usual C function signature is not enough to convey all the semantics required by the Lua C API. In the Functions and Types section of the manual, we find the following information:
Here we list all functions and types from the C API in alphabetical order. Each function has an indicator like this: [-o, +p, x]
[…] The third field, x, tells whether the function may throw errors: '-' means the function never throws any error; 'm' means the function may throw an error only due to not enough memory; 'e' means the function may throw other kinds of errors; 'v' means the function may throw an error on purpose.
The 5.1 signature for lua_checkstack() is:
int lua_checkstack(lua_State *L, int extra); // [-0, +0, m]
That’s obviously bogus. If lua_checkstack() can throw on ENOMEM, there is no possible safe interaction with the VM. That’s, plain and simple, a bug. This bug was fixed in Lua 5.2 when the signature changed to:
int lua_checkstack(lua_State *L, int extra); // [-0, +0, -]
Lua 5.2 received a few other improvements concerning ENOMEM such as obsoleting lua_cpcall() by introducing light C functions. API-wise, Lua 5.2 was a great release as it fixed many shortcomings.
You don’t always need to call lua_checkstack() before doing anything, thanks to at least LUA_MINSTACK free stack slots being guaranteed for you when the VM calls into your lua_CFunction objects. And here’s where things start to get tricky. Consider the following Lua code:
coroutine.wrap(function()
    spawn(function()
        print('Hello World')
    end)
end)()
The underlying C function implementing spawn() is exposed to 3 different lua_State* handles:
- Current fiber: get_vm_context(L).current_fiber(). The one that calls coroutine.wrap().
- Inner coroutine: The L parameter from lua_CFunction. The one that calls spawn().
- New fiber: lua_newthread(L) return value. The one to print “Hello World”.
If lua_error() is called on L, the stack for L will be in a completely deterministic state. Anything this lua_CFunction object pushed on the stack will be popped and the whole pcall()-chain on the state L will be respected too. However, lua_error() might be called indirectly through other API functions. Here’s the signature for lua_newtable():
void lua_newtable(lua_State *L); // [-0, +1, m]
As we’ve seen previously:
'm' means the function may throw an error only due to not enough memory
“Throw” here means something like a call to lua_error() (LUAI_THROW to be more accurate). That’s the pcall()-chain and each lua_State has its own (this property won’t change even if you compile the Lua VM as C++ code). This independent pcall()-chain for each lua_State is not a limitation from the C API, but an accurate model of the underlying machinery happening in Lua code itself. Consider the following snippet:
c1 = coroutine.create(function()
    pcall(function()
        -- ...
    end)
end)
If c1 is suspended in the middle of pcall(), it retains this private pcall()-chain that doesn’t get mixed with pcall()-chains from other coroutines (i.e. the other lua_State* handles). Therefore the C API accurately maps the language behaviour on retaining a private pcall()-chain for each lua_State and we can’t really expect any different behaviour here. Lua documentation on the issue has been ironed out little by little throughout its releases. Lua 5.3 was the one to finally explicitly state the behaviour we just described:
The panic function, as its name implies, is a mechanism of last resort. Programs should avoid it. As a general rule, when a C function is called by Lua with a Lua state, it can do whatever it wants on that Lua state, as it should be already protected. However, when C code operates on other Lua states (e.g., a Lua argument to the function, a Lua state stored in the registry, or the result of lua_newthread), it should use them only in API calls that cannot raise errors.
In short, that means our spawn() implementation that is exposed to the {L, current fiber, new fiber} triple would throw to the wrong pcall()-chain if it calls lua_newtable(new_fiber). The solution is to use lua_xmove() when necessary and maintain rigorous discipline as to which C API functions are called on “foreign” lua_State* handles, paying very special attention to their respective throw specifications. As for the discipline required, Rici Lake wrote a good summary on the lua-users wiki:
There are quite a number of API functions which will never throw a Lua error. API functions that throw errors are identified in the reference manual as of 5.1.3. First, none of the stack adjustment functions throw errors; this includes lua_pop, lua_gettop, lua_settop, lua_pushvalue, lua_insert, lua_replace and lua_remove. If you provide incorrect indexes to these functions, or you haven’t called lua_checkstack, then you’re either going to get garbage or a segfault, but not a Lua error.
None of the functions which push atomic data (lua_pushnumber, lua_pushnil, lua_pushboolean and lua_pushlightuserdata) ever throw an error. API functions which push complex objects (strings, tables, closures, threads, full userdata) may throw a memory error. None of the type enquiry functions (lua_is*, lua_type and lua_typename) will ever throw an error, and neither will the functions which set/get metatables and environments. lua_rawget, lua_rawgeti and lua_rawequal will also never throw an error. Aside from lua_tostring, none of the lua_to* functions will throw an error, and you can avoid the possibility of lua_tostring throwing an out of memory error by first checking that the object is a string, using lua_type. lua_rawset and lua_rawseti may throw an out of memory error. The functions which may throw arbitrary errors are the ones which may call metamethods; these include all of the non-raw get and set functions, as well as lua_equal and lua_lt.
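One way to keep that discipline visible in code is a small lookup of throw specifications. The sketch below is only an illustration: api_entry_sketch and safe_on_foreign_state() are hypothetical helpers, and the table holds a handful of entries using the third indicator field ('-', 'm', 'e', 'v') as given by the Lua 5.1 reference manual.

```cpp
#include <cassert>
#include <cstring>

// One row per C API function: its name and the third indicator field.
struct api_entry_sketch {
    const char* name;
    char throws; // '-', 'm', 'e' or 'v'
};

// A few entries only; a real table would cover every function in use.
constexpr api_entry_sketch lua51_specs[] = {
    {"lua_pushnil", '-'},
    {"lua_pushnumber", '-'},
    {"lua_rawget", '-'},
    {"lua_xmove", '-'},
    {"lua_rawset", 'm'},
    {"lua_newtable", 'm'},
    {"lua_pushlstring", 'm'},
    {"lua_newthread", 'm'},
    {"lua_checkstack", 'm'},
    {"lua_gettable", 'e'},
    {"lua_error", 'v'},
};

// True only for functions documented as never throwing, i.e. the ones that
// may be called on a lua_State* other than the L handed to the C function.
inline bool safe_on_foreign_state(const char* name) {
    assert(name);
    for (const auto& entry : lua51_specs) {
        if (std::strcmp(entry.name, name) == 0)
            return entry.throws == '-';
    }
    return false; // unknown functions are treated as unsafe
}
```

With such a table, lua_xmove() and lua_pushnil() pass the check while lua_newtable() does not, matching the spawn() discussion: build the table on L, then move it to the new fiber.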
On a side note, Lua 5.2 added the following:
If an error happens outside any protected environment, Lua calls a panic function (see lua_atpanic) and then calls abort, thus exiting the host application. Your panic function can avoid this exit by never returning (e.g., doing a long jump to your own recovery point outside Lua).
The panic function runs as if it were a message handler (see §2.3); in particular, the error message is at the top of the stack. However, there is no guarantees about stack space. To push anything on the stack, the panic function should first check the available space (see §4.2).
That’s actually behaviour that already existed in version 5.1. An alternative panic function could just throw a C++ exception to implement this __attribute__((noreturn)) behaviour. However, this hypothetical panic function is not an alternative solution to our problems due to the combination of the following facts:
- As described elsewhere in this document, we require lua_error() to act as-if it throws a C++ exception so our destructors are properly called. That requires the underlying Lua VM (LuaJIT in our case) to throw and catch C++ exceptions.
- A C++-throw is triggered from lua_newtable(L). The type thrown here is internal to the Lua VM and we cannot throw it ourselves. LUA_ERRMEM information is correctly preserved.
- A panic is triggered from lua_newtable(new_fiber). Our panic function would in turn discard LUA_ERRMEM and throw a generic C++ exception.
- On lua_newtable(new_fiber) hitting LUA_ERRMEM, the L's C++-catch handler wouldn’t receive the original error (LUA_ERRMEM). That means information loss. That means our host code (the code that first calls into the Lua VM) won’t call lua_close() (when it should) as its lua_pcall()/lua_resume() call might not report the correct error reason (LUA_ERRMEM). That also means the possibility to unwind the wrong number of cascaded pcall() blocks (a pcall() from Lua code is not supposed to handle LUA_ERRMEM, if correctly detected, so the number of blocks unwound differs whenever LUA_ERRMEM is involved).
- Although LuaJIT can catch generic C++ exceptions, it lacks context and cannot possibly restore the stack state on each lateral lua_State* handle at play (the triple {L, current fiber, new fiber} in our case). If the spawn() lua_CFunction had a value pushed on the current_fiber stack when a new_fiber panic-triggered exception is raised, the value on the current_fiber stack wouldn’t be properly popped by the time L handles the C++ exception (and do remember that L is executing nested on top of current_fiber so you can already imagine the chaos here). In short, the Lua VM needs our cooperation to maintain some invariants.
- By wrapping these calls into our own C++ catch blocks we could work around some of these issues, but the thought that thread control would still return to the Lua VM one last time after the panic handler got called is just too scary and previous mailing list threads on this topic weren’t very reassuring. For one, if the exception is panic-triggered by current_fiber, we won’t know what remains on this stack (except for the stack top), but that’s exactly the lua_State that the host is operating on when our lua_CFunction got called on L. Even if control does return safely to our host it would still have problems to deal with there.
That covers our policy when implementing lua_CFunction objects. In short, we cannot resort to Lua panics here and the only real solution is the rigorous discipline on C API usage mentioned earlier.
Now let’s talk about our policy for host code. The Lua suspending IO functions are implemented by querying which fiber is current and scheduling a lua_resume() on it as the callback for some Boost.Asio supported C++ async_*() function (plus a ton of other details properly documented elsewhere in this document such as strand handling and so on). The initiating function is called from the Lua VM, but the callback is not. The callback will act as the host.
Back to lua_resume(), this function itself doesn’t throw:
int lua_resume(lua_State *L, int narg); // [-?, +?, -]
However, the code that runs before lua_resume() might throw. This is the code that pushes the arguments to the coroutine. For instance, if a string is one of the coroutine parameters, you will have to use C API that might throw on ENOMEM:
void lua_pushlstring(lua_State *L, const char *s, size_t len); // [-0, +1, m]
It’s no use trying to call lua_pcall() to wrap lua_pushlstring() here. lua_status() now returns LUA_YIELD and that means you can’t use lua_pcall() on this lua_State* handle. You can’t create a new handle and use the lua_xmove() trick either as lua_newthread() itself can throw on ENOMEM:
lua_State *lua_newthread(lua_State *L); // [-0, +1, m]
Fear not, for here is the place where we can finally use a panic function to throw a custom C++ exception. There are only two caveats. The first one is related to LuaJIT having such tight integration with native exceptions that it makes (almost) no distinction between lua_pcall() and C++ catch frames[2]. The net result is that you can use C++'s catch-all blocks and then no panic function will ever be involved (by now you must be feeling that we travelled to the farthest candy shop in the kingdom only to make a U-turn one block away from the destination and go to the neighbour’s candy shop instead). Despite the lack of a real panic function throwing our own exceptions, I’ll still use the same previous terminology (i.e. panic-triggered exceptions).
The second caveat is a little charming race to avoid. The completion handler doing the host job is executed through the strand that protects the VM. If we let the exception escape the completion handler, another thread might try to use the VM before we have the chance to close it. In other words, the following approach has a race and thus is not used:
for (;;) {
    try {
        // Completion handler allows the panic
        // exception to escape here.
        ioctx.run();
        break;
    } catch (...) {
        // This is a bug. This code isn't executed
        // through the VM strand. A pending operation
        // that just finished could try to access
        // `current` from another thread while we're
        // here.
        vm_context* current = ...;
        current->close();
        continue;
    }
}
Therefore, it is the responsibility of the completion handler to handle the panic-triggered exception (sorry about the boilerplate on your side, but that’s the way it is).
try {
    // lua_push*() calls
} catch (...) {
    vm_ctx->close();
    return;
}
int res = lua_resume(fiber, narg);
That is enough to cover the policy for host code and finally finish the LUA_ERRMEM discussion too.
Channels and resources
The biggest challenge to cross-VM resource management is the multi-strand sync primitives (i.e. the channels). They have to execute code that jumps from one strand to another to finish their jobs. If the associated execution context already finished, then they would be stuck forever. The solution is for them to keep the execution context busy through a work guard.
However some rules are needed to make this work:
- Rx-channels (i.e. inbox) don’t keep work guards.
- Tx-channels keep a work guard to the other end while they are alive. But they only keep a work guard to their own strands when they have an active operation.
If the tx-channels are not closed, they will prevent execution contexts that are no longer necessary from being destroyed. But that’s the best we can do. We could periodically call the GC to free unused channels, but so will lua code anyway and there’s nothing left for us to do on the C++ side. A good practice for lua code would be to add the following chunk at the beginning of the fiber who’s gonna process the actor messages:
scope_cleanup_push(function() inbox:close() end)
Extra rules for channels management:
- As an extra safety measure, if the main fiber finishes and inbox wasn’t imported, the runtime closes it.
- Channels (tx and rx) also get closed when the VM is terminated.
- Channels must only upgrade their weak references to vm_context once they have migrated to the target strand. Otherwise, they would prevent the VM from auto-closing (and hairy problems would follow).
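The work guard rules for tx-channels can be modelled with a counter and an RAII guard. Everything below is a hypothetical sketch: context_sketch stands in for an execution context, work_guard_sketch plays the role of an Asio work guard, and tx_channel_sketch only tracks which guards are held.

```cpp
#include <cassert>
#include <memory>

// Stand-in for an execution context: it may finish once no work is outstanding.
struct context_sketch {
    int outstanding_work = 0;
    bool may_finish() const { return outstanding_work == 0; }
};

// RAII guard that keeps a context busy for as long as it is alive.
class work_guard_sketch {
public:
    explicit work_guard_sketch(context_sketch& ctx) : ctx_(ctx) { ++ctx_.outstanding_work; }
    ~work_guard_sketch() { --ctx_.outstanding_work; }
    work_guard_sketch(const work_guard_sketch&) = delete;
    work_guard_sketch& operator=(const work_guard_sketch&) = delete;

private:
    context_sketch& ctx_;
};

class tx_channel_sketch {
public:
    // A tx-channel guards the receiving end for its whole lifetime.
    tx_channel_sketch(context_sketch& own, context_sketch& remote)
        : own_(own), remote_guard_(new work_guard_sketch(remote)) {}

    // Its own context is only guarded while an operation is active.
    void begin_send() {
        assert(remote_guard_ && "channel already closed");
        if (!own_guard_)
            own_guard_.reset(new work_guard_sketch(own_));
    }
    void end_send() { own_guard_.reset(); }

    void close() {
        own_guard_.reset();
        remote_guard_.reset();
    }

private:
    context_sketch& own_;
    std::unique_ptr<work_guard_sketch> remote_guard_;
    std::unique_ptr<work_guard_sketch> own_guard_;
};
```

The model makes the leak risk concrete: until close() runs (or the channel is destroyed), the receiver’s context cannot finish, even if nothing will ever be sent again.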
The exception mechanism
C++ exceptions must not be used to propagate errors across lua/C++ frames. However, lua errors may simply trigger stack unwinding (the code makes heavy use of setjmp()) and we do depend on RAII to keep the code correct.
It is assumed that any call to lua_error() will behave as-if it throws a C++ exception (thus triggering our destructors). We require some support from the luaJIT VM for this. Specifically, we can’t rely on the “no interoperability” category from their “exception” section on the “extensions” page because of the following restriction:
Throwing Lua errors across C++ frames will not call C++ destructors.
To make matters worse, the feature we do depend on only appears in the “full interoperability” category:
Throwing Lua errors across C++ frames is safe. C++ destructors will be called.
A different approach would be to implement an exception mechanism in terms of coroutines (although it’d add to code complexity):
Exceptions < Coroutines < Continuations
Exceptions can be thought of as a subclass of coroutines. You can implement an exception mechanism with coroutines.
(leafo.net)
But this path would be a dead-end as native lua errors would still be reported through lua_error(). For luaJIT, lua_error() plays well with our code because:
The LuaJIT VM is fully resumable. This means you can yield from a coroutine even across contexts, where this would not possible with the standard Lua 5.1 VM: e.g. you can yield across pcall() and xpcall(), across iterators and across metamethods.
Were it not for this guarantee, the project would be monstrous. To understand why this guarantee is important, let’s unravel the fundamental pattern for fibers support. We always implicitly wrap every user code inside a lua coroutine:
local fib = coroutine.create(user_fn)
This way, async operations can suspend the calling fiber and resume it later. But user_fn might very well contain a pcall() and execute our suspending async function inside it:
function user_fn()
    pcall(function()
        io_obj:emilua_async_op()
    end)
end
The exception mechanism should not block our ability to suspend fibers. When our own native code calls lua_yield() to suspend a fiber, the suspension mechanism should be able to cross the pcall() barrier.
To wrap it all up so far, the standard lua exception mechanism is used to report errors. The only difference is that emilua will lua_error() a structured error object inspired by std::error_code for our own errors.
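To give a feel for what “inspired by std::error_code” could mean, the sketch below splits an error code into plain fields that a lua table could carry. The field names (code, category, message) and the helper to_error_fields() are assumptions made for illustration; the layout of the project’s actual error object is not described here.

```cpp
#include <cassert>
#include <string>
#include <system_error>

// Hypothetical flattened view of a std::error_code.
struct lua_error_fields_sketch {
    int code;             // numeric value within the category
    std::string category; // category name, e.g. "generic"
    std::string message;  // human-readable description
};

// Extracts the three pieces of information an error_code carries.
inline lua_error_fields_sketch to_error_fields(const std::error_code& ec) {
    assert(ec && "only failures are turned into lua errors");
    return {ec.value(), ec.category().name(), ec.message()};
}
```

Keeping the category next to the numeric value preserves what makes std::error_code useful: two errors with the same number from different categories remain distinguishable on the lua side.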
Things would get a little tricky on the following point that we raised previously though:
[…] and we do depend on RAII to keep the code correct.
Imagine we have some code like the following:
class reference
{
public:
    reference() : L(nullptr) {}

    reference(lua_State* L)
        : L(L)
        , idx(luaL_ref(L, LUA_REGISTRYINDEX))
    {}

    ~reference()
    {
        if (!L)
            return;

        luaL_unref(L, LUA_REGISTRYINDEX, idx);
    }

    reference(reference&& o)
        : L(o.L)
        , idx(o.idx)
    {
        o.L = nullptr;
    }

    lua_State* state() const
    {
        return L;
    }

    void push() const
    {
        assert(L);
        lua_pushinteger(L, idx);
        lua_gettable(L, LUA_REGISTRYINDEX);
    }

private:
    lua_State* L;
    int idx;
};
If an object of this type has its destructor called on lua_error()-triggered stack unwinding, it means we’re manipulating the lua_State* (luaL_unref(L) in this example) on stack unwinding (i.e. outside of a lua-catch block which would be just after a pcall() return). If the VM is not in a safe state for manipulations at this moment (this scenario just doesn’t happen if you stick with plain C, which is the target lua was developed for) then we’re screwed. Luckily, the VM can handle such situations just fine, as hinted in the luaJIT documentation:
static int wrap_exceptions(lua_State *L, lua_CFunction f)
{
  try {
    return f(L);  // Call wrapped function and return result.
  } catch (const char *s) {  // Catch and convert exceptions.
    lua_pushstring(L, s);
  } catch (std::exception& e) {
    lua_pushstring(L, e.what());
  } catch (...) {
    lua_pushliteral(L, "caught (...)");
  }
  return lua_error(L);  // Rethrow as a Lua error.
}
Recommended usage pattern for LUAJIT_MODE_WRAPCFUNC
This guarantee is promised again (although this version of the promise is read-only) in their “extensions” page (and again only at the full interoperability category):
Lua errors can be caught on the C++ side with catch(...). The corresponding Lua error message can be retrieved from the Lua stack.
The final piece for our puzzle is related to async ops converting std::error_code into lua exceptions (i.e. lua_error()). The completion handler for async ops is not called in a lua context, so they cannot just call lua_error() and hope the correct context will catch the exception (there’s no API similar to resume_with() from Boost.Context). They need to return control to the native code that suspended the fiber so it can throw a lua exception before control returns to lua code.
This guarantee used to exist on luaJIT 1.x (which included Coco):
Now, if the current coroutine has an associated C stack, lua_yield() returns the number of arguments passed back from the resume.
The lack of allocated C stacks brings more complications to the implementation
that will be discussed later. lua_yieldk() from Lua 5.2 would be enough for us
(and cheaper!), but we don’t have that either.
Yet another option would be to set a one-time hook to be called immediately before resuming the lua coroutine, but it’d present challenges in the future if we ever add debugging support, so it is avoided.
And the solution Emilua gets away with is wrapping the C function inside a lua function. The C function returns a 2-tuple. If the first element is not nil, the lua function itself will take care of using it to raise an error.
local error, native = ...
return function(...)
    local e, v = native(...)
    if e then
        error(e)
    else
        return v
    end
end
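For readers more at home in C++ than in lua, the same split can be modelled without any lua at all. The sketch below is only an analogy and not code from the project: native_op() and wrapped_op() are hypothetical names, the first standing in for the C function that reports failure through the first element of its result, the second standing in for the thin wrapper that turns that element into an exception.

```cpp
#include <cassert>
#include <optional>
#include <system_error>
#include <utility>

// Hypothetical stand-in for the native function: it never throws and
// reports failure through the first element of the returned pair.
inline std::pair<std::optional<std::error_code>, int> native_op(bool fail)
{
    if (fail)
        return {std::make_error_code(std::errc::operation_canceled), 0};
    return {std::nullopt, 42};
}

// Hypothetical stand-in for the lua wrapper: if the first element is
// set, raise it in the caller's own context; otherwise return the value.
inline int wrapped_op(bool fail)
{
    auto [e, v] = native_op(fail);
    if (e)
        throw std::system_error(*e);
    return v;
}
```

The point of the split is that the error is raised by code that already runs in the right context, so the native side never has to unwind on behalf of someone else.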
User-coroutines
Let’s jump straight to a topic that gives some sense of continuity to the
previous section. The pcall() barrier is not the only barrier that the user
can insert to prevent lua_yield() from suspending the fiber. The user might
very well just wrap calls using coroutine.create():
function user_fn()
    coroutine.create(function()
        io_obj:emilua_async_op()
    end)
end
RuleLua’s |
The problem is solved by exposing a different coroutine module: a small shim
over the original one. This version inspects this_fiber's suspension reason
(native code or lua code).
Conceptually, the implementation looks like this:
function coroutine.resume(co, ...)
    if _G.busy_coroutines[co] then
        -- CORUN
        error("cannot resume running coroutine", 2)
    end
    local args = {...}
    while true do
        local ret = {raw_coroutine.resume(co, unpack(args))}
        if ret[1] == false then
            return unpack(ret)
        end
        if _G.this_fiber.native_yield then
            _G.busy_coroutines[co] = true
            args = {raw_coroutine.yield(unpack(ret, 2))}
            _G.busy_coroutines[co] = nil
        else
            return unpack(ret)
        end
    end
end

function coroutine.yield(...)
    if _G.fibers[raw_coroutine.running()] ~= nil then
        error("bad coroutine", 2)
    end
    return raw_coroutine.yield(...)
end

function coroutine.status(co)
    if _G.busy_coroutines[co] then
        return "normal"
    end
    return raw_coroutine.status(co)
end

function coroutine.running()
    local co = raw_coroutine.running()
    if _G.fibers[co] ~= nil then
        -- Fiber's coroutines work just like the main coroutine
        return nil
    end
    return co
end

coroutine.create = ...
coroutine.wrap = ...
Dead fibers
When an exception escapes the fiber stack, the hook registered with
sys.set_uncaught_hook() is called. The default hook prints the stack trace to
stderr and additionally terminates the VM if the exception escaped from the
main fiber. If the custom hook itself fails, the default hook is then called
anyway.
Scope handlers are properly popped and called after the hook returns control of the thread to the runtime.
The hook is only called for detached fibers. Therefore, a different behaviour
can be chosen for each join()ed fiber. Also, if the fiber isn’t explicitly
detach()ed, the hook action will be deferred until some GC round.
There isn’t a pcall block around the whole program. lua_resume is enough and
it has the nice property of not unwinding the stack so it can be examined from
the error handler. A new lua thread is created to execute the uncaught-hook
while it has the chance to examine the unchanged error’ed call stack.
The hook mechanism isn’t implemented yet.
Functions that receive a lua callback
There are plenty of functions that have a lua closure as a parameter
(e.g. pcall(), scope(), …). If we blindly implement them in plain C, they
will configure a non-leaf C stack frame which we cannot suspend.
To avoid the C stack frame in the middle of the call-stack altogether, we implement (parts of) these functions in lua, not C. The problem is then how to expose sensitive raw resources that the C functions would use. One of the goals is to not let these resources escape elsewhere.
A quick way to achieve it is by having a lua bootstrap function/chunk to create closures and later change their upvalues through C:
local private_resource = ...
return function()
    -- use `private_resource`
end
This approach is naive as luaJIT 2.x does not implement some lua functions
(i.e. the sensitive raw resources that we want to keep private) as C functions
and we cannot feed them as upvalues for the imported bytecode. For instance, we
have this behaviour for pcall():
lua_pushcfunction(L, luaopen_base);
lua_call(L, 0, 0);
lua_getglobal(L, "pcall");
lua_CFunction pcall_addr = lua_tocfunction(L, -1);
assert(pcall_addr == nullptr); // :-(
Therefore the lua bytecode won’t be a closure with uninitialized upvalues per se, but a function that receives the private resources and returns the needed closure. It is an extra step on startup, but at least we save some cycles by compiling the bytecode with stripped debug info in the project build stage.
Process environment
A part of the process environment (e.g. UNIX signals) should be under complete control of the program and no external library should meddle with it. However, no protections will be provided to enforce this good practice.
VM settings inheritance
New actors should inherit generic customization points for the GC (e.g. step count and period) and the JIT. They should also inherit allocator settings, but they must not be prevented from creating new actors with higher allocation quotas (unless of course the global pool is already at its limit).
Lua 5.2/LuaJIT extensions
We use some C functions found only on Lua 5.2+ and/or LuaJIT:
- luaL_traceback()
- luaopen_bit()
- luaopen_jit()
- luaopen_ffi()
2GB addressing limit
luaJIT has a serious 2GB limit that has been fixed on forks. By default, the
broken 64-bit addressing mode is hidden behind LUAJIT_ENABLE_GC64. Emilua might
consider moving to moonjit if its author doesn’t try to part away from the
lua 5.1 core and keeps himself distant from the 5.3+ syntactic explosion
madness. I don’t like this C++-like culture expanding to lua or other languages
(kudos to Go here for avoiding it).
JIT parameters
The JIT parameters are also changed from the old defaults:
maxtrace=1000
maxrecord=4000
maxmcode=512 -- in KB
to:
maxtrace=8000
maxrecord=16000
maxmcode=40960 -- in KB
Locales
A recent POSIX standard specified anemic per-thread and per-function locale support, but, aside from this anemic support, C uses the same locale globally for the whole process.
Meanwhile, C++ has somewhat usable support for multiple locales per process (and an extra global one that also affects the global C locale).
Functions such as perror() and strerror() will query LC_MESSAGES from the
global C locale. However, the sole function to query this attribute,
setlocale(), is not thread-safe, so we shouldn’t change the locale after the
program starts and minimal initialization to the process state is done. Changing
the global locale is highly unsafe and such an API will not be exposed to Lua
code.
The thread-safe C++ locales export functionality for LC_MESSAGES through the
facet std::messages. This facet allows one to open system-defined message
catalogs, and get translation messages for them. This facet exposes no
equivalent for the query setlocale(LC_MESSAGES, NULL). Even if we query it at
the beginning of the program and try to attach a new custom facet to the global
locale object, this will create a nameless locale. Unnamed global C++ locales
will break LC_MESSAGES for the C ecosystem (e.g. perror() will no longer
print localized messages). Therefore custom facets are out of the question.
A direct call to setlocale(LC_MESSAGES, NULL) is avoided too because ISO C++
doesn’t define the macro LC_MESSAGES. To query the current LC_MESSAGES we
just look for LC_MESSAGES in the current C++ locale’s name. This approach
doesn’t interfere with the C ecosystem, and also paves the way for multiple
per-process locales.
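A minimal sketch of such a lookup follows. It is not the project’s code: messages_locale() and current_messages_locale() are hypothetical helpers, and the sketch assumes glibc-style composite locale names of the form LC_CTYPE=…;LC_MESSAGES=…;… (the format of a locale name is implementation-defined). A plain name with no per-category entries is taken to apply to every category.

```cpp
#include <cassert>
#include <locale>
#include <string>
#include <string_view>

// Hypothetical helper: extracts the LC_MESSAGES value from a locale name.
// Assumes glibc-style composite names ("LC_CTYPE=a;LC_MESSAGES=b;...").
// A plain name such as "pt_BR.UTF-8" is returned unchanged, as it
// applies to all categories.
inline std::string messages_locale(std::string_view name)
{
    constexpr std::string_view key = "LC_MESSAGES=";
    std::size_t pos = 0;
    while (pos < name.size()) {
        std::size_t end = name.find(';', pos);
        if (end == std::string_view::npos)
            end = name.size();
        std::string_view item = name.substr(pos, end - pos);
        if (item.substr(0, key.size()) == key)
            return std::string(item.substr(key.size()));
        pos = end + 1;
    }
    // No LC_MESSAGES entry found: fall back to the whole name.
    return std::string(name);
}

// Reads the name of the current global C++ locale. Only std::locale is
// touched here; setlocale() is never called.
inline std::string current_messages_locale()
{
    return messages_locale(std::locale().name());
}
```

An unnamed locale reports the name "*", which this sketch passes through untouched, so a caller would have to treat that value as "unknown".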
One can find the list of POSIX environment variables that affect the process' locale at https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap08.html#tag_08_02. The format for these variables is defined as:
[language[_territory][.codeset][@modifier]]
This format is compatible with RDF’s Turtle where LANGTAG is defined as:
LANGTAG ::= '@' [a-zA-Z]+ ('-' [a-zA-Z0-9]+)*
And it matches the semantics for BCP47 definition:
obs-language-tag = primary-subtag *( "-" subtag )
primary-subtag = 1*8ALPHA
subtag = 1*8(ALPHA / DIGIT)
The registry of subtags is maintained by IANA at https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry.
So LC_MESSAGES=pt_BR becomes Turtle’s "literal"@pt-BR (and at least the
subtag is case sensitive).
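The mapping can be sketched as a small string transformation. The helper below is hypothetical (to_language_tag() is not a name from the project) and makes two simplifying assumptions: the codeset and the modifier are dropped rather than translated, and the values C and POSIX are treated as carrying no language at all.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical helper: maps a POSIX locale value of the form
// [language[_territory][.codeset][@modifier]] to a language tag by
// dropping the codeset and modifier and replacing '_' with '-'.
// Returns an empty string for "C", "POSIX" and empty input.
inline std::string to_language_tag(std::string_view posix_locale)
{
    // Keep only what comes before the first '.' or '@'.
    std::string_view base =
        posix_locale.substr(0, posix_locale.find_first_of(".@"));
    if (base.empty() || base == "C" || base == "POSIX")
        return {};

    std::string tag(base);
    for (char& c : tag) {
        if (c == '_')
            c = '-';
    }
    return tag;
}
```

Dropping the modifier loses information for locales such as sr_RS@latin, where it names a script; a fuller mapping would have to translate it into a proper subtag.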
A Turtle language-tagged string ceases to be of the datatype http://www.w3.org/2001/XMLSchema#string. Its datatype will be http://www.w3.org/1999/02/22-rdf-syntax-ns#langString. If this is a problem for your application, do not use Turtle language-tagged strings.
For more information about C++ locales, the following links are relevant:
Open questions
- Describe the behaviour for sys.exit() (for main and secondary VMs). Should it call the cancellator for every active operation? Should it exit the application?
Extra caution to take when writing plug-ins
Always keep in mind:
- If you enable your IO object to be sent over channels, it’ll also be able to migrate to a different asio::io_context and you must take care to keep a work guard to the original asio::io_context.
- Pending operations must hold a strong reference to vm_context and a work guard, directly or indirectly, to vm_context.strand().
- IO objects (channels included) by themselves must not hold any strong references to their own vm_context (this cycle would prevent auto-closing the VM and associated channels). Operation initiation is the perfect time to upgrade weak references (if any) to strong ones.
- Pending operations must not trust L from the initiating operation to decide which fiber to wake up later on. They must resort, at initiation time, to the vm_context API. Check the simple sleep_for() implementation for a code template.
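The ownership rules above can be illustrated with standard smart pointers alone. The sketch below is not the project’s API: vm_context here is a hypothetical stand-in with a single counter, io_object and async_op() are invented names, and the work guard and strand are left out entirely. It only shows the weak-to-strong upgrade at initiation time and the strong reference living inside the pending operation.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>

// Hypothetical stand-in for the real vm_context; only its lifetime
// matters for this sketch.
struct vm_context
{
    int wakeups = 0;
};

// An IO object keeps only a weak reference, so it never forms a cycle
// that would keep its own VM alive.
struct io_object
{
    std::weak_ptr<vm_context> vm;

    // Operation initiation: upgrade the weak reference to a strong one
    // and move it into the completion handler, which then keeps the VM
    // alive while the operation is pending. Returns an empty function
    // if the VM is already gone.
    std::function<void()> async_op()
    {
        std::shared_ptr<vm_context> strong = vm.lock();
        if (!strong)
            return {};
        return [strong = std::move(strong)]() { ++strong->wakeups; };
    }
};
```

Once the handler has run and is destroyed, the strong reference goes with it and the IO object is back to holding nothing but its weak one.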
Final note
Emilua software is complex. There should be no pursuit in indefinitely extending this base. Rather, we should search for stabilization and maturity (and also tooling around a solid base).
If you think there should be a nice lua library to handle IRC and what-not, by all means do write it, but write it as a separate lua library (or native plug-in), and compete against the free market of libraries. Do not submit a proposal to integrate it in the core. There are no batteries included. And there shall be no committee-driven development.
Likewise, we should be stuck in the current lua syntax (5.1 plus some extensions found in the beta branch of luaJIT 2.1[3]) forever. If you want more syntax, use a transpiler.