For the latest stable version, please use Emilua API 0.10! |
Sandboxes
Emilua provides support for creating actors in isolated processes using Capsicum, FreeBSD jails, Linux namespaces or Landlock. The idea is to prevent potentially exploitable code from accessing resources beyond what has been explicitly handed to them. That’s the basis for capability-based security systems, and it maps pretty well to APIs implementing the actor model such as Emilua.
Even modern operating systems are still somehow rooted in an age where we didn’t know how to properly partition computer resources adequately to user needs keeping a design focused on practical and conscious security. Several solutions are stacked together to somehow fill this gap and they usually work for most of the applications, but that’s not all of them.
Consider the web browser. There is an active movement that try to push for a future where only the web browser exists and users will handle all of their communications, store & share their photos, book hotels & tickets, check their medical history, manage their banking accounts, and much more… all without ever leaving the browser. In such scenario, any protection offered by the OS to protect programs from each other is rendered useless! Only a single program exists. If a hacker exploits the right vulnerability, all of the user’s data will be stolen. There is no real compartmentalisation.
The browser is part of a special class of programs. The browser is a shell. A shell is any interface that acts as a layer between the user and the world. The web browser is the shell for the www world. Www browser or not, any shell will face similar problems and has to be consciously designed to safely isolate contexts that distrust each other. The Emilua team is not aware of anything better than FreeBSD’s Capsicum to do just this. In the absence of Capsicum, we have Linux Landlock which can be used to build something close. Browsers actually use Linux namespaces which are older.
The API
Compartmentalised application development is, of necessity, distributed application development, with software components running in different processes and communicating via message passing.
Robert N. M. Watson, Jonathan Anderson, Ben Laurie, and Kris Kennaway
The Emilua’s API to spawn an actor lies within the reach of a simple function call:
local my_channel = spawn_vm(module)
Check the manual elsewhere to understand the details. As for sandboxes, the idea is to spawn an actor where no system resources are available (e.g. the filesystem is mostly empty, no network interfaces are available, no PIDs from other processes can be seen, …).
Consider the hypothetical sandbox
class:
local mysandbox1 = sandbox.new()
local my_channel = spawn_vm(mysandbox1:context(module))
mysandbox1:handshake()
That would be the ideal we’re pursuing. Nothing other than 2 extra lines of code at most under your application. All complexity for creating sandboxes taken care of by specialized teams of security experts. The Capsicum paper[1] released in 2010 analysed and compared different sandboxing technologies and showed some interesting figures. Consider the following figure that we reproduce here:
Operating system | Model | Line count | Description |
---|---|---|---|
Windows |
ACLs |
22350 |
Windows ACLs and SIDs |
Linux |
|
605 |
|
Mac OS X |
Seatbelt |
560 |
Path-based MAC sandbox |
Linux |
SELinux |
200 |
Restricted sandbox type enforcement domain |
Linux |
|
11301 |
|
FreeBSD |
Capsicum |
100 |
Capsicum sandboxing using |
Do notice that line count is not the only metric of interest. The original paper accompanies a very interesting discussion detailing applicability, risks, and levels of security offered by each approach. Just a few years after the paper was released, user namespaces was merged to Linux and yet a new option for sandboxing is now available. Fast-forward a few more years and we also have Linux Landlock which is even better than Linux namespaces. Within this discussion, we can discard most of the approaches — DAC-based, MAC-based, or too intrusive to be even possible to abstract away as a reusable component — as inadequate to our endeavour.
Out of them, Capsicum wins hands down. It’s just as capable to isolate parts of an application, but with much less chance to error (for the Chromium patchset, it was just 100 lines of extra C code after all). Unfortunately, Capsicum is not available in every modern OS.
Do keep in mind that this is code written by experts in their own fields, and their salary is nothing less than what Google can afford. 11301 lines of code written by a team of Google engineers for a lifetime project such as Google Chromium is not an investment that any project can afford. That’s what the democratization of sandboxing technology needs to do so even small projects can afford them. That’s why it’s important to use sound models that are easy to analyse such as capability-based security systems. That’s why it’s important to offer an API that only adds two extra lines of code to your application. That’s the only way to democratize access to such technology.
Rust programmers' vision of security is to rewrite the world in Rust, a rather unfeasible undertaking, and a huge waste of resources. In a similar fashion, Deno was released to exploit v8 as the basis for its sandboxing features (now they expect the world to be rewritten in TypeScript). The heart of Emilua’s sandboxing relies on technologies that can isolate any code (e.g. C libraries to parse media streams). |
Back to our API, the hypothetical sandbox
class that we showed earlier will
have to be some library that abstracts the differences between each sandbox
technology in the different platforms. The API that Emilua actually exposes as
of this release abstracts all of the semantics related to actor messaging,
work/lifetime accounting, process reaping, DoS protection, serialization, lots
of Linux namespaces details (e.g. PID1), and much more, but it still expects you
to actually initialize the sandbox.
The init.script
Every process carries associated credentials that enable operation on
system-wide addressable objects such as filesystem objects and sockets. We setup
a sandbox by disabling the ambient authority so the address space itself becomes
inaccessible. Sandboxed code thus should be run only after such setup already
completed successfully. The proper hook to perform this setup is
init.script
. init.script
runs right after the process is created.
After the sandboxed actor is up it can receive access to new resources through its inbox. If any security exploit is performed on the sandboxed code, then only the objects it has access to are rendered vulnerable (the damage is thus contained in its compartment).
Landlock (Linux)
local init_script = [[
local rules = C.landlock_create_ruleset{ handled_access_fs = {
"execute", "write_file" "read_file", "read_dir", "remove_dir",
"remove_file", "make_char", "make_dir", "make_reg", "make_sock",
"make_fifo", "make_block", "make_sym", "refer", "truncate" } }
set_no_new_privs()
C.landlock_restrict_self(rules)
]]
spawn_vm{
subprocess = {
init = { script = init_script }
}
}
Landlock as of now can only control access to filesystem objects, but future versions will be more complete.
Implementation details
The purpose of this section is to help you attack the system. If you’re trying to find security holes, this section should be a good overview on how the whole system works. |
If you find any bug in the code, please responsibly send a bug report so the Emilua team can fix it.
Message serialization
Emilua follows the advice from WireGuard developers to avoid parsing bugs by avoiding object serialization altogether. Sequenced-packet sockets with builtin framing are used so we always receive/send whole messages in one API call.
There is a hard-limit (configurable at build time) on the maximum number of members you can send per message. This limit would need to exist anyway to avoid DoS from bad clients.
Another limitation is that no nesting is allowed. You can either send a single non-nil value or a non-empty dictionary where every member in it is a leaf from the root tree. The messaging API is part of the attack surface that bad clients can exploit. We cannot afford a single bug here, so the code must be simple. By forbidding subtrees we can ignore recursion complexities and simplify the code a lot.
The struct used to receive messages follows:
enum kind
{
boolean_true = 1,
boolean_false = 2,
string = 3,
file_descriptor = 4,
actor_address = 5,
nil = 6
};
struct ipc_actor_message
{
union
{
double as_double;
uint64_t as_int;
} members[EMILUA_CONFIG_IPC_ACTOR_MESSAGE_MAX_MEMBERS_NUMBER];
unsigned char strbuf[
EMILUA_CONFIG_IPC_ACTOR_MESSAGE_MAX_MEMBERS_NUMBER * 512];
};
A variant class is needed to send the messages. Given a variant is needed anyway, we just adopt NaN-tagging for its implementation as that will make the struct members packed together and no memory from the host process hidden among paddings will leak to the containers.
The code assumes that no signaling NaNs are ever produced by the Lua VM to simplify the NaN-tagging scheme[2][3]. The type is stored in the mantissa bits of a signaling NaN.
If the first member is nil, then we have a non-dictionary value stored in
members[1]
. Otherwise, a nil
will act as a sentinel to the end of the
dictionary. No sentinel will exist when the dictionary is fully filled.
read()
calls will write to objects of this type directly (i.e. no intermediate
char[N]
buffer is used) so we avoid any complexity with code related to
alignment adjustments.
memset(buf, 0, s)
is used to clear any unused member of the struct before a
call to write()
so we avoid leaking memory from the process to any container.
Strings are preceded by a single byte that contains the size of the string that follows. Therefore, strings are limited to 255 characters. Following from this scheme, a buffer sufficiently large to hold the largest message is declared to avoid any buffer overflow. However, we still perform bounds checking to make sure no uninitialized data from the code stack is propagated back to Lua code to avoid leaking any memory. The bounds checking function in the code has a simple implementation that doesn’t make the code much more complex and it’s easy to follow.
To send file descriptors over, SCM_RIGHTS
is used. There are a lot of quirks
involved with SCM_RIGHTS
(e.g. extra file descriptors could be stuffed into
the buffer even if you didn’t expect them). The encoding scheme for the network
buffer is far simpler to use than SCM_RIGHTS
' ancillary
data. Complexity-wise, there’s far greater chance to introduce a bug in code
related to SCM_RIGHTS
than a bug in the code that parses the network buffer.
Code could be simpler if we only supported messaging strings over, but that would just defer the problem of secure serialization on the user’s back. Code should be simple, but not simpler. By throwing all complexity on the user’s back, the implementation would offer no security. At least we centralized the sensitive object serialization so only one block of code need to be reviewed and audited.
Spawning a new process
UNIX systems allow the userspace to spawn new processes by a fork()
followed
by an exec()
. exec()
really means an executable will be available in the
container, but this assumption doesn’t play nice with our idea of spawning new
actors in an empty container.
What we really want is to to perform a fork followed by no exec()
call. This
approach in itself also has its own problems. exec()
is the only call that
will flush the address space of the running process. If we don’t exec()
then
the new process that was supposed to run untrusted code with no access to system
resources will be able to read all previous memory — memory that will most
likely contain sensitive information that we didn’t want leaked. Other problems
such as threads (supported by the Emilua runtime) would also hinder our ability
to use fork()
without exec()
ing.
One simple approach to solve all these problems is to fork()
at the beginning
of the program so we fork()
before any sensitive information is loaded in the
process' memory. Forking at a well known point also brings other benefits. We
know that no thread has been created yet, so resources such as locks and the
global memory allocator stay in a well defined state. By creating this extra
process before much more extra virtual memory or file descriptor slots in our
process table have been requested, we also make sure that further processes
creation will be cheaper.
└─ emilua program
└─ emilua runtime (supervisor fork()ed near main())
Every time the main process wants to create an actor in a new process, it’ll
defer its job onto the supervisor that was fork()
ed near main()
. An
AF_UNIX
+SOCK_SEQPACKET
socket is used to orchestrate this process. Given the
supervisor is only used to create new processes, it can use blocking APIs that
will simplify the code a lot. The blocking read()
on the socket also means
that it won’t be draining any CPU resources when it’s not needed. Also important
is the threat model here. The main process is not trying to attack the
supervisor process. The supervisor is also trusted and it doesn’t need to run
inside a container. SCM_RIGHTS
handling between the main process and the
supervisor is simplified a lot due to these constraints.
However some care is still needed to setup the supervisor. Each actor will
initially be an exact copy of the supervisor process memory and we want to make
sure that no sensitive data is leaked there. The first thing we do right after
creating the supervisor is collecting any sensitive information that might still
exist in the main process (e.g. argv
and envp
) and instructing the
supervisor process to explicit_bzero()
them. This compromise is not as good as
exec()
would offer, but it’s the best we can do while we limit ourselves to
reasonably portable C code with few assumptions about dynamic/static linkage
against system libraries, and other settings from the host environment.
This problem doesn’t end here. Now that we assume the process memory from the
supervisor contains no sensitive data, we want to keep it that way. It may be
true that every container is assumed as a container that some hacker already
took over (that’s why we’re isolating them, after all), but one container
shouldn’t leak information to another one. In other words, we don’t even want to
load sensitive information regarding the setup of any container from the
supervisor process as that could leak into future containers. The solution here
is to serialize such information (e.g. the init.script
) such that it is only
sent directly to the final process. Another AF_UNIX
+SOCK_SEQPACKET
socket is
used.
Now to the assumptions on the container process. We do assume that it’ll run
code that is potentially dangerous and some hacker might own the container at
some point. However the initial setup does not run arbitrary dangerous code
and it still is part of the trusted computing base. The problem is that we don’t
know whether the init.script
will need to load sensitive information at any
point to perform its job. That’s why we setup the Lua VM that runs init.script
to use a custom allocator that will explicit_bzero()
all allocated memory at
the end. Allocations done by external libraries such as libcap lie outside of
our control, but they rarely matter anyway.
That’s mostly the bulk of our problems and how we handle them. Other problems are summarized in the short list below.
-
SIGCHLD
would be sent to the main process, but we cannot install arbitrary signal handlers in the main process as that’s a property from the application (i.e. signal handling disposition is not a resource owned by the Emilua runtime). The problem was already solved by making the actor a child of the supervisor process. -
We can’t install arbitrary signal handlers in the container process either as that would break every module by bringing different semantics depending on the context where it runs (host/container). To handle PID1 automatically we just fork a new process and forward its signals to the new child.
-
"/proc/self/exe"
is a resource inherited from the main process (i.e. a resource that exists outside the container, so the container is not existing in a completely empty world), and could be exploited in the container.ETXTBSY
will hinder the ability from the container to meddle with"/proc/self/exe"
, andETXTBSY
is guaranteed by the existence of the supervisor process (even if the main process exits, the supervisor will stay alive).
The output from tools such as top
start to become rather cool when you play
with nested containers:
└─ emilua program
└─ emilua runtime (supervisor fork()ed near main())
├─ emilua runtime (PID1 within the new namespace)
│ └─ emilua program
│ └─ emilua runtime (supervisor fork()ed near main())
└─ emilua runtime (PID1 within the new namespace)
└─ emilua program
└─ emilua runtime (supervisor fork()ed near main())
Work lifetime management
For Linux namespaces, PID1 eases our life a lot. As soon as any container starts
to act suspiciously we can safely kill the whole subtree of processes by sending
SIGKILL
to the PID1 that started it.
For FreeBSD’s Capsicum, PD_DAEMON
is not permitted in subprocesses that were
placed into capability mode. If all references to a procdesc file descriptor are
closed, the associated process will be automatically terminated by the kernel.
AF_UNIX
+SOCK_SEQPACKET
sockets are connection-oriented and simplify our work
even further. We shutdown()
the ends of each pair such that they’ll act
unidirectionally just like pipes. When all copies of one end die, the operation
on the other end will abort. The actor API translates to MPSC channels, so we
never ever send the reading end to any container (we only make copies of the
sending end). The kernel will take care of any tricky reference counting
necessary (and SIGKILL
ing PID1 will make sure no unwanted end survives).
The only work left for us to do is pretty much to just orchestrate the internal
concurrency architecture of the runtime (e.g. watch out for blocking
reads). Given that we want to abort reads when all the copies of the sending end
are destroyed, we don’t keep any copy to the sending end in our own
process. Everytime we need to send our address over, we create a new pair of
sockets to send the newly created sending end over. inbox
will unify the
receipt of messages coming from any of these sockets. You can think of each
newly created socket as a new capability. If one capability is revoked, others
remain unaffected.
One good actor could send our address further to a bad actor, and there is no
way to revoke access to the bad actor without also revoking access to the good
actor, but that is in line with capability-based security systems. Access rights
are transitive. In fact, a bad actor could write 0-sized messages over the
AF_UNIX
+SOCK_SEQPACKET
socket to trick us into thinking the channel was
already closed. We’ll happily close the channel and there is no problem
here. The system can happily recover later on (and only this capability is
revoked anyway).
Flow control
The runtime doesn’t schedule any read on the socket unless the user calls
inbox:receive()
. Upon reading a new message the runtime will either wake the
receiving fiber directly, or enqueue the result in a buffer if no receiving
fiber exists at the time (this can happen if the user interrupted the fiber, or
another result arrived and woke the fiber up already). inbox:receive()
won’t
schedule any read on the socket if there’s some result already enqueued in the
buffer.
setns(fd, CLONE_NEWPID)
We don’t offer any helper to spawn a program (i.e. system.spawn()
) within an
existing PID namespace. That’s intentional (although one could still do it
through init.script
). setns(fd, CLONE_NEWPID)
is dangerous. Only exec()
will flush the address space for the process. The window of time that exists
until exec()
is called means that any memory from the previous process could
be read by a compromised container (cf. ptrace(2)).
Tests
A mix of approaches is used to test the implementation.
There’s an unit test for every class of good inputs. There are unit tests for accidental bad inputs that one might try to perform through the Lua API. The unit tests always try to create one scenario for buffered messages and another for immediate delivery of the result.
When support for plugins is enabled, fuzz tests are built as well. The fuzzers are generation-based. One fuzzer will generate good input and test if the program will accept all of them. Another fuzzer will mutate a good input into a bad one (e.g. truncate the message size to attempt a buffer overflow), and check if the program rejects all of them.
There are some other tests as well (e.g. ensure no padding exists between the members of the C struct we send over the wire).