Better Post Compromise Security for MLS

This blog post reports on some recent research we’ve been doing at Wickr together with cryptographers at NYU and IOHK. We’ve taken a closer look at the current proposal for the Messaging Layer Security (MLS) protocol – an upcoming open standard under development by the IETF’s eponymous working group. I’ll talk about a weakness we’ve identified in the current design. I’ll also describe the fix we came up with to address the issue (currently under consideration for adoption by the MLS workgroup).

The post assumes some passing familiarity with MLS and its goals. In particular the Forward Security and Post Compromise Security properties it aims for. So, if you like to brush up on that check out my previous blog post introducing MLS.

1. Updates & The MLS Key Schedule.

Put simply, the issue with the current MLS proposal we’ve been looking at in this research project concerns its PCS security; namely how effectively/quickly it heals from past compromise.

At a high level, a group of n users in an MLS group chat session share a distributed session state. The state includes 2n (asymmetric) decryption keys and each user knows a particular subset of log(n) of these keys. This structure allows for very efficiently encrypting data to arbitrary subsets of the group. One way this is used is for “healing”. User’s take turns performing a, so called, “update” operation. This involves choosing fresh random values for their log(n) keys and distributing each new value to the other holders of that key.

A bit more precisely, this is done by encrypting each new decryption key to the subset of all users that know the corresponding old value except for the person actually doing the update. In effect, this means that an adversary that managed to learn Alice’s state (including all her decryption keys) will not be able to decrypt any of the new value’s she sends out when she performs her next update. But everyone else who does know the old keys will be able to use one of their other keys to decrypt the new value distributed by Alice. This mechanism forms the core of the PCS guarantee provided by MLS.

The mechanism also provides some FS as the old keys are overwritten whenever they are updated to a new value. In particular, after such an update, a state compromise won’t reveal anything about their old values so older ciphertexts encrypted to them remain secure.

Epochs. We call the period between any two such updates an “epoch”. In other words, an epoch is characterized (amongst other things) by the fixed values of the 2n keypairs in the session’s distributed sate. Each epoch state also includes a secret value known to all current group members called the “epoch secret”. This is used to deterministically derive a whole host of other (symmetric) keys, nonces and other values. For example, from this epoch secret, users can derive a different symmetric key / nonce pair for each text message sent by each user during the epoch. This deterministic key schedule, rooted at the epoch secret, allows parties to encrypt and send application messages during the epoch without the need for any further synchronization between them. Moreover (just like the symmetric ratchets in a double-ratchet protocol) the key schedule allows receivers to always immediately decrypt incoming messages for the epoch regardless of the order in which they were received and while also maintaining FS for any previously delivered messages.

Of course, as a side effect of these features, the epoch secret is immensely valuable to an attacker; knowing it means knowing all decryption keys for that epoch! Thus, MLS defines an additional mechanism making it harder to derive epoch secrets. An updated by a user results in a packet containing a set of ciphertexts being sent out to all other group members. The protocol guarantees that all group members except for the sender can decrypt at least one of those ciphertexts. Moreover, decrypting any one of those ciphertexts reveals a secret value that permits the receiver to derive the so called “update secret” for the epoch. (The sender trivially knows the update secret as they are the one doing the encrypting in the first place.) Now, to make harder for the adversary to obtain epoch secrets, MLS defines epoch secrets to depend not just on that epoch’s update secret, but also (indirectly) on the previous epoch’s epoch secret. Specifically, from each epoch secret users deterministically derive an “init secret” for that epoch. Once the next epoch is started (i.e. upon the next update) the new epoch secret is the result of an HKDF call that includes both the new update secret and the previous epoch’s init secret as input.

Summarizing from the attacker’s perspective, this means that to compromise all messages sent during a target epoch it’s enough to learn that corresponding epoch secret. Moreover, one way to learn that secret is to learn the target epoch’s update secret as well as the previous epoch’s init secret.

2. PCFS: The Problem with MLS.

Suppose, now that Alice’s state has leaked and that she has just performed an update initiating new epoch j. (For simplicity, we’re assuming here that only Alice was compromised. If that’s not the case then the rest of this discussion applies to the last of the corrupted parties to perform an update.)

Beginning from when Alice’s state was leak and running all the way up to when Alice’s performs her update, the attacker can decrypt essentially anything Alice can meaning there can be no expectation of privacy during this time period. Moreover, Alice’s leaked state must have included the current init secret for that epoch (as she herself needed it to compute the next epoch secret). So, it’s fair to assume the attacker too was able to compute the init secrets all the way up to epoch j-1 (just as Alice can). Now, we said that the update packet starting epoch j was built by Alice in such a way that none of the decryption keys she new pre-update can be used to process the update packet. That means, the adversary does not have the means to derive the new update secret and so cannot compute the new epoch secret either. Thus, at this point we would like to be able to claim that security (e.g. privacy) for all messages sent during epoch j (and later) have been restored; at least until the next compromise.

Recall that, a basic property we expect from any secure message protocol is Forward Secrecy. That is all previously delivered messages should remain private regardless of future (cryptographic) state leakage. Intuitively, if the adversary couldn’t already decrypt the message, then compromising the receiver (after they received the message) shouldn’t help the adversary do so. In the above example, we could reasonably hope for FS for messages sent during epoch j. After all, we established that the adversary cannot recover the epoch secret so without further compromise all encryption keys used in that epoch remain unknown to the adversary.

However, the FS security notion does allow the adversary to compromise other group members after they have received a target ciphertext sent during epoch j. Now, the symmetric key schedule for epoch j (rooted at it’s epoch secret) ensures that, as soon as a ciphertext is decrypted the corresponding decryption key is immediately deleted and can no longer be derived from the receiver’s state. So, at first glance, it looks like we do indeed have FS for epoch j.

But that’s not the whole story. Another way to reproduce all decryption keys for epoch j is for the adversary to learn the update secret distributed in Alice’s update packet. Combined with the previous epoch’s init secret (which the adversary already knows) this allows the adversary to recompute the epoch secret and thus the entire symmetric key schedule for epoch j! How can the adversary learn the update secret? Well, if they manage to leak a state that includes an asymmetric decryption key which can process Alice’s update packet (i.e. decrypt one of its ciphertexts) then the adversary could do just that. In other words, if ever the adversary learns such an asymmetric decryption key then privacy of (all application messages sent during) epoch j is retroactively lost.

We’ll call any such asymmetric decryption key “critical” for epoch j if knowing the key allows the adversary to recover the epoch secret for epoch j. In particular, FS for epoch j can only kick in once all critical keys have been deleted from all group members states.

Too Many Critical Keys & Too Slowly Refreshed. Put succinctly, our analysis of MLS showed that that there are really a surprisingly large number of critical keys lying around in people’s states. Worse, under normal conditions it will likely take an unpleasantly long time before they are all replaced in the group state (and thus deleted from everyone’s local states). What this amounts to, is that once a single compromise has occurred, an MLS session could well take an unreasonably long time to recover FS for subsequent messages even though the compromised party has updated. What’s worse, it’s not all that difficult for an attacker observing only the encrypted network traffic to determine who still has local copy of the critical keys for any given target epoch. This means its all the easier for an attacker to spend their, presumably limited, resources effectively.

Put simply, we observe the following fact about the 2n asymmetric keys in the target epoch’s distributed state. An MLS update packet contains log(n) ciphertexts each encrypted to a different one of those keys. Decrypting any single ciphertext reveals the update secret of the target epoch. Thus, we have at least log(n) “primary” critical keys.

Consider the primary critical key. It turns out that all (but one) of those keys was distributed to other group members as part of some previous update packet. What’s more, some of the keys that could be used to process that previous update packet also allow for recovering the primary critical key distributed in that upate. We’ll call such keys “secondary” critical keys for epoch j because learning them also permits the adversary to recover the update secret for epoch j by first recovering a primary critical key. And this process continues. Most of the “secondary” critical keys were themselves distributed via even early updates. Some of the keys that can process those updates also allow recovering the secondary critical key distributed in those updates making these “tertiary” critical keys. It turns out that this recursion can be repeated up to log(n) making for a total of roughly n critical keys in the distributed group state during epoch j.

The issue is compounded by the fact that most of the asymmetric keys are known to only a very small subset of group members. In other words only very members have the ability to trigger a key refresh. That’s because the only mechanism MLS provides to replace keys is via and updates by a group member that knows the corresponding secret key. To see why this is a problem not that, half the keys are each only known by a single party. A further subset of n/2 keys is each known only to 2 parties. Another 8th of keys are each known to 4 parties, etc. So, if a critical key ends up in one of these subsets (which most do) then we can expect the critical key to remain part of the group state across many (at least O(n)) epochs! Hardly an ideal scenario in terms of achieving FS for our target epoch.

3. A Second Key-Update Mechanism: Updateable Public Key Encryption.

To help get faster FS, we propose a modification to the MLS protocol which introduces a second mechanism through which the asymmetric keys are refreshed. As a critical building block, we use a new type of asymmetric encryption which we call Updateable Public Key Encryption (UPKE). It’s tightly related to the similar concept in this paper on the double-ratchet paradigm.

Basically, UPKE is syntactically similar to standard PKE (like the HPKE scheme currently used by MLS) except that encryption/decryption also re-randomizes the PK/SK, respectively. In more detail, just as in standard PKE, the encryption algorithm takes as input a public key and a message. However, along with the usual ciphertext it also outputs an “updated” version of the public key (meant to replace the previous version of the key). Similarly, the decryption algorithm outputs not just the message but also the “updated” secret key. UPKE guarantees that this key update mechanism is homomorphic in the sense that as long as the original PK and SK formed a valid key pair then so do the updated versions. In an upcoming blog post I’ll talk more about this primitive, covering things like how it can be implemented based on various popular elliptic curve groups (e.g. P-251 and X25519) based on standard libraries (e.g. libsodium, OpenSSL, etc.) and what other practical applications it has outside of MLS such as for hierarchical cryptocurrency wallets and in Tor. For now though, I’ll stick to how we use it to improve on MLS.

Basically, the idea is simply to replace standard public key encryption (specifically HPKE) with a UPKE. (More practically speaking we actually propose an updateable version of HPKE.) The key observation here is that this way, as soon as an update packet is received the simple act of processing it automatically results in replacing the primary critical key that was just used. Clearly, this already helps us because at least that one critical key has now been removed from the group state without the need for anyone to perform another update.

In fact, it actually goes a long way towards improving things. So much so, that we could show that no protocol can recover FS faster in a setting where all parties receive update packets in the same order (though not necessarily at the same time). This is a particular interesting scenario because MLS already requires the delivery service to ensure just such a global ordering on updates as its also required to keep group members local states coherent with each other. (It’s worth noting though that the global ordering assumption is not required for the security of MLS, at least beyond mounting DOS attacks.) In other words, we prove that under this global ordering assumption the new MLS variant is optimal in terms of how fast it recovers FS.

To get an idea for why note that asymmetric keys are distributed in such a way by MLS that for any given update each member has (at most) a single primary critical key with which they can process that update. What about the secondary critical keys? Well this invariant applies recursively to the update that distributed the primary recursive key. That is that party had (at most) a single secondary critical key. What’s more, it was immediately deleted by the UPKE decryption process when that previous update was processed. The same argument holds for the tertiary and all other critical keys. So basically, at the start of each epoch we already see that the only critical key (namely a primary one) is left in the group state and it is removed as soon as the key holders process the update starting the epoch.

4. Where to Find More About This

A great place to find lots more about this project is on its Project Page over at Wickr’s Crypto Research website. I also gave a talk about these and related results at Real World Crypto 2020 which took place in NYC between January 8-10. You can find the slides for that talk here.

If you’d like to know more about the hardcore technical details of this research, UPKE and our new variant of MLS, I suggest having a look at the full version of our writeup over on eprint. In it, you’ll also find a bunch of other results around MLS and the new construction not discussed here. Things like new formal security definitions and proofs for the core part of MLS – called TreeKEM – which encompasses the update mechanism and the management of the asymmetric keys in the group state in MLS.

Finally, keep an eye out for my upcoming (and past) blog posts on UPKE, MLS and other crypto related topics.