A Response to Recent Claims About Session's Security Architecture

January 16, 2025 / Kee Jefferys

By Kee Jefferys, Session Co-Founder, and Jason Rhinelander, Chief Software Architect, Session Technology Foundation.

We were recently made aware of a blog published by a security researcher that makes a number of claims about Session and supposed flaws in Session’s design and implementation. We, as well as other Session contributors, have now had time to read through the blog and investigate the claims, and we want to give a detailed response to each point raised by the author.

TL;DR: The supposed security issues raised by the researcher are in all cases incorrect or misleading; the claims are based on either a misreading of Session’s code or a misinterpretation of the underlying cryptography involved.

The article makes three key claims regarding Session’s security, and we will provide a detailed response to each of these claims in what follows. Additional gripes that do not have any practical impact on Session users are also addressed, along with other issues raised by the author. 

Update (23/01/2025)

The author has published a second blog, responding to the rebuttal provided here. In the second blog, they provide additional context for the issues originally raised. We have added various updates to this blog in the relevant sections to respond to their updated claims and comments. These updates are tagged with the “Update” heading. 

Our goal is to provide clear, factual information about Session's security for its users and the broader community. This will be our final update to this post, however, we continue to welcome security research and encourage researchers to report issues via Session, email, GitHub, or other channels. It’s always great to have more researchers reviewing and independently verifying Session’s code.

Security Claims

Claim 1: Session Ed25519 keys are generated with insufficient entropy and provide only 64 bits of security 

At a high level, we believe this claim is incorrect, but let us walk through the key generation process in Session and explain why.

In Session, Ed25519 keys are generated by using a source of entropy to produce a 16-byte (128-bit) random value and then concatenating those 16 bytes with another 16 bytes (128 bits) of zeros. This concatenated value, now 32 bytes (256 bits) in total, is then passed as a seed to libsodium to generate an Ed25519 keypair.

Libsodium implements standard Ed25519, which includes a step to hash the seed value with SHA512 before it is used to generate the keypair; this hashing step ensures any correlations in the seed value are destroyed before the result is used to derive the Ed25519 private key.
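As a concrete illustration, here is a minimal sketch of that construction using PHP’s libsodium bindings (Session itself is not written in PHP; this is purely illustrative):

    <?php
    // 128 bits of entropy, padded with 128 bits of zeros to form the 32-byte seed.
    $entropy = random_bytes(16);
    $seed    = $entropy . str_repeat("\0", 16);

    // libsodium hashes the seed with SHA-512 (and clamps the result) before
    // deriving the Ed25519 scalar, as described above.
    $keypair   = sodium_crypto_sign_seed_keypair($seed);
    $publicKey = sodium_crypto_sign_publickey($keypair);
    $secretKey = sodium_crypto_sign_secretkey($keypair);

    echo bin2hex($publicKey), "\n";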

The author claims that this step is irrelevant, and that by knowing that the last 128 bits in the preimage of the hash were initialized to zeros, you can use a modified version of Pollard’s rho with SHA512 and clamping to find the corresponding secret key for a public key in 2^64 queries, reducing security from 128 bits to 64 bits.

The problem with the claim is that Pollard’s rho operates directly on group elements derived from scalars, not on seed values. Even though the seed values have constrained entropy, the mapping from seed to scalar via SHA512 and clamping makes it infeasible to restrict the search space to scalars that correspond to valid (zero-padded) seeds.

Functionally, this means that the modified Pollard’s rho attack the author proposes provides no reduction in the security level of the keys. 

If the described attack were possible, it should be fairly easy to validate the claim by publishing a PoC demonstrating the attack with a smaller amount of entropy to limit computation time. Session developers have been unable to reproduce the author's claimed reduction in security.

Session’s generation of Ed25519 keys using 128 bits of entropy was explicitly identified in Quarkslab’s audit of Session, and Session developers had similar discussions with the Quarkslab team. Ultimately, Quarkslab classified this finding as “low” because, although the approach is non-standard, no practical or theoretical method was found to exploit it.

The reason Session performs Ed25519 key generation this way, with 128 bits of entropy instead of 256 bits, isn't just for fun. It's so that Session can use 13-word mnemonic seed phrases (called Recovery Passwords in Session) instead of 25-word ones. These shorter mnemonic seed phrases are easier for users to write down and save.
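As a rough sanity check of those word counts, assuming the Monero-style 1626-word mnemonic list (12 or 24 data words plus one checksum word) that Session’s recovery phrase format is derived from, the arithmetic works out as follows. This is an illustration of the sizes involved, not Session’s actual encoder:

    <?php
    // ~10.67 bits of entropy per word with a 1626-word list (illustrative only).
    $bitsPerWord = log(1626, 2);

    printf("12 data words encode ~%.1f bits (13 words with the checksum word)\n", 12 * $bitsPerWord);
    printf("24 data words encode ~%.1f bits (25 words with the checksum word)\n", 24 * $bitsPerWord);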

Batch attacks

The author also mentions the possibility of batch attacks. However, it's unclear how you could gather enough public keys to meaningfully reduce the security of Session’s keys. Firstly, the number of Session Account IDs is theoretically constrained by the number of Session users, which is likely to grow, but even the largest messaging applications (e.g. WhatsApp) only have around 3 billion users. Complicating this, there is no simple way to gather the public keys of all users—Session Account IDs are shared out of band, and there is no central location where all Account IDs can be scraped.

Update (23/01/2025)

The author has published proof of concept (PoC) code which attempts to implement an attack on Session-like key constructions. Session developers have had time to review and analyse this PoC and we have a few comments.

Summary

Let’s start at a high level: the first thing to clarify is that the approach implemented in the author’s PoC does not achieve any reduction in security for individual Session accounts. If attacking a single account, the proposed approach offers no advantage over a traditional brute-force attack against a 128-bit key, something that is infeasible with current computing technology.

However, the author’s PoC code is not focused on achieving a security reduction against a single key. Instead, it focuses on achieving a security reduction when an attacker is trying to compromise many keys at the same time and doesn't care which of those keys they crack—what’s traditionally referred to as a multi-target or batch attack. Here, we find that the author's PoC performs worse than a common benchmark algorithm for this type of attack, a linear search. Given this, we calculate a worst-case scenario for a multi-target attack on Session using linear search, finding that on average it would cost tens of thousands of times more energy than the entire planet consumes in one year to compromise even a single Session account—even when the parameters for such an attack are set very favorably for the attacker. The key takeaway is that Session accounts remain secure against both targeted attacks on a specific account and multi-target (batch) attacks.

Note: You can jump ahead to the section “What about a worst-case scenario: A Multi-target attack using linear search?” if you want to skip our detailed analysis of the PoC code.

Specific claims

More specifically, the author claimed in their original blog that they could reduce the security of an individual Session key to 64 bits (i.e. 2^64 operations). In fact, their approach achieves no reduction in security in this case, nor does it achieve the author’s updated claim of a square-root reduction in security against a group of Session keys.

But let's have a look at the PoC code to prove this is the case.

To do this, we can just look at Demo1: it's a small enough search space to run easily, and still has enough variation that we would easily observe a speed-up in cracking compared to the linear search case if there were one. Demo2 and Demo3 are just bigger and slower variations of Demo1. Demo4 and Demo5 are redundant parallel versions, as these types of searches are naturally parallelizable: if I have 1,000 CPU cores available to me, I can easily split the search space into 1,000 pieces and run all the searches at once. For a PoC of a complexity reduction, what needs to be shown is a reduction in the number of steps required, not parallelization.

Assessing Demo1

So, what do we get when we actually run Demo1? We ran it 1,000 times, and collected the reported number of steps of the purported exploit approach, and the reported number of steps of the linear search approach. Here are the cumulative results of those 1,000 runs:

  • Demo1 total steps: 152,709

  • Linear search total steps: 256,474

And so, based on this, there appears to be some reduction in complexity:

  • The Demo1 search takes an average of 153 steps 

  • The linear search takes an average of 256 steps.

It's not the claimed square root reduction by any means, but it does seem to be a reduction. It's also worth noting that the 256 average steps for the linear search is already the square root of the search space. That's because we have 256 keys randomly distributed in a key space of 65,536 possible keys. Also note that it isn't 256 steps to find a specific key, but rather just 256 steps (on average) to find a single match of any of the 256 keys we know to exist.

But let's dig a little deeper, because that's not the end of the story. You see, the PoC search actually stops whenever it encounters any sort of collision, not necessarily a collision with one of the targeted keys. Just how often does that happen?

  • Demo1 steps: 152,709

  • Demo1 false positives: 270

  • Linear search steps: 256,474

And now the numbers become a bit clearer: the linear search count is the number of steps for a process that always succeeds, whereas the Demo1 count is the number of steps for an algorithm that fails 27% of the time. These outputs are measuring fundamentally different things: one an algorithm with a 73% success rate, the other an algorithm with a guaranteed 100% success rate.

If we amended the PoC to start over after such a failure, our average number of steps to find a collision with our target set would be:

152 + .27 * (152 + .27 * (152 + .27 * (152 + ...)))

That is, we have to do the (average) 152 tries and then have to start over from a different search start point 27% of the time, and if we land in that case, we still have to start over again with a 27% chance, and so on. All of that can be rewritten as this infinite geometric series:

    152 + (152 * .27) + (152 * .27^2) + (152 * .27^3) + ...

which, mathematically, equals:

152 ÷ (1 - .27) = 208.

What that means is that the PoC here theoretically requires an average of 208 steps to find a collision with one of the 256 known public keys.
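A quick numerical check of that closed form (a sketch, using the rounded figures from above):

    <?php
    $stepsPerAttempt = 152;   // average steps per attempt (from the Demo1 runs)
    $restartRate     = 0.27;  // fraction of attempts ending in a false positive

    // Sum the geometric series 152 + 152*.27 + 152*.27^2 + ... (50 terms is plenty).
    $total = 0.0;
    for ($i = 0; $i < 50; $i++) {
        $total += $stepsPerAttempt * pow($restartRate, $i);
    }

    printf("series sum: %.1f, closed form: %.1f\n", $total, $stepsPerAttempt / (1 - $restartRate));
    // Both print ~208.2.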

That's still a bit lower than the 256 average steps we're seeing with the linear search, though, so let's dig a bit deeper. The linear search is very simple—it does the following operations:

  1. Start at i = 0

  2. If i times the ristretto255 base point is in our set, we're done.

  3. Otherwise increment i and go back to step 2.

Since this keeps going through every possible public key, it will always "crack" one of the keys because it tries every single possible seed value. As we saw above, the average number of steps this requires to find a match is the size of the key space (65,536) divided by the number of known public keys (256):

65,536 ÷ 256 = 256.

So theoretically, this requires 256 steps on average, and in our random sample of 1000 runs above, we found an average of 256.47—pretty much exactly what we’d expect.

If this seems counterintuitive, just remember that we aren’t searching for one key, we’re searching for any of 256 keys. For any given random seed, the probability that it is one of the 256 target seeds is 256/65,536 = 1/256. With a 1/256 probability of finding a match on each attempt, it’s not surprising that it takes an average of 256 attempts to find one (to be extra pedantic: that is a very slight approximation for a linear search, but for sets this large it is extremely close to exact).
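For reference, here is a minimal sketch of the linear search described above, in the same style as the PoC code (illustrative only; the set of known public keys is assumed to be supplied as binary-string array keys):

    <?php
    // Returns the first scalar i whose public key (i * basepoint) is in $targets,
    // or null if nothing is found within $maxSteps.
    function linearSearch(array $targets, int $maxSteps): ?int
    {
        // Start at i = 1: libsodium rejects the all-zero scalar.
        for ($i = 1; $i < $maxSteps; $i++) {
            $scalar = str_pad(pack('V', $i), 32, "\0");   // i as a 32-byte little-endian scalar
            $pk = sodium_crypto_scalarmult_ristretto255_base($scalar);
            if (isset($targets[$pk])) {
                return $i;   // cracked one of the known keys after i units of work
            }
        }
        return null;
    }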

Units of Work vs Steps

There’s one key thing that’s worth noting here, and that’s what we define as a "step". Ultimately, any sort of key cracking depends on the number of CPU cycles involved in its computation, and so we need to be sure that we are counting equivalent operations. For instance, if some algorithm A takes half as many "steps" as algorithm B, but each step requires twice as much work, then it is no faster.

The expensive operation here—the unit of work that matters most for the cost of the algorithm—is the base point multiplication (the use of ristretto255 doesn't fundamentally change anything; a search in Session's actual Curve25519 public key space requires a similar basepoint calculation). Since the base point multiplication is the single most expensive part of the calculation, it’s what we will consider one unit of work. All the other bits (including the SHA512 hash, in actual Session public key derivations) are tiny compared to the basepoint multiplication, and in some quick benchmarking are less than 2% of the cost of going from a seed to a public key. More than 98% of the computational time required is the basepoint calculation.
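A rough way to check that claim yourself is to time the two operations separately (a sketch; the exact percentages will depend on hardware and bindings):

    <?php
    $iterations = 10000;
    $seed = random_bytes(32);

    // Time the hashing step on its own.
    $start = hrtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        hash('sha512', $seed, true);
    }
    $hashNs = hrtime(true) - $start;

    // Time the basepoint multiplication (preceded by a hash, as in the PoC's iterate()).
    $start = hrtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        sodium_crypto_scalarmult_ristretto255_base(sodium_crypto_generichash($seed));
    }
    $multNs = hrtime(true) - $start;

    printf("hash: %.1f ms, basepoint mult: %.1f ms (hash is ~%.1f%% of the mult cost)\n",
        $hashNs / 1e6, $multNs / 1e6, 100 * $hashNs / $multNs);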

With this in mind, let's consider the PoC provided in Demo1. Here's the actual work of the search (the rest of the code is just dealing with result detection and iteration):

   for ($steps = 1; $steps < $this->getMaxSteps(); ++$steps) {
        [$sk1, $pk1] = $this->iterate($x);
        [$sk2, $pk2] = $this->iterateTwice($y);

Here's the iterate() function:

 public function iterate(string $publicKey): array
    {
        $sk = $this->uniformToBiased($publicKey);
        $pk = sodium_crypto_scalarmult_ristretto255_base(
            sodium_crypto_generichash($sk)
        );
        return [$sk, $pk];
    }

Okay, so there's our basepoint multiplication, i.e. our unit of work.

What about iterateTwice()?

public function iterateTwice(string $publicKey): array
    {
        [,$b] = $this->iterate($publicKey);
        return $this->iterate($b);
    }

This calls iterate two more times! That means in total, each "step" of the PoC search code is performing three units of work, while the linear search is doing one unit of work per step. It’s also worth noting that, with these three units of work, each step of the PoC search code checks two keys, rather than one, and so we should expect half as many steps.

Now let's go back to our "steps" calculations and reconsider them in terms of units of work (i.e. basepoint multiplications).

  • A linear search requires an average of 256 units of work to crack a key, with a 100% success rate.

  • A search in Demo1 requires an average of 152 steps, each of which checks 2 keys, which means Demo1 is actually trying 304 keys, on average, before it finds a match. These 152 steps each require 3 basepoint multiplications, so we arrive at 152 * 3 = 456 units of work, on average, to crack any single key, with a 73% success rate. Extrapolating to restarting from a different start point on every false positive until we get a true positive (i.e. to reach the same 100% success rate as the linear search) gives roughly 625 units of work on average.

Let’s state that more succinctly: the claimed Proof of Concept requires about 2.4 times as much work as a simple linear search to find a collision with one of the targeted keys.
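Putting those Demo1 numbers together as a quick arithmetic sketch:

    <?php
    $linearWork  = 256;    // linear search: 1 unit of work per step, 100% success rate
    $demoSteps   = 152;    // average Demo1 steps per run
    $workPerStep = 3;      // iterate() + iterateTwice() = 3 basepoint multiplications per step
    $successRate = 0.73;   // 27% of runs ended on a false positive

    $demoWork = $demoSteps * $workPerStep / $successRate;   // restart-adjusted units of work

    printf("linear: %d units, Demo1: ~%.0f units, ratio: ~%.1fx\n",
        $linearWork, $demoWork, $demoWork / $linearWork);
    // Prints: linear: 256 units, Demo1: ~625 units, ratio: ~2.4x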

Assessing Demo2

We did the same comparison with the “Demo2” Proof of Concept (though running 100 trials instead of 1000) and observed the following:

  • Demo2 steps: 2,537,765 = 25,377 average steps per run

  • Demo2 false positives: 54

  • Linear search steps: 7,275,324 = 72,753 average steps per run

That is, Demo2 took one-third as many steps, but only succeeded in actually cracking any of the keys 46% of the time. That suggests that, with restarts on false positives, Demo2 would need somewhere around 55,000 steps, on average, to find a match. Remember, though, that each of these steps does 3 times as much work, and so this means 165,000 units of work compared to about 73,000 units of work for the linear search. Keep in mind that these numbers are “noisier” than our Demo1 run due to using only 100 runs instead of 1,000. The linear steps ought to be 65,536, but apparently in these trials we got a bit unlucky. But that means Demo2 is taking somewhere around 2.26 times as much work per cracked key as a linear search, which is pretty close to the same work ratio (2.4) we got in Demo1! So far, both demos appear to be slower versions of a linear search with no asymptotic gain in search complexity.

Assessing Demo3

Demo3 implements an attempt to auto-restart the search upon false positive results, and also cuts the public key size in half. The code as currently published has a fundamental bug in that passing `$allowSelf = false` makes the search take at most two steps before it returns, and as a result it fails to find anything on most attempts to run it. We fixed that problem and ran 10 trials, with the results reported below:

Assessing Demo3 Results

Since we ran only 10 tests, we can expect these results to be fairly noisy; one of those observations looks very much like an outlier—over a million steps for the Demo3 code, and a quite lucky result for the linear search. But even if we take that observation out of our sample, we still get a Demo3 average of ~180,000 steps and a linear search average of ~129,000 steps.

Remember that each step in Demo3 requires three times as much work as each linear step, and so we're looking at more than 4 times as much work to find a result here.

Assessing Demo4 and Demo5

Demo4 simply shows that it is possible to parallelize these operations and does not warrant further discussion. The parallelisation in Demo5 is similarly redundant, and in fact ends up being harmful, by giving misleading results that the author reported in their blog. The problem is this: when this code runs, it runs 128 parallel searches (via GNU Parallel), distributed across CPUs on your system. As soon as any of these 128 jobs finds a result, it then records the number of steps and elapsed time that that job took, and kills all the other jobs without counting their steps at all.

As a result, the reported step values are really not usable, because they aren't the average number of steps required at all; rather, they are what is called a first-order statistic. A first-order statistic, on its own, tells you very little about the average. It's like trying to find the average height of the population by splitting people into groups of 128, measuring the tallest person in each group, and then averaging those heights across groups: this tells you very little about the average height. What's worse, the result is highly sensitive to the size of your group: if we take the tallest person from groups of 1,000, we’re going to get a different result than if we took the tallest person from groups of 10. Worse still, if this script was run on a system with fewer than 128 cores, there are going to be considerable differences in how much CPU time each job actually gets as the system juggles jobs across cores, potentially adding even more noise to the results.
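To illustrate how misleading a first-order statistic can be, here is a small simulation (a sketch, not the PoC itself): it generates step counts for a search with a 1-in-256 success chance per attempt, then compares the true average against the “fastest of 128 parallel jobs” figure.

    <?php
    $p     = 1.0 / 256;     // chance that each attempt hits one of the 256 target keys
    $runs  = 128 * 200;     // 200 "groups" of 128 simulated parallel jobs
    $steps = [];

    for ($i = 0; $i < $runs; $i++) {
        $n = 1;
        while (mt_rand() / mt_getrandmax() > $p) {   // geometric number of attempts
            $n++;
        }
        $steps[] = $n;
    }

    $mean = array_sum($steps) / count($steps);

    // First-order statistic: only record the fastest job out of each group of 128.
    $fastest = [];
    foreach (array_chunk($steps, 128) as $group) {
        $fastest[] = min($group);
    }
    $meanFastest = array_sum($fastest) / count($fastest);

    printf("true mean steps: %.1f, mean of fastest-of-128: %.1f\n", $mean, $meanFastest);
    // Typically prints a true mean of ~256 but a fastest-of-128 mean of only ~2-3.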

We also suspect that the bug we found while trying to run Demo3 could be at play here, seriously underreporting the “poc-steps” values in Demo5, although we have not tried running Demo5 with the bug fixed. The parallelism added here merely obfuscates the results without any asymptotic performance increase: if Demos 1-3 don't show any improvement in complexity, then doing more of the same work in parallel isn't going to either. As a final point on Demo5, it is worth noting in passing just how much slower the PoC code is than a linear search: the average fastest job of the 128 jobs to crack a key (yes, that’s confusing, but the reported times here are a first-order statistic!) took more than 10 times as long as a linear search, despite being implemented similarly in the same programming language. This again suggests there may be a serious bug somewhere in the code or the testing methodology.

What about a worst-case scenario: A Multi-target attack using linear search?

Now that we’ve covered the specific attack proposed in the PoC code, and shown that it’s slower than a multi-target attack using a linear search, we can put the author's implementation to the side. However, it is useful to work out the practical impact of a multi-target attack using linear search on Session keys. 

To reiterate: a multi-target attack achieves no reduction in the operations you need to perform if you are just trying to crack a single targeted Session key. This will cost an average of 2^127 operations. 

A multi-target attack only reduces the number of operations you need to perform if you have a group of public keys and you are happy to compromise any one of those keys without caring which one. That is, when doing a linear search against a set of keys, you have no way to target the specific key you will crack; you only know that it will be a random one of those keys. Importantly, in multi-target attacks, the probability of compromising a random key from the group increases as the size of the group grows, which incentivises an attacker to gather as many public keys as possible. We have already spoken about the practical difficulty of gathering groups of Session public keys (Account IDs) in the Batch attacks section. But to summarise: Account IDs are not stored or accessible in a central location in Session’s architecture, and users typically have only one Account ID, which is not rotated over time unless they create a new Session account.

Nevertheless, let's look at some worst-case scenarios to illustrate how practical or impractical this attack is. Let’s assume Session has more than 4 billion users in the future, larger than the largest userbase of a messaging app (WhatsApp), and an attacker somehow manages to scrape all 4 billion+ users’ public keys. We can use a simple formula to work out the number of operations the attacker would need to perform, on average, to break a single random key. 

Number of operations = 2^k / n

where k is the number of bits in the key and n is the number of public keys the attacker has. Note that this is an approximation that is extremely accurate as long as n is reasonably large, but not if n is very small: that’s why our “crack one single key” number of 2^127 doesn’t follow this formula. Swapping in our hypothesized value of n = 2^32:

2^128 / 2^32

This evaluates to an average of 2^96 operations needed to crack one single random key out of 4 billion+ keys. But that will be a fairly meaningless number to most people, so let's do some math to show how long (and at what cost) these operations would take to run.

How much energy would a multi-target attack using linear search actually require?

Each of these 2^96 operations is a single elliptic curve multiplication (there are some other operations the attacker would need to perform, but we can ignore those for simplicity).

We benchmarked the ability to perform 76,000 Session public key calculations per second per CPU core on a reasonably modern local machine. However, let's assume that you have noticeably faster and more efficient hardware, and were able to get the number of operations up to 200,000 per second per core while getting power usage down to, say, 5 watts. In that case you would need an average of ~3.96×10^23 CPU core-seconds, or 1.1×10^20 CPU core-hours, which at 5 watts per core gives an energy consumption of 550,000,000 terawatt-hours (TWh). For comparison, global electricity consumption is around 24,500 TWh per year. So even if you get millions of these highly power-efficient machines performing elliptic curve operations, you will run into physical limits imposed by global energy production.
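For transparency, here is the back-of-envelope arithmetic behind those numbers (a sketch using the assumed 200,000 operations per second and 5 W per core figures):

    <?php
    $operations   = 2 ** 96;      // average work to crack one random key out of ~2^32
    $opsPerSecond = 200000;       // assumed per-core throughput
    $wattsPerCore = 5;            // assumed per-core power draw

    $coreSeconds = $operations / $opsPerSecond;               // ~3.96e23 core-seconds
    $coreHours   = $coreSeconds / 3600;                       // ~1.1e20 core-hours
    $energyTWh   = $coreSeconds * $wattsPerCore / 3.6e15;     // 1 TWh = 3.6e15 joules

    printf("core-seconds: %.2e\ncore-hours: %.2e\nenergy: %.0f TWh\n",
        $coreSeconds, $coreHours, $energyTWh);
    // ~550,000,000 TWh, versus ~24,500 TWh of global electricity use per year.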

Remember, all of this is just to break one single random and unpredictable Session key out of 4 billion+ keys. The author claims that this is “definitely” within the reach of a powerful adversary that exists today, but given our calculations, this just isn't true.

To be fair, the author was assuming feasibility based on an assumption that the problem could be reduced to its square root (2^64), and such a reduction would indeed be an issue (the hypothetical expected energy usage to crack one random key in that case would drop to 128 GWh). It is, of course, impossible to prove that no such reduction in complexity exists, but to the best of our knowledge, none does.

Claim 2: The validation process for Ed25519 signatures is “in band”

The author has misinterpreted Session code and has missed various validation steps that are performed on the sender's message and identity in other code blocks. Taken as a whole, the message and sender validation process in Session provides no way to spoof the origin or contents of a message.

From first principles, Session is designed so that any user can message you by knowing a single piece of information: your Account ID. There's no public directory for Account IDs. Instead, they have to be shared and confirmed out of band. This was an explicit design choice, because it means Session doesn’t have the same trust-on-first-use (TOFU) issues as Signal or WhatsApp, where a user's phone number or username first needs to be resolved to a public key via a central server that could mount a MITM attack if you don't confirm safety numbers.

The purpose of this signature check in Session is simply to validate that the claimed sender address did indeed sign the message you received, and it achieves that goal: this signature check is where the sender’s identity is proven, which eliminates forgeries.

The public key being validated in this check is the sender’s Ed25519 public key. There is an additional check to ensure that this Ed25519 pubkey is consistent with the sender’s Account ID: since a Session Account ID is the X25519 public key derived from the sender’s Ed25519 key, the X25519 key derived from the Ed25519 pubkey must match the Account ID recorded in the encoded message content.
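A simplified sketch of what those two checks amount to, using PHP’s libsodium bindings (illustrative only, not Session’s actual implementation):

    <?php
    // Verify the message signature against the included Ed25519 key, then check
    // that the X25519 key derived from it matches the claimed Account ID.
    function verifySender(string $message, string $signature,
                          string $senderEd25519Pub, string $claimedAccountId): bool
    {
        // 1. The claimed sender's Ed25519 key must have signed the message.
        if (!sodium_crypto_sign_verify_detached($signature, $message, $senderEd25519Pub)) {
            return false;
        }
        // 2. The Account ID must correspond to that same Ed25519 key.
        $derivedX25519 = sodium_crypto_sign_ed25519_pk_to_curve25519($senderEd25519Pub);
        return hash_equals($derivedX25519, $claimedAccountId);
    }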

In combination, these checks ensure that the Session Account ID that messaged you cannot be spoofed or faked. This is the only property being achieved here, as message encryption is handled separately from this process. 

Update (23/01/2025)

The author states in their updated blog that their second claim never asserted any security impact. However, this claim was listed under the “Security Issues” heading in their original post, together with other issues where the author did make claims about impact. As a result, we felt it necessary to make it clear to Session users reading about a so-called “Security Issue” that this issue has no practical impact.

Claim 3: Session reuses public keys as private keys for symmetric onion requests

This claim is plainly incorrect, because the author has misinterpreted the code. 

The author mentions the encrypt function and links to the wrong overload of that function. The overload actually being called is the one that appears later in the file, with the same name, which takes the public key but also uses an ephemeral, locally generated private key to compute a shared secret. This shared secret, computable only by the request creator and the Service Node, is then passed into the function that the author is linking to.

The code for both encrypt functions, which sit side-by-side in the codebase, can be viewed here.
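In other words, the pattern is standard ephemeral-key ECDH. A minimal sketch of that pattern (illustrative only, not the actual onion request code):

    <?php
    // Stand-in for the Service Node's X25519 public key, which the client already knows.
    $serviceNodeKeypair = sodium_crypto_box_keypair();
    $serviceNodePublic  = sodium_crypto_box_publickey($serviceNodeKeypair);

    // The client generates a fresh ephemeral keypair for this request...
    $ephemeralKeypair = sodium_crypto_box_keypair();
    $ephemeralSecret  = sodium_crypto_box_secretkey($ephemeralKeypair);
    $ephemeralPublic  = sodium_crypto_box_publickey($ephemeralKeypair);

    // ...and the shared secret comes from the ephemeral *private* key and the node's
    // *public* key. The node's public key is never used as a private key.
    $sharedSecret = sodium_crypto_scalarmult($ephemeralSecret, $serviceNodePublic);

    // $sharedSecret (after a KDF) is used for the symmetric encryption of the request,
    // and $ephemeralPublic is sent along so the node can derive the same secret.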

Gripes

Session mnemonic decoding isn’t done in constant-time

Although this finding is correct, it doesn’t have any practical impact. There's no way to remotely force a Session client to run this code and measure the response time, since the code only runs when the user is first shown their mnemonic phrase after generating a new Session account, or when they view their seed phrase manually via the Settings menu.

Performing this operation in non-constant time only presents a risk if the device Session is running on is already compromised in a way that allows a local side-channel attack. However, if the device is compromised, then all bets are off, and it's likely that the attacker can perform far more damaging attacks by reading messages from the local database or exfiltrating encryption keys directly. Session’s threat model, like that of every other end-to-end encrypted messenger, operates on the assumption that your device isn't compromised.

Session uses SecureRandom on Android in an unsafe way

As the author notes, this issue only affects ancient versions of Android—Android 4.4 (KitKat) and below—and has long since been patched by the Android team. The oldest version of Android that Session has ever supported (in Session’s 1.0.0 release) was Android 5 (Lollipop). As of the latest Session Android release (1.20.8), the minimum supported Android version is 8 (Oreo).

Although this is not a security issue, we do agree that Session should be updated to follow the recommended practice on Android, which is to use new SecureRandom().

Other Issues

Beyond the claims addressed above, the author raises a few other issues they have with Session, which we also wanted to cover.

Session’s removal of PFS

A detailed blog post on why Session removed PFS (Perfect Forward Secrecy), and what that means for users can be found here. In essence, the Signal protocol is not the best choice for every architecture and design, and it is important to understand which security properties the Signal protocol does and does not provide, and how similar properties can be provided via other routes. As the blog post outlines, the Session Protocol is far better suited to decentralized infrastructure, and various elements of Session’s design minimized the risks caused by the removal of PFS.

Ability to force Android clients to run an Argon2 KDF

This code is only run when looking up ONS usernames. It is not possible to force other Session clients to perform this Argon2 KDF at will, since ONS lookups are only ever triggered by explicit user action. The KDF is performed to maintain backwards compatibility with old-format ONS records; new ONS records do not use Argon2.

Thinly-veiled accusations of Session being a government-run honeypot 

Not only is this untrue, Session makes every possible effort to remove the risk of government interference in three key ways: developers who work on Session have no privileged access to the Session Network; nodes are run by community operators all around the world; and the protocol and apps are completely open source. If the author had reached out to Session developers before publishing their blog post, we would have been able to clarify the reasoning behind specific design decisions and how those decisions impact security and privacy.
