The f04cb mystery so far
10 Feb 2016
The f04cb41f154db2f05a4a subreddit remains one of the internet’s biggest mysteries. It has been listed amongst the weirdest, creepiest, most insane subreddits ever, and made an appearance in The Daily Telegraph’s list of the most bizarre subreddits ever created.
In short, the cryptic messages on the subreddit have received a great deal of attention, yet little progress has been made in working out what they might mean. So what do we know so far?
- The posts began back in 2012, carry numeric titles, and consist of lists of eight-character strings such as “RoVdTYF5 ReReSYJ1 TIZ1SYN3 SID5RoJb”
- This format was deviated from just twice, when f04cb posted the plaintext “help” and “please help us”
- Also, one of the posts is simply titled zero
- The subreddit’s background used to feature the Reddit mascot with a bleeding eye, but reverted to the default at some point
Theories as to what the messages might signify ranged from links to a Russian laser gun website, through speculation about leaked data from the Tehran Stock Exchange, to botnet command & control instructions. My initial guess is as good as anybody else’s, but almost all of these ideas looked like nonsense springing out of pareidolia. People were claiming to find patterns in the data that simply weren’t there.
A few concrete findings did emerge, however:
- The post titles are UNIX timestamps, roughly matching the actual post date
- A combination of Base64 decode and Caesar cipher translates the ciphertext into numbers
- The final digit of the timestamp is the Caesar shift value
Still, not everyone agreed that this was the right approach, and the next step to take in the puzzle was totally unclear. At that point, the cryptic posts stopped coming, and with no further progress made, interest in solving the mystery almost completely died out…
…until 20 days ago, when a post following the familiar format appeared. The rebirth of f04cb was echoed over reddit and was noted by a friend, who sent the ciphertext to me. We first read over all pertinent information before attempting to make some headway on this tantalising mystery.
First of all, it was pretty clear that the Caesar + Base64 approach was the right thing to do, particularly because shifting by the last digit of the timestamp unveiled numbers every time. For us, the obvious next step was to quickly hack up some Python to decipher all the posts this way, and then closely inspect the output. First, a Caesar cipher function:
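A minimal sketch of such a function in Python 3, under stated assumptions: the name `caesar_shift` and the `timestamp[-1]` argument follow the description in this post, while the direction of the shift (letters moved back through the alphabet) is an assumption, chosen because it turns the sample string quoted near the top into valid Base64.

```python
import string

def caesar_shift(text, shift):
    """Move ASCII letters back by `shift` places, keeping their case.

    Digits and symbols (including '+', '/' and '=') pass through untouched.
    The backward direction is an assumption, not something the posts state.
    """
    k = int(shift) % 26  # accepts "5" as well as 5, e.g. timestamp[-1]
    lower, upper = string.ascii_lowercase, string.ascii_uppercase
    table = str.maketrans(
        lower + upper,
        lower[-k:] + lower[:-k] + upper[-k:] + upper[:-k],
    )
    return text.translate(table)
```

With a shift of 5, `caesar_shift("RoVdTYF5", 5)` returns `"MjQyOTA5"`, which is Base64 for the digits 242909.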
In the posts, only letters are shifted whilst numbers and symbols remain intact, so we didn’t bother with an ordinal transformation. Instead we hardcoded the key and ensured that uppercase characters remained uppercase for the Base64 decode. A few of the posts made the Python Base64 module complain of ‘Incorrect Padding’, which indicated that the user had manually deleted some of the =’s from the ends of the posts – perhaps to disguise the nature of the ciphertext – although this was inconsistently done. We had to add the missing padding with a useful function:
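A padding helper along these lines is enough, sketched here in Python 3. The names `add_padding` and `decode_numbers` are hypothetical, and the input is assumed to have been Caesar-shifted already.

```python
import base64

def add_padding(ciphertext):
    """Drop whitespace, then restore any '=' stripped from the end."""
    compact = "".join(ciphertext.split())
    # Base64 text must be a multiple of four characters long.
    return compact + "=" * (-len(compact) % 4)

def decode_numbers(ciphertext):
    """Base64-decode an already Caesar-shifted post into its digit string."""
    return base64.b64decode(add_padding(ciphertext)).decode("ascii")
```

A string whose length is one more than a multiple of four cannot be repaired this way and will still fail to decode.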
Applying those two functions to the posts, with timestamp[-1] as the second argument to caesar_shift, decoded them all to numbers of varying lengths. One more thing: f04cb’s comments and sidebar lack timestamp titles, so we needed to brute force the Caesar shift value, although ‘brute force’ is an overstatement when describing trying out just ten numbers:
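The ten-way search can be sketched as follows. This is a self-contained illustration in Python 3: `shift_back` and `brute_force_shift` are hypothetical names, and the backward shift direction is an assumption. A shift counts as a hit only if the Base64 decode succeeds and yields nothing but digits.

```python
import base64
import binascii
import string

def shift_back(text, k):
    """Move ASCII letters back by k places, preserving case (assumed direction)."""
    lower, upper = string.ascii_lowercase, string.ascii_uppercase
    k %= 26
    table = str.maketrans(lower + upper,
                          lower[-k:] + lower[:-k] + upper[-k:] + upper[:-k])
    return text.translate(table)

def brute_force_shift(ciphertext):
    """Return (shift, digits) for every shift 0-9 that decodes to pure digits."""
    compact = "".join(ciphertext.split())  # drop the spaces between groups
    hits = []
    for k in range(10):
        candidate = shift_back(compact, k)
        candidate += "=" * (-len(candidate) % 4)  # restore stripped padding
        try:
            decoded = base64.b64decode(candidate, validate=True)
        except (binascii.Error, ValueError):
            continue
        if decoded.isdigit():  # bytes.isdigit(): non-empty, ASCII digits only
            hits.append((k, decoded.decode("ascii")))
    return hits

print(brute_force_shift("RoVdTYF5 ReReSYJ1 TIZ1SYN3 SID5RoJb"))
# → [(5, '242909333515855527469210')]
```

For the sample string quoted near the top of this post, only a shift of 5 survives the digit check.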
While crawling all of f04cb’s content, we logged the actual timestamp of the posts. This turned out to be a minor item of interest: although it took him or her 174 seconds between generating the first post’s timestamp title and posting to reddit – and over a minute in the next few subsequent posts – the last 4 posts all took less than 8 seconds, with the most recent one a lightning 4 seconds. f04cb is getting faster.
So what we had were about 3500 apparently random numbers, with no patterns to be seen. We tried a few common things on those numbers, such as converting each pair of digits to letters mod 26, which gave us nothing but nonsense.
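The pair-to-letter conversion can be sketched like this. The mapping 0 → a, 1 → b and so on is an assumption (a 1-based alphabet is just as plausible), and `pairs_to_letters` is a hypothetical name.

```python
import string

def pairs_to_letters(digits):
    """Read a digit string two digits at a time and map each pair to a letter.

    Each pair is reduced mod 26 and used as an index into a-z.
    A trailing unpaired digit is ignored.
    """
    pairs = [digits[i:i + 2] for i in range(0, len(digits) - 1, 2)]
    return "".join(string.ascii_lowercase[int(p) % 26] for p in pairs)

print(pairs_to_letters("0001252699"))
# → abzav
```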
It struck us that if the numbers really were randomly distributed, there was not much we could do with them. Knowing that eyes alone cannot be trusted, we performed a chi-square test against the uniform distribution, which found no significant deviation: the numbers are statistically indistinguishable from random.
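As an illustration of that kind of test, here is a minimal chi-square statistic against a uniform distribution in plain Python 3. It assumes the test is run over individual digits, which the description above does not spell out; `chi_square_uniform` is a hypothetical name.

```python
from collections import Counter

def chi_square_uniform(symbols, alphabet):
    """Pearson chi-square statistic of `symbols` against a uniform spread
    over `alphabet`: the sum of (observed - expected)^2 / expected."""
    counts = Counter(symbols)
    expected = len(symbols) / len(alphabet)
    return sum((counts[a] - expected) ** 2 / expected for a in alphabet)

# Ten digit classes give 9 degrees of freedom; the 5% critical value is
# about 16.92, so a statistic below that is consistent with uniformity.
stat = chi_square_uniform("242909333515855527469210", "0123456789")
print(round(stat, 2))
```

A sample as short as the one above is only a demonstration; the test becomes meaningful with the full set of decoded digits.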
Now, this did not necessarily mean that we were at the end of our quest. In cryptography, a completely random ciphertext screams ‘One Time Pad’. The plaintext is found by XORing the ciphertext with a randomly-generated key. The problem is that we don’t know the key. We tried XORing posts with the sidebar text, and with other posts, all to no avail.
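For reference, this is all a one-time pad amounts to in code. The sketch below is generic Python 3, not anything recovered from f04cb: the key is freshly generated random bytes and `xor_bytes` is a hypothetical helper name.

```python
import secrets

def xor_bytes(data, key):
    """XOR two byte strings; the key must be at least as long as the data."""
    if len(key) < len(data):
        raise ValueError("a one-time pad key must cover the whole message")
    return bytes(d ^ k for d, k in zip(data, key))

plaintext = b"please help us"
key = secrets.token_bytes(len(plaintext))  # random, used once
ciphertext = xor_bytes(plaintext, key)

# XOR is its own inverse, so the same key recovers the message.
assert xor_bytes(ciphertext, key) == plaintext
```

Without the key, every plaintext of the same length is equally likely, which is why guessing keys such as the sidebar text is a long shot.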
We read about the VENONA Project, in which US cryptanalysts were able to break Soviet one-time pad traffic, but only because the Soviets reused key material, and with the help of a method called ‘crib-dragging’, which involves painstakingly trying out common phrases until one slots into the correct position. However, since we do not know anything about a hypothetical underlying f04cb plaintext, that method won’t help us. Furthermore, almost all f04cb messages are of different lengths, so looking for some kind of common key would appear a futile effort. If the key were something shorter and guessable, say a word or phrase, then the ciphertext would not be randomly distributed as it is.
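To make crib-dragging concrete, here is a toy sketch in Python 3 with made-up messages and a made-up reused key; none of it comes from the f04cb data. XORing two ciphertexts that share a key cancels the key, leaving the XOR of the two plaintexts, and a correctly placed guess (the crib) then reveals the matching stretch of the other message.

```python
import string

READABLE = set(string.ascii_letters + " .,'")

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def crib_drag(c1, c2, crib):
    """Slide `crib` along c1 XOR c2 and keep offsets that yield readable text."""
    combined = xor_bytes(c1, c2)  # equals p1 XOR p2 if the key was reused
    hits = []
    for offset in range(len(combined) - len(crib) + 1):
        guess = xor_bytes(combined[offset:offset + len(crib)], crib)
        if all(chr(b) in READABLE for b in guess):
            hits.append((offset, guess.decode("ascii")))
    return hits

# Hypothetical demo: two messages encrypted with the same 18-byte key.
key = bytes(range(40, 58))
c1 = xor_bytes(b"please help us now", key)
c2 = xor_bytes(b"the key was reused", key)
print(crib_drag(c1, c2, b"help"))  # includes (7, ' was')
```

The readable-text filter is deliberately crude and will also let through false positives; in practice each hit has to be judged by eye and extended with further guesses.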
Yet the mystery was not over. We got in contact with Reddit user ZtriS, who pointed out an intriguing property of the ciphertext: when converted to hexadecimal, it’s no longer random. We were sceptical at first, so we did a little experiment to check.
First, by reversing the decryption procedure above, we had a program capable of generating ciphertext indistinguishable from that posted by f04cb:
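A generator of this kind can be sketched as below. It is a Python 3 reconstruction under assumptions: random digits are Base64-encoded, letters are shifted forward by the last digit of the title, and the result is split into eight-character groups. The names `shift_forward` and `fake_post` and the example timestamp are hypothetical, and the occasional stripping of trailing ‘=’ seen in the real posts is left out.

```python
import base64
import random
import string

def shift_forward(text, k):
    """Move ASCII letters forward by k places, preserving case."""
    lower, upper = string.ascii_lowercase, string.ascii_uppercase
    k %= 26
    table = str.maketrans(lower + upper,
                          lower[k:] + lower[:k] + upper[k:] + upper[:k])
    return text.translate(table)

def fake_post(n_digits, timestamp, rng=random):
    """Build a (title, body) pair in the f04cb format from random digits."""
    title = str(timestamp)
    digits = "".join(rng.choice(string.digits) for _ in range(n_digits))
    encoded = base64.b64encode(digits.encode("ascii")).decode("ascii")
    shifted = shift_forward(encoded, int(title[-1]))  # key = last title digit
    groups = [shifted[i:i + 8] for i in range(0, len(shifted), 8)]
    return title, " ".join(groups)

# Arbitrary example timestamp; a seeded generator keeps the demo repeatable.
title, body = fake_post(24, 1455062405, random.Random(0))
print(title, body)
```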
Next, we generated an amount of fake data equal in length to the original f04cb data, decrypted both, converted them to hexadecimal, and ran Pearson’s chi-squared test on each. A statistic above 39 indicates non-random data at the 0.999 confidence level, and while our fake data scored 16.3 (as we would expect for random data), the original data scored 126.2. In other words, while the f04cb numbers look random to the human eye, they are in fact far from it.
Indeed, the non-randomness of the decrypted f04cb data in hex is clear from the marked peaks and troughs of its frequency distribution.
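The hex check can be reproduced roughly as follows. How exactly the numbers were converted is not stated here, so this Python 3 sketch assumes each decoded digit string is read as one large decimal integer and written out in base 16. `to_hex` and `hex_chi_square` are hypothetical names.

```python
from collections import Counter

HEX = "0123456789abcdef"

def to_hex(digits):
    """Read a decimal digit string as one integer and write it in base 16.

    Leading zeros in `digits` are lost by this reading. Python 3.11+ also
    rejects decimal strings longer than 4300 digits by default.
    """
    return format(int(digits), "x")

def hex_chi_square(hex_string):
    """Return the per-symbol counts and the chi-square statistic against
    a uniform spread over the sixteen hex digits."""
    counts = Counter(hex_string)
    expected = len(hex_string) / len(HEX)
    stat = sum((counts[h] - expected) ** 2 / expected for h in HEX)
    return counts, stat

counts, stat = hex_chi_square(to_hex("242909333515855527469210"))
for symbol in HEX:
    print(symbol, "#" * counts[symbol])  # crude text histogram
```

Running the same function over genuinely random digits of equal length gives the baseline to compare against.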
But how do we decrypt that hexadecimal data, and is hexadecimal even the right translation to apply? To our knowledge, this is as far as anyone has got in solving the mystery. If you have any ideas about the next step, then please visit this thread.