/InsideReCaptcha

Reverse-engineering the new “captchaless” ReCaptcha system...


Summary

A few days ago, Google introduced a new version of ReCaptcha, theoretically allowing most users to complete it by only ticking a checkbox. If Google doesn't deem the user human, the old version with distorted text appears. Although I used a normal Firefox version, I still had to fill in the text captcha after clicking, so it didn't really work for me. Curiosity led me to look at the JavaScript to find out how all this really works...

What happens on the wire

First, the browser makes the following requests:

  • https://www.google.com/recaptcha/api.js, whose function is mainly to load the next one...
  • https://www.gstatic.com/recaptcha/api2/r20141202135649/recaptcha__en.js, which contains common code.
  • https://apis.google.com/_/scs/apps-static/_/js/ (followed by a bunch of more or less cryptic parameters) which contains other common JavaScript code.

The browser then makes a request to https://www.google.com/recaptcha/api2/anchor, whose response contains the really interesting stuff: a call to a function named recaptcha.anchor.Main.init, which is passed two base64-encoded parameters.

The first parameter points to a JavaScript file: https://www.google.com/js/bg/6yg-ggdQgQAg8SAADJkAjc-JMNnOnYuIGgH_iBV7uf8.js. The second one contains *double-*base64-encoded binary data.
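To make the decoding step concrete, here is a minimal sketch of how the two parameters could be decoded once extracted from the anchor response. The helper names are hypothetical, and the use of the URL-safe alphabet with optional missing padding is an assumption on my part rather than something confirmed above.

```python
import base64


def b64decode_padded(text):
    """Decode base64 text, tolerating missing '=' padding.

    Assumption: the parameters may use the URL-safe alphabet
    ('-' and '_'); urlsafe_b64decode also accepts '+' and '/'.
    """
    if isinstance(text, str):
        text = text.encode("ascii")
    text += b"=" * (-len(text) % 4)  # restore stripped padding, if any
    return base64.urlsafe_b64decode(text)


def double_b64decode(text):
    """Decode a value that was base64-encoded twice (second parameter)."""
    return b64decode_padded(b64decode_padded(text))
```

The first parameter would go through b64decode_padded once to yield the script location, the second through double_b64decode to yield the raw binary data.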

It turned out this new ReCaptcha system is heavily obfuscated, as Google implemented a whole VM in JavaScript with a specific bytecode language.

The first parameter is the bytecode interpreter. After trimming the (function(){eval(' and ')})() wrapper and passing the result to JSBeautifier, I finally dove into this mass of minified code.
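The trimming step can be sketched in a few lines. This is only an illustration: strip_eval_wrapper is a hypothetical helper, and it does not undo any JavaScript string escapes (such as \' or \n) inside the eval argument, which would need to be handled separately before beautifying.

```python
PREFIX = "(function(){eval('"
SUFFIX = "')})()"


def strip_eval_wrapper(source):
    """Return the text between the eval wrapper's prefix and suffix."""
    source = source.strip()
    if source.endswith(";"):
        source = source[:-1]  # tolerate a trailing semicolon
    if not (source.startswith(PREFIX) and source.endswith(SUFFIX)):
        raise ValueError("input does not look like the expected eval wrapper")
    return source[len(PREFIX):-len(SUFFIX)]
```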

The analysis

The interpreter has two entry points: the M function, which is executed when ReCaptcha is loaded, and M.prototype.ha, which is executed when you click the checkbox and returns the information destined for Google's servers.

I first discovered that the bytecode is encrypted using the XTEA algorithm, used as a stream cipher: each 8-byte block is XORed with a keystream block (so the decryption and encryption functions are the same). The keystream block is the XTEA encryption of a 64-bit input whose first 32-bit word is read from the bytecode file and whose second 32-bit word is the position in the bytecode file divided by 8. The key is [0, 0, 0, 0] by default.
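The scheme described above can be sketched as follows. This is a minimal illustration, not the code from decomp.py: the function names are mine, and the round count (32 cycles), the big-endian packing of the keystream words, and passing the file-derived word in as a plain nonce argument are all assumptions.

```python
import struct

MASK = 0xFFFFFFFF
DELTA = 0x9E3779B9


def xtea_encrypt_block(v0, v1, key, rounds=32):
    """Standard XTEA encryption of one block given as two 32-bit words."""
    total = 0
    for _ in range(rounds):
        v0 = (v0 + ((((v1 << 4) ^ (v1 >> 5)) + v1) ^ (total + key[total & 3]))) & MASK
        total = (total + DELTA) & MASK
        v1 = (v1 + ((((v0 << 4) ^ (v0 >> 5)) + v0) ^ (total + key[(total >> 11) & 3]))) & MASK
    return v0, v1


def xor_keystream(data, nonce, key=(0, 0, 0, 0)):
    """XOR data with an XTEA keystream; the same call encrypts and decrypts.

    nonce: the 32-bit word read from the bytecode file (assumed given).
    The second input word is the block index, i.e. byte offset // 8.
    """
    out = bytearray()
    for offset in range(0, len(data), 8):
        k0, k1 = xtea_encrypt_block(nonce, offset // 8, key)
        keystream = struct.pack(">II", k0, k1)  # byte order is an assumption
        chunk = data[offset:offset + 8]
        out.extend(b ^ k for b, k in zip(chunk, keystream))
    return bytes(out)
```

Because only the XTEA encryption direction is ever needed to build the keystream, applying xor_keystream twice with the same nonce and key gives back the original bytes.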

By default... because that would have been too simple: it turns out the bytecode has direct access to the JavaScript variables of its own interpreter, and changes its own decryption key and even its own opcode numbers at many points.

Even niftier, at some points the bytecode key is generated by directly hashing JavaScript code from the interpreter (Function.toString() rocks, doesn't it?), or from the output of browser-specific functions and CSS rules, or from the hostname of the calling domain (www.google.com)...
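To illustrate why hashing the interpreter's own source matters, here is a small sketch. The actual hash used by the interpreter is not described here, so SHA-256 is swapped in purely as a stand-in, and key_from_source is a hypothetical helper. The point is only that any change to the source text, including beautifying it, yields a different key.

```python
import hashlib
import struct


def key_from_source(source_text):
    """Derive four 32-bit words (the shape of an XTEA key) from source text.

    SHA-256 is a stand-in; the real interpreter uses its own hash.
    """
    digest = hashlib.sha256(source_text.encode("utf-8")).digest()
    return struct.unpack(">4I", digest[:16])


# The minified and beautified forms of the same function give different keys.
minified_key = key_from_source("function M(a){return a+1}")
beautified_key = key_from_source("function M(a) {\n    return a + 1\n}")
```

This is why a decompiler has to reproduce the exact text the interpreter would see, or fall back on hardcoded key values.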

After about 2 days of work, I produced a working disassembler and then a decompiler for the ReCaptcha bytecode. You can try it from this GitHub repository. However, it still has some hardcoded key values, so for now it will only work on the bytecode sample contained in the enc file.

Just execute the ./decomp.py file to give it a try; it will output pseudo-JavaScript. xhr1 and xhr2 are byte arrays that contain the data later sent to Google's servers.

Gathered information

Google servers will receive and process, at least, the following information:

  • Plug-ins
  • User-agent
  • Screen resolution
  • Execution time, timezone
  • Number of click/keyboard/touch actions in the <iframe> of the captcha
  • It tests the behavior of many browser-specific functions and CSS rules
  • It checks the rendering of canvas elements
  • Likely cookies server-side (it's executed on the www.google.com domain)
  • And likely other stuff...

You can look at the decompiled bytecode for more details.

This information, along with numeric values hardcoded in the bytecode (forcing a potential bot to read all of it), is sent to the https://www.google.com/recaptcha/api2/frame page. Look at the M.prototype.Q function to see how the encoding is done. Some of the information (the part I call xhr2 in the decompiler, which is retrieved from the this.c[this.g] variable; xhr1 is in this.c[this.d]) is also encrypted with XTEA.

What's next...

We could:

  • Gather statistics about when the checkbox captcha suffices and when it doesn't.
  • Programmatically bypass the captcha by interpreting bytecode.
  • Programmatically bypass the captcha by simply running a rendering engine and automating mouse movements. But that would be slightly less fun.

Cheers and good reversing!