My Experience With the Great Firewall of China

When I recently visited China for the first time, as an InfoSec professional I was very curious to finally be able to poke at the Great Firewall of China with my own hands to see how it works and how easy it is to evade. In short, I was surprised by:

Its high level of sophistication such as its ability to exploit side-channel leaks in TLS (I have evidence it can detect the "TLS within TLS" characteristic of secure web proxies)
How poorly simple Unix security tools fared at evading it
2 of the top 3 commercial VPN providers in China, ExpressVPN and Astrill, use RSA keys so short (1024 bits!) that the Chinese government could factor them [Edit 2016-02-15: acting on my report, these 2 providers retired the short keys and now use 2048- or 4096-bit keys.]

Why evade the GFW?

Most westerners who visit China have a perfectly legitimate reason for evading the GFW: it blocks all Google services. That means no Gmail to access your airline e-ticket, no Hangouts to stay in touch with your family, no Maps to find your hotel, no Drive to access your itinerary document. This was my primary need for evading it.

Before visiting China I prepared myself a bit. On my phone I pinned documents in Drive to access them offline. In Maps I preloaded the locations I was going to visit by zooming in on them to load all the streets and points of interest nearby—the new offline Google Maps feature did not exist at the time. But Maps turned out to be almost unusable anyway: my GPS position was always offset by hundreds of meters from its true location due to the China GPS shift problem. (Google could fix it by using WGS-84 coordinates for their Chinese maps; why have they not done it already?)

Idea 1

So I arrived at my hotel in Beijing, tried to load google.com, and it errored out due to TCP RSTs sent by the GFW to block the connection. My first idea was to set up an SSH SOCKS tunnel (ssh -D) from my laptop to a server colocated in a datacenter in the USA, and I configured Chrome to use it:

$ google-chrome --proxy-server=socks://127.0.0.1:1080
$ ssh -D 1080 my-server

This worked fine for a few minutes. Then severe packet loss, around 70-80%, started occurring. Restarting the tunnel fixed it for a few minutes. But the packet loss eventually returned, affecting all traffic to my server no matter what type: SSH connections, or simple pings. It is not clear why the GFW drops packets. Some say it is to intentionally disrupt VPNs without outright blocking them. Or perhaps the GFW selectively redirects some suspicious packets to a subsystem for deeper inspection and this subsystem is overloaded and unable to cope with all the traffic.

Whatever the reason is, this packet loss made the SOCKS tunnel too slow and unreliable to be usable.

Idea 2

I tried a slightly different approach: running a web proxy (polipo) on my server listening on 127.0.0.1:$port and using SSH port redirection (ssh -L) to access it:

$ google-chrome --proxy-server=127.0.0.1:1234
$ ssh -L 1234:127.0.0.1:$port my-server

Again, this worked fine for a few minutes, but the packet loss returned. The GFW is clearly able to detect and interfere with SSH carrying bulk traffic.

Idea 3

Instead of SSH, why not access the proxy over a TLS connection? This should make it harder for the GFW to detect it since the traffic patterns of a user accessing a proxy over TLS are close to the traffic patterns of a user accessing an HTTPS site.

Making a web proxy available over TLS is what we call a secure web proxy. It is uncommon enough that most browsers do not support it. So I used stunnel to wrap the proxy connection in TLS and to expose an unencrypted proxy endpoint to my laptop.

Of course I had to protect the setup with authentication. But I could not use standard proxy authentication because if the GFW actively connects to it, the "407 Proxy Authentication Required" error would expose it. And I did not want to use TLS client authentication because this might raise a small red flag that this might be some sort of TLS-based VPN. Again I needed to make my secure web proxy endpoint look like and act like a regular HTTPS endpoint as much as possible.

So I wrote a small relay script in Python which listens on $port_a and forwards all connections to another endpoint $host_b:$port_b. The relay can run in 2 modes. In "client mode" (on my laptop) it inserts a 128-bit secret key as the first 16 bytes sent through the connection. In "server mode" (on my server) it verifies this key and only forwards the connection if the key is valid. Otherwise the data is silently discarded, which makes the endpoint look like a non-responsive web server.

The setup looked like this on my laptop:

Browser configured to use proxy on 127.0.0.1:5000
Relay listens on 127.0.0.1:5000, inserts the key, and forwards to 127.0.0.1:5001
stunnel client listens on 127.0.0.1:5001, wraps the connection in TLS, and forwards to my-server:5002

And on the server:

stunnel server listens on my-server:5002, unwraps the connection, and forwards to 127.0.0.1:5003
Relay listens on 127.0.0.1:5003, verifies the key (removes it), and forwards to 127.0.0.1:5004
Web proxy listens on 127.0.0.1:5004
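The key handling done by the relay can be sketched roughly as follows. This is not the original script, only a minimal illustration assuming a 16-byte pre-shared key. The names KEY_LEN, add_key, and check_key are hypothetical, and the socket code that listens, buffers the first 16 bytes, and forwards is left out.

```python
import hmac

KEY_LEN = 16  # 128-bit pre-shared secret (hypothetical constant name)


def add_key(key: bytes, first_chunk: bytes) -> bytes:
    """Client mode: prepend the secret to the first bytes sent upstream."""
    if len(key) != KEY_LEN:
        raise ValueError("key must be exactly 16 bytes")
    return key + first_chunk


def check_key(key: bytes, received: bytes):
    """Server mode: return the payload if the stream starts with the key.

    `received` must hold at least the first 16 bytes of the stream.
    Returns None when the key is missing or wrong; the caller should then
    keep reading and discarding data without ever replying.
    """
    if len(received) < KEY_LEN:
        return None
    candidate, payload = received[:KEY_LEN], received[KEY_LEN:]
    # Constant-time comparison avoids leaking how many leading bytes matched.
    if not hmac.compare_digest(candidate, key):
        return None
    return payload


if __name__ == "__main__":
    # Fixed demo keys for a repeatable run; a real key should be random.
    secret = bytes(range(KEY_LEN))
    wire = add_key(secret, b"GET http://example.com/ HTTP/1.1\r\n\r\n")
    print(check_key(secret, wire))          # request with the key stripped
    print(check_key(bytes(KEY_LEN), wire))  # None: wrong key
```

Because a rejected connection never gets a reply, an active probe learns nothing beyond "the TLS handshake succeeded and then nothing happened".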

Result? This worked well! No packet loss, no problems whatsoever.

What does the GFW see on the wire when browsing an HTTP site through the proxy? A packet capture of "curl --head http://www.google.com" shows this on my system (size of TLS records shown in parentheses):

C: TCP SYN to proxy
S: TCP SYN+ACK reply from proxy
C: TCP ACK
C: ClientHello (86 bytes)
S: ServerHello, Certificate, ServerHelloDone (67+858+9 bytes)
C: ClientKeyExchange, ChangeCipherSpec, encrypted Finished (267+6+53 bytes)
S: NewSessionTicket, ChangeCipherSpec, encrypted Finished (207+6+53 bytes)
C: encrypted ApplicationData #1 (37+197 bytes)
S: encrypted ApplicationData #2 (37+693 bytes)

(Side note: ApplicationData records are split in 2 records, the first one of 37 bytes, because of the 1/n-1 record splitting workaround for BEAST.)
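The record sizes in this capture can be reproduced with a bit of arithmetic. The sketch below assumes TLS 1.0 with a CBC cipher suite using AES-128 and HMAC-SHA1 (a 20-byte MAC, 16-byte blocks, an implicit IV) and minimal padding. The function names are my own, not part of any TLS library.

```python
MAC_LEN = 20     # HMAC-SHA1 tag
BLOCK_LEN = 16   # AES block size
HEADER_LEN = 5   # TLS record header


def record_size(plaintext_len: int) -> int:
    """On-the-wire size of one TLS 1.0 CBC record (implicit IV)."""
    body = plaintext_len + MAC_LEN + 1          # +1 for the padding-length byte
    padded = -(-body // BLOCK_LEN) * BLOCK_LEN  # round up to a whole block
    return HEADER_LEN + padded


def split_record_sizes(plaintext_len: int) -> list:
    """Sizes after 1/n-1 splitting: 1 byte first, then the remaining n-1."""
    if plaintext_len < 2:
        return [record_size(plaintext_len)]
    return [record_size(1), record_size(plaintext_len - 1)]


if __name__ == "__main__":
    print(record_size(16))         # 53, the 16-byte encrypted Finished message
    print(split_record_sizes(35))  # [37, 69]
```

The 1-byte first fragment always comes out at 37 bytes under these assumptions, which is why every ApplicationData line in the capture starts with 37.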

There is a TCP handshake, a TLS handshake, an encrypted ApplicationData record sent by the client of about 200 bytes (the HTTP request), and an encrypted ApplicationData record sent by the server of about 700 bytes (the HTTP response). In fact this TLS exchange and traffic pattern is similar to a non-proxied HTTPS connection, which is why the GFW fails to detect it as an evasion technique.

Unfortunately, as soon as I started browsing HTTPS sites through my proxy, the GFW detected it and hit it with heavy packet loss. How could that be?

Idea 4

When browsing an HTTPS site through a secure proxy there are 2 layers of TLS: the outer TLS connection to the proxy and the inner TLS connection to the site. I theorized that the GFW is able to guess that the encrypted ApplicationData records hide a proxy CONNECT request and another TLS handshake. Here is what a packet capture looks like for "curl --head https://www.google.com" through the proxy:

C: TCP SYN to proxy
S: TCP SYN+ACK reply from proxy
C: TCP ACK
C: ClientHello (86 bytes)
S: ServerHello, Certificate, ServerHelloDone (67+858+9 bytes)
C: ClientKeyExchange, ChangeCipherSpec, encrypted Finished (267+6+53 bytes)
S: NewSessionTicket, ChangeCipherSpec, encrypted Finished (207+6+53 bytes)
C: encrypted ApplicationData #1 (37+197 bytes)
S: encrypted ApplicationData #2 (37+69 bytes)
C: encrypted ApplicationData #3 (37+325 bytes)
S: encrypted ApplicationData #4 (37+3557 bytes)
C: encrypted ApplicationData #5 (37+165 bytes)
S: encrypted ApplicationData #6 (37+85 bytes)
C: encrypted ApplicationData #7 (37+149 bytes)
S: encrypted ApplicationData #8 (37+853 bytes)

To the GFW, these 8 ApplicationData records could look like 4 pairs of HTTP requests and responses in a keep-alive connection. However as research has shown [5] [6], side-channel leaks in TLS can be exploited, for example by looking at packet sizes. Doing so, we can see that they indeed match the expected sizes of the messages exchanged during a CONNECT request and a TLS handshake:

C: encrypted ApplicationData #1 (37+197 bytes):
"CONNECT www.google.com:443 HTTP/1.1\r\nHost:... \r\nUser-Agent:... \r\n\r\n" which is typically 200-300 bytes
S: encrypted ApplicationData #2 (37+69 bytes):
35-byte "HTTP/1.1 200 Tunnel established\r\n\r\n" proxy response. But with 1/n-1 record splitting, a 20-byte SHA-1 MAC per record (my stunnel was using the AES128-SHA cipher suite), padding to align with a 16-byte AES block, and 5 bytes of TLS record header, this translates exactly to a 37-byte and 69-byte record
C: encrypted ApplicationData #3 (37+325 bytes):
ClientHello which is typically 200-300 bytes if it advertises dozens of cipher suites (you may notice the ClientHello in the outer TLS connection is only 86 bytes but that is because my stunnel instances were configured to only allow 1 cipher suite)
S: encrypted ApplicationData #4 (37+3557 bytes):
ServerHello, Certificate, optional ServerKeyExchange, ServerHelloDone, which are typically 1000-4000 bytes combined (space mostly used by the certificate and optional certificate chains)
C: encrypted ApplicationData #5 (37+165 bytes):
ClientKeyExchange, ChangeCipherSpec, encrypted Finished, which are typically 200-300 bytes combined
S: encrypted ApplicationData #6 (37+85 bytes):
optional NewSessionTicket, ChangeCipherSpec, encrypted Finished, which are typically 100-300 bytes combined
C: encrypted ApplicationData #7 (37+149 bytes):
HTTP request
S: encrypted ApplicationData #8 (37+853 bytes):
HTTP response

Specifically, if ApplicationData #2 is very short (it is extremely rare to see an HTTP reply shorter than "HTTP/1.1 200 Tunnel established"), and if ApplicationData #4 is around 1-4kB (certificates + certificate chain), and if ApplicationData #6 is less than 300 bytes (HTTP responses this small are less rare but still uncommon), then the probability of that exchange hiding a CONNECT request and TLS handshake is high.
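Written as an explicit rule, the three checks might look like the sketch below. It is only an illustration of the reasoning, not a claim about how the GFW is implemented. The 100-byte cutoff for "very short" is my own assumption, and the input is the list of server-to-client ApplicationData sizes with the 37-byte split fragments ignored.

```python
SHORT_REPLY_MAX = 100      # assumed cutoff for a "very short" reply (#2)
CERT_RANGE = (1000, 4000)  # #4: ServerHello plus certificate chain
FINISHED_MAX = 300         # #6: ChangeCipherSpec plus Finished


def looks_like_tls_in_tls(server_sizes) -> bool:
    """Check the first three server records against the size pattern."""
    if len(server_sizes) < 3:
        return False
    reply, hello, finished = server_sizes[:3]
    return (reply <= SHORT_REPLY_MAX
            and CERT_RANGE[0] <= hello <= CERT_RANGE[1]
            and finished < FINISHED_MAX)


if __name__ == "__main__":
    # Server record sizes taken from the two captures in this post.
    print(looks_like_tls_in_tls([69, 3557, 85, 853]))  # True: HTTPS via proxy
    print(looks_like_tls_in_tls([693]))                # False: HTTP via proxy
```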

To verify my theory that the GFW exploits these side-channel leaks, I modified the relay script to pad each relayed data block smaller than 1500 bytes to a random length between 1000 and 1500 bytes:

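from random import randint

# Applies only to blocks shorter than 1500 bytes; randint bounds are inclusive
if len_pkt < 1000:
  len_pad = randint(1000 - len_pkt, 1500 - len_pkt)  # padded total: 1000..1500
else:
  len_pad = randint(0, 1500 - len_pkt)  # padded total: len_pkt..1500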

Result? This worked very well! With random padding I was able to browse normally censored HTTP and HTTPS sites for multiple hours without slowdown, without packet loss caused by the GFW.

It was pretty fascinating to test how reliable enabling/disabling the random padding was. I would disable it and the packet loss would return in minutes. I would re-enable it and I could browse for hours. I would disable it again, and the loss would reappear instantly.
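For the padding to be removable, the receiving relay has to know where the real data ends. The pad-length computation above does not show that part, so the sketch below assumes a simple framing of my own: a 4-byte header carrying the payload length and the padding length. The names pad_block and unpad_block are hypothetical, and blocks are assumed to be at most 65535 bytes.

```python
import os
import struct
from random import randint

MIN_LEN, MAX_LEN = 1000, 1500
HEADER = struct.Struct("!HH")  # payload length, padding length (assumed framing)


def pad_block(data: bytes) -> bytes:
    """Frame data as: header, payload, random padding."""
    n = len(data)
    if n > 0xFFFF:
        raise ValueError("block too large for a 16-bit length field")
    if n >= MAX_LEN:
        len_pad = 0  # large blocks are sent unpadded
    elif n < MIN_LEN:
        len_pad = randint(MIN_LEN - n, MAX_LEN - n)
    else:
        len_pad = randint(0, MAX_LEN - n)
    return HEADER.pack(n, len_pad) + data + os.urandom(len_pad)


def unpad_block(frame: bytes) -> bytes:
    """Recover the payload from one complete frame."""
    n, len_pad = HEADER.unpack_from(frame)
    if len(frame) != HEADER.size + n + len_pad:
        raise ValueError("incomplete or malformed frame")
    return frame[HEADER.size:HEADER.size + n]


if __name__ == "__main__":
    frame = pad_block(b"HTTP/1.1 200 Tunnel established\r\n\r\n")
    print(len(frame))          # somewhere between 1004 and 1504
    print(unpad_block(frame))  # the original 35-byte reply
```

With this framing, the 35-byte CONNECT reply and the short Finished messages all leave the relay as blocks of 1000 bytes or more, so the size pattern that gives away the inner handshake is no longer visible.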

I learned through this experience that the GFW is unmistakably able to exploit side-channel leaks in TLS, such as packet sizes, in order to detect the "TLS within TLS" characteristic of secure web proxies. This really surprised me. I had no idea the GFW had reached this level of sophistication.

The next day, the packet loss returned. But if I simply used a different port number for the proxy, everything would continue to work fine for another day or so. I think this time the GFW was not blocking me based on side-channel leaks, but based on network metrics. 100% of the network traffic to/from my server crossing the Chinese border was to my public IP in China, so the GFW probably learned my TCP endpoint was likely used as a private VPN, as opposed to being a public HTTPS site accessed by many client IPs.
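The kind of metric I have in mind is easy to compute from flow records. The sketch below is purely illustrative: I do not know what the GFW actually measures. The function name is hypothetical, and the addresses are reserved documentation ranges.

```python
from collections import defaultdict


def single_client_endpoints(flows):
    """Return server endpoints contacted by exactly one distinct client IP.

    flows is an iterable of (client_ip, server_ip, server_port) tuples.
    """
    clients = defaultdict(set)
    for client_ip, server_ip, server_port in flows:
        clients[(server_ip, server_port)].add(client_ip)
    return sorted(ep for ep, ips in clients.items() if len(ips) == 1)


if __name__ == "__main__":
    flows = [
        ("192.0.2.10", "198.51.100.7", 5002),  # same client, over and over
        ("192.0.2.10", "198.51.100.7", 5002),
        ("192.0.2.10", "203.0.113.5", 443),    # a site with several clients
        ("192.0.2.11", "203.0.113.5", 443),
    ]
    print(single_client_endpoints(flows))  # [('198.51.100.7', 5002)]
```

Changing the port produces a new endpoint with no history, which would be consistent with the setup working again for about a day.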

GFW uses machine learning

None of the information above is new to those familiar with the GFW. It is only after I reached this point in my tests that I did some deeper reading and learned that the GFW uses machine learning algorithms to learn, discover, and block VPNs and proxies.

It all makes sense now: the GFW engineers do not even have to define explicit rules like I described above (if ApplicationData #2 is short, if ApplicationData #4 is around 1-4kB, etc). They train their models using various VPN and proxy setups, and the algorithms learn the characteristics of those connections to identify them automatically.

ExpressVPN

My proxy setup and custom relay script injecting random padding were running on my laptop which I could use at the hotel, and it worked very well. But I also needed a solution for my phone when out on the streets.

I used the commercial service ExpressVPN, which seems to be 1 of the top 3 VPN services used to evade the GFW. It is simple and easy to configure: I installed their Android app and I was up and running in no time. ExpressVPN built their service on OpenVPN and have dozens of VPN servers located in many countries.

However I was not pleased when I saw that the RSA key of their OpenVPN root CA certificate is only 1024 bits! Why, why, why? The Chinese government is one of the archetypal "state-level adversaries" that crypto is supposed to protect us from. This ExpressVPN weakness has been reported and noted multiple times [1] [2].

It is believed that $10 million of specialized hardware can factor 1024-bit RSA keys [3] [4]. There is a high computing cost per key, but if I were China and could factor at least a few RSA keys, surely the root CA key of 1 of the top 3 VPN providers in the country would be one of my targets. Doing so would give them the ability to actively man-in-the-middle ExpressVPN connections and decrypt the traffic. It is possible that China is already doing so and spying on some (all?) ExpressVPN users.

Below is the current ExpressVPN root CA certificate with a 1024-bit RSA key, extracted from the OpenVPN configuration files they distribute to users. Its serial number is 14845239355711109861 (0xce04e28a62cf3ae5) and it is valid from Jul 19 09:36:31 2009 GMT to Jul 17 09:36:31 2019 GMT:

Also, I am confused by the fact the Chinese government allows this well-known VPN provider (and others) to operate freely in the country. They could very easily deploy low-tech ways to block access to the ExpressVPN service, for example by filtering or redirecting the DNS records of their VPN hosts, which is something they do to block certain website hosts. But they do not do so to block ExpressVPN. Why? One possible explanation could be that the Chinese government did factor the ExpressVPN root CA key and does spy on the network traffic of their users, but they prefer to not interfere with ExpressVPN in order to give their users a false sense of privacy. If China blocked the service, users would migrate to other more secure VPN services, and China would lose a SIGINT ability.

Many countries other than China have internet censorship capabilities that rival or surpass the capabilities of the GFW. I would be curious to poke at them too.

[Edit: I am well aware of some open source VPN tools that work quite well in China: ShadowVPN / ShadowSocks (whose developer was recently pressured by Chinese authorities to empty the GitHub repository), Obfsproxy (wiki), Softether, etc. My goal was to find out by trial and error the minimum number of tricks needed to evade the GFW. And I found that a secure web proxy with packet size randomization (idea 4) worked perfectly well to evade it.]

[Edit 2016-01-22: I contacted ExpressVPN about their weak RSA key and they replied: "We agree that the issue you have raised is important, and you're correct in that it has been on our backlog to fix for some time. We've now decided to prioritize the upgrade for the next month". Also, I am told another VPN provider popular in China, Astrill, appears to use weak RSA keys.]

[Edit 2016-01-23: So Astrill is also using OpenVPN. They define 2 root CAs (CN=ASCA, and CN=ASCA2). The second one is 2048-bit, but the first one is 1024-bit. This means an active man-in-the-middle attack could intercept and decrypt all Astrill VPN traffic by deploying malicious OpenVPN servers authenticated by the CN=ASCA certificate. It has serial number 10853689667623641679 (0x96a00d3f5508e24f) and is valid from Oct 6 16:58:51 2010 GMT to Oct 3 16:58:51 2020 GMT:

I contacted Astrill support, we will see what they say.]

[Edit 2016-01-25: The Astrill Chief Security Officer personally emailed me, thanked me for the report and said "Effective today 1024bit cert (ASCA) has been removed from PKI and all clients are required now to use 2048bit cert". Woohoo!]

[Edit 2016-01-26: Official statements were posted by ExpressVPN and Astrill.]

[Edit 2016-02-15: ExpressVPN reported to me they finished upgrading their CA keys from 1024- to 4096-bit keys. Yay!]