Nginx resolver vulnerabilities allow cache poisoning attack

Never configure nginx with the resolver directive pointing to a resolver on the Internet like Google Public DNS, OpenDNS, or your ISP’s resolver. Many nginx users make this exact mistake. Even pointing to a resolver on your internal local network may be a bad idea. Using a resolver on localhost (resolver 127.0.0.1) is the only safe option, and it mitigates all vulnerabilities documented in this post.

[Edit 11 January 2018: 1.5 years after reporting these vulnerabilities to the nginx developers, they still refuse to fix issues 2, 3, and 6. They did fix issues 1, 4, and 5, but did so without publishing a security advisory.]

I discovered that nginx’s stub resolver not only generates non-random or predictable DNS txids, but that each nginx worker process also reuses the same UDP source port for every DNS query.

Nginx on Windows fails to seed its PRNG, causing txids to always follow the same sequence: 41, 18467, 6334, 26500, 19169, 15724…

Nginx on Linux (GNU C library) generates txids that can be predicted with a short script.

These flaws allow someone to send spoofed DNS replies to poison the resolver cache, causing nginx to proxy HTTP requests to an arbitrary upstream server chosen by the attacker, potentially serving malicious content to browsers.

Strangely, the nginx developers do not consider any of these problems a security issue. Following my report to security-alert@nginx.org, they patched some issues, refused to patch others, and did not publish a security advisory. So I feel compelled to disclose my findings publicly to inform nginx users of the risks they may be exposed to.

Details
Attack scenarios
- Scenario 1
- Scenario 2
Vendor response
Vulnerable nginx versions
Email 1
Email 2
Email 3
Email 4
Email 5
Email 6
Email 7

Details

Nginx generates DNS txids using ngx_random() which is a macro for random() on Linux/Unix, and rand() on Windows.

Issue 1. On Windows, nginx never seeds the PRNG via srand(), so the C library defaults to a seed of 1, causing the txids to be non-random and to always follow the same sequence: 41, 18467, 6334, 26500, 19169, 15724…

Issue 2. On Linux, relying on random() makes txids predictable because, under certain configurations, nginx leaks random() values to remote users: through SMTP/POP3/IMAP CRAM-MD5 salts and through POP3 APOP timestamps (when ngx_mail_core_module is enabled), or through the $request_id variable sometimes exposed to clients via headers (when nginx is not built with ngx_http_ssl_module)¹. And as explained in scenario 2, with the GNU C library implementation, knowing 31 consecutive random() values allows predicting 2 possible values for the next random() call. This information can be obtained, for example, by sending 31 commands AUTH CRAM-MD5 to nginx’s SMTP endpoint.

Issue 3. On all platforms, each worker process in nginx lets the OS pick a random UDP source port to send DNS queries. The worker processes will then reuse this same port over and over for all queries, until nginx is restarted. Since guessing the txid is possible, an attacker only needs to brute force the source port in order to poison nginx’s resolver cache. The OS TCP/IP stack typically picks ports from a range of 28232 ports on Linux, and 16384 ports on Windows. Identifying this port is a one-time effort for the attacker. Subsequently, he can conduct cache poisoning attacks with a single spoofed DNS reply because the port is known and the txid predictable.
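
Issue 1 is easy to reproduce without a Windows machine. The sketch below assumes that rand() in Microsoft's C runtime is the linear congruential generator usually documented for it (multiplier 214013, increment 2531011, output taken from bits 16 to 30 of the state). Those constants are an assumption that comes from neither nginx nor this post, and msvc_rand is a hypothetical helper name.

```python
from itertools import islice


def msvc_rand(seed=1):
    """Mimic the LCG commonly documented for Microsoft's rand().

    A seed of 1 is what the C library uses when srand() is never called.
    """
    state = seed
    while True:
        state = (state * 214013 + 2531011) & 0xFFFFFFFF
        # Only 15 bits of the state are returned to the caller.
        yield (state >> 16) & 0x7FFF


# The first six values, i.e. the first six txids of an unseeded process.
print(list(islice(msvc_rand(), 6)))  # [41, 18467, 6334, 26500, 19169, 15724]
```

Under that assumption, the output matches the txid sequence listed in issue 1.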

Other issues—which are more minor—affect nginx:

Issue 4. On Linux/Unix with master_process off (not the default; this is a development setting that is normally not used), the PRNG is seeded with srandom(ngx_time()). However, the time in seconds at which nginx was started can typically be guessed from the host’s uptime (if nginx is started at boot time), which is disclosed through TCP timestamps.

Issue 5. On Linux/Unix, with master_process on (default), the process ID is XOR’d into the seed, but the PID typically has low entropy (~10 bits) if nginx is started during boot.

Issue 6. On Windows, rand() limits the entropy of DNS txids to 15 bits, not 16 bits (RAND_MAX is 0x7fff).
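
To give a feel for how small the seed space in issues 4 and 5 is, here is a minimal sketch that enumerates candidate seeds. The seed expression mirrors the srandom((ngx_pid << 16) ^ ngx_time()) call quoted in Email 1. The PID ceiling of 2000, the 5 minute uncertainty on the start time, and the start time itself are illustrative assumptions, and candidate_seeds is a hypothetical helper.

```python
import math


def candidate_seeds(start_guess, window_s, max_pid):
    """Yield (pid, start_time, seed) for every plausible PID and start second.

    Mirrors srandom((ngx_pid << 16) ^ ngx_time()), truncated to 32 bits
    because srandom() takes an unsigned int.
    """
    for t in range(start_guess - window_s, start_guess + window_s + 1):
        for pid in range(1, max_pid + 1):
            yield pid, t, ((pid << 16) ^ t) & 0xFFFFFFFF


START_GUESS = 1471982831  # assumed start time (seconds since Epoch)
WINDOW_S = 300            # assumed +/- 5 minutes of uncertainty
MAX_PID = 2000            # assumed upper bound on the PID at boot

total = sum(1 for _ in candidate_seeds(START_GUESS, WINDOW_S, MAX_PID))
print(f"master_process on:  {total} candidate seeds (~{math.log2(total):.1f} bits)")
print(f"master_process off: {2 * WINDOW_S + 1} candidate seeds (time only)")
```

With these assumed numbers the search space is about 1.2 million seeds with master_process on, and only 601 with master_process off. Either is small enough to enumerate offline.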

Attack scenarios

First, the requirements for successfully poisoning nginx’s DNS cache are that the attacker needs to:

- know which hostnames nginx attempts to resolve, and
- know which resolver nginx is configured to use (e.g. the attacker could guess 8.8.8.8 or 8.8.4.4 since Google Public DNS is a popular choice).

Some additional information or configuration may help, as described in the scenarios below, but is not necessary.

Scenario 1

This scenario demonstrates how to exploit issue 1.

Consider nginx running on Windows with the following nginx.conf:

# nginx.conf
http {
  server {
    listen       9980;
    server_name  frontend.example.com;
    resolver     8.8.8.8;
    location / {
      set $target http://backend.example.com;
      proxy_pass $target;
    }
  }
}

If the nginx instance is relatively busy (e.g. it handles one request per second or more, which keeps nginx re-resolving the backend name every time the cached entry expires), then:

1. The attacker looks up the backend’s hostname and sees a TTL of 1 hour.
2. Now he can determine that when nginx is started and proxies its first request at t=0 it will issue a DNS query with txid=41, when the TTL expires at t=1h nginx will issue a DNS query with txid=18467, etc. This is because, as I explained earlier, on Windows nginx fails to seed the PRNG, therefore the same sequence of txids will always be generated.
3. The attacker determines when nginx was started to know when it will send its next DNS query and with which txid. For example he uses TCP timestamps to determine the host’s uptime, and assumes nginx was started at boot.
4. The attacker waits for the backend IP address to expire from nginx’s cache. For example this may happen at t=100h, so the attacker knows nginx will use the 101st random value in the sequence of values returned by rand(). When the entry has expired from the cache, nginx queries 8.8.8.8, which may take a few hundred milliseconds to reply. At the same time the attacker makes his first poisoning attempt by blindly sending multiple spoofed DNS replies with the predicted txid to various UDP ports. He may be able to send, say, 1000 spoofed replies per poisoning attempt.
5. The first poisoning attempt will likely fail because Windows selects a source port from a range of 16384 ports (49152–65535). So if he fails, the attacker tries again the next time the backend IP address expires from nginx’s cache, effectively giving him an unlimited number of tries. If the TTL is 1 hour and if he sends 1000 spoofed replies per attempt, on average the attack will succeed after ~16 poisoning attempts conducted over a period of 16 hours.
6. Nginx will then proxy all requests to a server of the attacker’s choice until the TTL expires. The next poisoning attempt will be much simpler because at this point the attacker knows which set of 1000 ports is valid. He could even precisely determine which port was the correct one,² in which case the next poisoning attempt can be conducted with a single packet sent to the known port, using the predicted txid.
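
The ~16 attempts figure can be checked with a few lines of arithmetic. This is only a sketch under stated assumptions: the txid prediction is correct on every attempt, and each attempt covers 1000 distinct ports chosen independently of earlier attempts, so the number of attempts follows a geometric distribution. The function names are hypothetical.

```python
import math


def expected_attempts(port_range, replies_per_attempt):
    """Mean of a geometric distribution with p = replies_per_attempt / port_range."""
    return port_range / replies_per_attempt


def worst_case_sweep(port_range, replies_per_attempt):
    """Upper bound if the attacker never re-tries a port he already covered."""
    return math.ceil(port_range / replies_per_attempt)


WINDOWS_PORTS = 65535 - 49152 + 1  # 16384 ephemeral ports

print(expected_attempts(WINDOWS_PORTS, 1000))  # 16.384
print(worst_case_sweep(WINDOWS_PORTS, 1000))   # 17
```

An attacker who sweeps the port range methodically instead of picking ports at random would need at most 17 attempts, so the one-time cost of finding the port stays well under a day with a 1-hour TTL.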

As of 24 August 2016, this attack works on all nginx versions released so far. After my report, it was patched on 04 August 2016 in the development branch, so the upcoming release 1.11.4 should include the fix. However, the nginx developers do not consider it a security vulnerability, despite the obvious risk.

Scenario 2

This scenario demonstrates how to exploit issue 2 on Linux.

It requires nginx to be configured as an SMTP or POP3 or IMAP endpoint with CRAM-MD5 authentication enabled (not the default), or to be configured with the POP3 APOP authentication (not the default), or to be configured to disclose $request_id to clients while not being compiled with ngx_http_ssl_module. These are three mechanisms by which random() values leak to a remote attacker, allowing him to predict future txids.

In the case of APOP, the random() value is disclosed in the APOP timestamp in the POP3 greeting banner. Here I instead demonstrate nginx running with an SMTP endpoint and CRAM-MD5 authentication:

# nginx.conf
http {
  server {
    listen       9980;
    server_name  frontend.example.com;
    resolver     8.8.8.8;
    location / {
      set $target http://backend.example.com;
      proxy_pass $target;
    }
  }
}
mail {
  server_name    my-name;
  auth_http      http://localhost/;
  smtp_auth      cram-md5;
  server {
    listen   1025;
    protocol smtp;
  }
}

With this configuration, nginx leaks random() values through CRAM-MD5 salts used in SMTP/POP3/IMAP authentication:

$ telnet localhost 1025
[...]
220 my-name ESMTP ready
HELO foo
250 my-name
AUTH CRAM-MD5
334 PDE2OTM3OTc3NS4xNDcxOTgyODMxQG15LW5hbWU+

$ echo PDE2OTM3OTc3NS4xNDcxOTgyODMxQG15LW5hbWU+ | openssl base64 -d
<169379775.1471982831@my-name>
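
The same decoding can be sketched in Python, which is convenient when many challenges have to be collected. The regular expression simply follows the shape of the decoded challenge shown above; random_value_from_challenge is a hypothetical helper name.

```python
import base64
import re


def random_value_from_challenge(b64_challenge):
    """Decode a CRAM-MD5 challenge and return its leading integer."""
    decoded = base64.b64decode(b64_challenge).decode("ascii")
    match = re.fullmatch(r"<(\d+)\.(\d+)@(.+)>", decoded)
    if match is None:
        raise ValueError(f"unexpected challenge format: {decoded!r}")
    return int(match.group(1))


print(random_value_from_challenge("PDE2OTM3OTc3NS4xNDcxOTgyODMxQG15LW5hbWU+"))
# 169379775
```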

The integer 169379775 in <169379775.1471982831@my-name> is the value returned by random(). The GNU C library implementation of this function allows predicting future return values when knowing 31 consecutive ones. I wrote a demonstration Python script cram-md5-predict-random.py which connects to nginx’s SMTP endpoint, fetches 31 CRAM-MD5 salts, and calculates 2 possible values for the next random() number. One of the values has a 75% chance of being correct, the other 25%.³ It also prints 2 possible DNS txids which are the same values but masked to 16 bits:

$ ./cram-md5-predict-random.py 
Connecting to localhost:1025...
[...]
Next random() value will be: 693589235 (75% prob.) or 693589236 (25% prob.)
Next DNS txid will be: 21747 or 21748
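
The prediction rests on a simple property of the default glibc random(): each internal word is the sum of the words produced 3 and 31 calls earlier, and the caller receives that word shifted right by one bit. The sketch below is not cram-md5-predict-random.py. It is a minimal, self-contained illustration in which a pure Python approximation of the generator (glibc_random, a hypothetical helper written from the commonly described algorithm) stands in for values that would be collected from CRAM-MD5 salts.

```python
def glibc_random(seed=1):
    """Approximate glibc's default random(): additive feedback, lags 3 and 31."""
    if seed == 0:
        seed = 1  # glibc treats a seed of 0 as 1
    r = [seed]
    for i in range(1, 31):
        r.append((16807 * r[i - 1]) % 2147483647)
    for i in range(31, 34):
        r.append(r[i - 31])
    for i in range(34, 344):  # warm-up values, discarded
        r.append((r[i - 31] + r[i - 3]) & 0xFFFFFFFF)
    state = r[-31:]
    while True:
        value = (state[0] + state[28]) & 0xFFFFFFFF
        state.append(value)
        state.pop(0)
        yield value >> 1  # the low bit is dropped before returning


def predict_next(last_31):
    """Return (likely, unlikely) candidates for the next output.

    The dropped low bits produce a carry only when both are 1,
    hence the roughly 75% / 25% split.
    """
    if len(last_31) != 31:
        raise ValueError("need exactly 31 consecutive outputs")
    base = (last_31[0] + last_31[28]) & 0x7FFFFFFF
    return base, (base + 1) & 0x7FFFFFFF


gen = glibc_random(seed=12345)            # arbitrary seed for the demo
observed = [next(gen) for _ in range(31)]
likely, unlikely = predict_next(observed)
print(next(gen) in (likely, unlikely))     # True
print(likely & 0xFFFF, unlikely & 0xFFFF)  # the two candidate txids
```

Masking with 0xFFFF is the same reduction that turns 693589235 into the txid 21747 in the output above.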

Next we can verify this txid prediction by requesting http://localhost:9980 in a browser while running a packet capture of nginx attempting to resolve backend.example.com.

$ tcpdump -n "port 53 and host 8.8.8.8"
[...]
15:15:31.419931 IP 10.110.0.117.44335 > 8.8.8.8.53: 21747+ A? backend.example.com. (37)
15:15:31.527168 IP 8.8.8.8.53 > 10.110.0.117.44335: 21747 NXDomain 0/1/0 (94)

The txid is 21747 as predicted!

The steps an attacker would take to poison nginx’s DNS cache are:

1. The attacker runs cram-md5-predict-random.py to predict the next txid value (which has a 75% chance of being correct).
2. He then either adopts a brute force approach and starts constantly sending spoofed DNS replies to nginx on different UDP ports: eventually the backend IP address will expire from nginx’s cache, nginx will attempt to re-resolve it, and it will accept a spoofed reply sent to the correct port. Or alternatively the attacker may save network bandwidth by sending the spoofed DNS replies only when nginx attempts to re-resolve the backend hostname.⁴
3. When nginx re-resolves the hostname, it may take a few hundred milliseconds to receive the legitimate reply. During this time the attacker sends spoofed DNS replies (call this his “first poisoning attempt”), and he may have time to send, say, 1000 spoofed replies with the predicted txid to 1000 different UDP ports.
4. The first poisoning attempt will likely fail because Linux by default selects the port from a range of 28232 ports (net.ipv4.ip_local_port_range = 32768 60999). So if he fails, the attacker tries again the next time the backend IP address expires from nginx’s cache, effectively giving him an unlimited number of tries. If the TTL is 1 hour and if he sends 1000 spoofed replies per attempt, then, provided the predicted txid (the candidate with a 75% chance) is in fact correct, the attack will succeed on average after ~28 poisoning attempts conducted over a period of 28 hours.
5. Nginx will then proxy all requests to a server of the attacker’s choice until the TTL expires. However, similarly to scenario 1, the next poisoning attempt can be conducted with a single packet sent to the known port, using the predicted txid.

I should point out that this attack works well if nginx is configured with only 1 worker process. If there are 2 or more, things become a little more complicated as each process uses its own UDP source port and has its own random() PRNG state.

As of 24 August 2016, this attack works on all nginx versions, including the current development branch. The nginx developers do not plan to fix it.

Vendor response

I reported these vulnerabilities to the developers, proposing 2 fixes:

1. Abandon random()/rand() as a PRNG to generate txids. Use instead entropy from the getrandom(2) syscall, /dev/urandom, or CryptGenRandom().
2. Randomize the UDP source port for every DNS query.
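
To make the two proposals concrete, here is a minimal sketch of what they amount to, written in Python for brevity (nginx itself is C, and this is not its code). It assumes a fresh socket per query, so that the OS assigns a new ephemeral source port each time, and it draws the txid from the operating system's CSPRNG through the standard secrets module. open_query_socket is a hypothetical helper name.

```python
import secrets
import socket


def open_query_socket(resolver_ip, resolver_port=53):
    """Return (sock, txid) to be used for exactly one DNS query."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", 0))  # port 0: the OS picks a fresh ephemeral source port
    sock.connect((resolver_ip, resolver_port))  # drop datagrams from other peers
    txid = secrets.randbits(16)  # 16 unpredictable bits from the OS CSPRNG
    return sock, txid


# 127.0.0.1 is used here only so the example runs without network access.
sock, txid = open_query_socket("127.0.0.1")
print("source port:", sock.getsockname()[1], "txid:", txid)
sock.close()  # closing after the reply means the next query gets a new port
```

With both values unpredictable, a blind attacker has to guess the port and the txid together on every attempt instead of learning the port once.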

On one hand they fixed the lack of seeding on Windows (issue 1), and made the seeding less predictable on all platforms by XOR’ing in milliseconds (issues 4 and 5).

But on the other hand they refused to consider any of these issues a security problem, did not publish a security advisory, and will not address the rest of the vulnerabilities (issues 2, 3, and 6): they continue to use random()/rand(), and they still fail to randomize source ports for each query. Therefore txids are still predictable and nginx’s cache can still be poisoned as per scenario 2.

Their argument is that none of these issues matter because they assume the resolver is trusted. I was told “the suggested use case is to operate a name server on localhost, or on the (trusted) local network. We’ll make sure to make it clear in documentation.”

(As of 25 August 2016, they still have not updated their documentation.)

I pointed out that in the past they did patch DNS vulnerabilities that can only be exploited by an untrusted resolver or untrusted network: CVE-2016-0742, CVE-2016-0746, and CVE-2016-0747. Therefore this is a complete flip-flop from their previous stance that DNS is untrusted.

Vulnerable nginx versions

As of 24 August 2016, all nginx versions that ever shipped with the resolver are affected by issues 1 through 6:

0.6.18 up to and including 0.6.39
0.7 up to and including 0.7.69
0.8 up to and including 0.8.55
1.0 up to and including 1.0.15
1.2 up to and including 1.2.9
1.4 up to and including 1.4.7
1.6 up to and including 1.6.3
1.8 up to and including 1.8.1
1.10 up to and including 1.10.1
1.11 up to and including 1.11.3

Future releases such as 1.11.4 should contain the fix for issues 1, 4, and 5 but will still be vulnerable to issues 2, 3, and 6.

Email 1

This is the report I initially sent to the nginx developers on 27 July 2016:

From: Marc Bevand

Hello,

I discovered that the built-in stub resolver in nginx fails to securely randomize the DNS txid (transaction ID). See the three lines like this one in ngx_resolver.c:

ident = ngx_random();

ngx_random is a macro for random() on Linux/Unix, and rand() on Windows. However these PRNGs are weak, predictable, and not at all or poorly seeded by nginx:

- On Windows, nginx never seeds it via srand() so the C library defaults to a seed of 1, causing the txids to always use the same sequence: 41, 18467, 6334, 26500, 19169, 15724…!
- On Linux, the GNU C library implementation of random() enables prediction of the next few values with very high confidence when knowing the 31 previous ones. And nginx may leak these previous values in various ways, for example through the CRAM-MD5 salts of the ngx_mail_core_module, or through the $request_id variable if exposed to HTTP clients (both call ngx_random() to generate entropy and return the full 31-bit value.)
- On Linux, even if the previous random() values are not known, with master_process off, nginx seeds the PRNG with srandom(ngx_time()) however the time in seconds can typically be guessed from the host’s uptime which is known through TCP timestamps.
- On Linux/Unix, with master_process on, the process ID is XOR’d into the seed but the PID typically has low entropy if nginx is started during boot (~10 bits.)
- On Windows, even if the PRNG was well seeded and not predictable, the Microsoft C library implementation of rand() limits the entropy of values to 15 bits anyway, not 16 (MAX_RAND is 0x7fff), which makes guessing a txid twice more likely.

Also, nginx fails to randomize the UDP source port for each DNS query. It relies on the OS to pick a random port, but then uses it for all subsequent queries indefinitely. Reloading the config (nginx -s reload) does not change the port. An nginx instance will continue to use the same port until nginx is stopped and restarted.

Combined together, the predictable txids and invariable UDP source ports make DNS cache poisoning attacks possible. All nginx versions released in the last 9 years are affected (since 0.6.18 implemented the “resolver” directive).

The sections below provide more details.

## Windows

Let me start with demonstrating the behavior of nginx on Windows. I created a simple nginx config:

resolver 10.246.148.9;
location /1/ { set $target http://name-1.lan; proxy_pass $target; }
location /2/ { set $target http://name-2.lan; proxy_pass $target; }
location /3/ { set $target http://name-3.lan; proxy_pass $target; }

Note that using variables (proxy_pass $target) is necessary to demonstrate the vulnerability because this leads to hostname resolution at runtime using ngx_resolver.c. If a plain URL is specified (proxy_pass http://name-1.lan) the hostname will be resolved during nginx configuration initialization using gethostbyname() or getaddrinfo().

Here is a tcpdump capture of the DNS queries sent by nginx right after starting it up and after requesting /1/, /2/, and /3/ (.10 is nginx and .9 is the recursive resolver):

IP 10.246.148.10.56524 > 10.246.148.9.53: 41+ A? name-1.lan. (28)
IP 10.246.148.10.56524 > 10.246.148.9.53: 18467+ AAAA? name-1.lan. (28)
IP 10.246.148.10.56524 > 10.246.148.9.53: 6334+ A? name-2.lan. (28)
IP 10.246.148.10.56524 > 10.246.148.9.53: 26500+ AAAA? name-2.lan. (28)
IP 10.246.148.10.56524 > 10.246.148.9.53: 19169+ A? name-3.lan. (28)
IP 10.246.148.10.56524 > 10.246.148.9.53: 15724+ AAAA? name-3.lan. (28)

On Windows, nginx generates the txids using rand() and the values seen above (41, 18467, 6334, 26500, 19169, 15724…) correspond to the values generated by the Microsoft C library implementation of rand() with the default seed value of 1. In other words: nginx fails to seed the PRNG, therefore all txids are trivially predictable.

Also, nginx relies on Windows picking a random UDP source port (56524 in my example), but this port will be reused indefinitely for all subsequent queries (for name-1.lan, name-2.lan…) The only way the port will change is if nginx is stopped and restarted.

These two flaws are a textbook example of a bad DNS implementation vulnerable to cache poisoning. Whenever an HTTP client requests a URL proxied to a host whose record cached by nginx has expired, ngx_resolver.c will send a query to the recursive resolver which may take tens or hundreds of milliseconds to reply. This creates a window of opportunity for the attacker to send hundreds or thousands of spoofed DNS replies with different UDP source ports and txids. During this time nginx will log one error per response with a mismatched txid:

[error] 18256#0: wrong ident 12428 response for name-1.lan, expect 51926

But if guessed correctly, the ngx_resolver.c cache would be poisoned and would cause nginx to pass the proxied HTTP requests to an IP chosen by the attacker, potentially serving malicious content to HTTP clients.

## Linux/Unix

On Linux/Unix, nginx generates the txids using random() and the PRNG is seeded in src/os/unix/ngx_process_cycle.c:

srandom((ngx_pid << 16) ^ ngx_time());

However this too is insecure. For starters random() and rand() are not cryptographically secure PRNGs therefore future txids can be predicted based on previous ones. For example, with the GNU C library implementation, knowing 31 consecutively generated random values lets you calculate the next few ones with high confidence.

And in nginx’s particular case, the seed itself is not very random. ngx_pid is the process ID which is likely a small integer in the 0–2000 range if nginx is started at boot time (~10 bits of entropy). As to ngx_time() it is the time in seconds since Epoch when nginx was started, or approximately the host’s uptime if nginx is started at boot time. But the uptime can be guessed pretty accurately, for example the majority of Linux hosts indirectly disclose their uptime through TCP timestamps (“nmap -v -O x.x.x.x” and look for “Uptime guess” which is often accurate within minutes).

Oh, another bad thing on Linux: if nginx is configured with master_process off (not the default) the PRNG is instead initialized with only the time (not XOR’d with the PID) in src/os/unix/ngx_posix_init.c:

srandom(ngx_time());

## Other exploitable scenarios

I demonstrated an exploitable scenario when using proxy_pass, but of course other ones are possible, for example when proxying to hostnames using fastcgi_pass, uwsgi_pass, scgi_pass, or memcached_pass.

## Suggested fixes

### txid

The standard random() or rand(), even well seeded, are not cryptographically secure PRNGs and should not be used to generate DNS txids. PowerDNS had the same vulnerability. Microsoft too. Pretty much all DNS implementations have switched to cryptographically secure PRNGs since July 2008 when Dan Kaminsky disclosed the DNS flaw, and so should nginx.

Even though exploitability may be reduced in the context of nginx (because the window of opportunity to spoof DNS replies only exists when a cached record expires), nginx should use a cryptographically secure PRNG to generate the txid.

On Linux kernels >= 3.17, I suggest using the getrandom(2) syscall. On older Linux kernels and other Unix platforms, nginx should read from /dev/urandom. On Windows, nginx should use the CryptGenRandom() API.

### UDP source port

nginx should explicitly bind the resolver socket to a random UDP source port also chosen using a cryptographically secure PRNG. And it should use a different port for every DNS query if possible. If not possible the port should at least be rotated periodically, eg. after every N DNS queries with a small value of N (10?)

I was enticed to do a security review of nginx because the Internet Bug Bounty program provides bounties through the HackerOne platform. So, FYI, I plan to submit my reports to them after you resolve the issues.

Let me know if you need more information. I am happy to help!

Thanks,
-Marc

Email 2

I received a reply the same day:

Date: Wed, 27 Jul 2016 11:26:48 +0300
Message-ID: <89b5303f-5113-eaf6-f196-8d6a339a6fa3@nginx.com>
Subject: Re: nginx DNS cache poisoning
From: Maxim Konovalov <maxim@nginx.com>
To: Marc Bevand <m.bevand@gmail.com>
Cc: security-alert@nginx.org

Hi Marc,

[...]

thanks for the comprehensive report.

We are going to analyze it and answer in the following couple of weeks.

Just a side note before the official answer: nginx resolver should
be always used in a properly secured environment. In this particular
case some ingress rules among with other measures should be in place
to prevent DNS spoofing.

This is actually a universal advise for any similar setups.

-- 
Maxim Konovalov

Email 3

Maxim and I exchanged 4 more emails (not listed here) to clarify the exact risk. Maxim was mostly defensive, but said one of his resolver devs would look into it.

On 03 Aug 2016, the developer in charge of the resolver replied:

Date: Thu, 4 Aug 2016 01:16:15 +0300
Message-ID: <20160803221615.GG95280@lo0.su>
Subject: Re: nginx DNS cache poisoning
From: Ruslan Ermilov <ru@nginx.com>
To: Marc Bevand <m.bevand@gmail.com>
Cc: security-alert@nginx.org

Hi Marc,

On Wed, Jul 27, 2016 at 01:13:15AM -0500, Marc Bevand wrote:
[...]
>    - On Windows, nginx never seeds it via `srand()` so it always uses the
>    same sequence of txids (41, 18467, 6334, 26500, 19169, 15724...!)

Good catch.  We're internally reviewing the patch that fixes
PRNG seeding on Windows.  Surprisingly, on Windows srand()
only seeds the current thread:
https://msdn.microsoft.com/en-us/library/f0d4wb4t.aspx
Moreover, new threads do not inherit the current sequence point,
and PRNG inside new threads behave as if srand() was called with
the seed value of 1.

# HG changeset patch
# User Ruslan Ermilov <ru@nginx.com>
# Date 1470262541 -10800
#      Thu Aug 04 01:15:41 2016 +0300
# Node ID 09c918460cc67f1c4e9c222e36051dc0343b2c20
# Parent  d43ee392e825186545d81e683b88cc58ef8479bc
Win32: added per-thread random seeding.

The change in b91bcba29351 was not enough to fix random() seeding.
On Windows, the srand() seeds the PRNG only in the current thread,
and worse, is not inherited from the calling thread.  Due to this,
worker threads were not properly seeded.

Reported by Marc Bevand.

diff --git a/src/os/win32/ngx_process_cycle.c b/src/os/win32/ngx_process_cycle.c
--- a/src/os/win32/ngx_process_cycle.c
+++ b/src/os/win32/ngx_process_cycle.c
@@ -764,6 +764,8 @@ ngx_worker_thread(void *data)
     ngx_int_t     n;
     ngx_cycle_t  *cycle;
 
+    srand((ngx_pid << 16) ^ (unsigned) ngx_time());
+
     cycle = (ngx_cycle_t *) ngx_cycle;
 
     for (n = 0; cycle->modules[n]; n++) {

>    - On Linux, the GNU C library implementation of `random()` enables
>    prediction of the next few values with very high confidence when knowing
>    the 31 previous ones (I can provide a test program to demonstrate this).
>    And nginx may leak these previous values in various ways, for example
>    through the CRAM-MD5 salts of the `ngx_mail_core_module`, or through the
>    `$request_id` variable if exposed to HTTP clients (both call `ngx_random()`
>    to generate entropy and return the full 31-bit value.)
>    - On Linux, even if the previous `random()` values are not known, with
>    `master_process off`, nginx seeds the PRNG with `srandom(ngx_time())`
>    however the time in seconds can typically be guessed from the host's uptime
>    which is known through TCP timestamps.

The "master_process off" mode is not intended for production
use, so no surprise here: http://nginx.org/r/master_process

>    - On Linux/Unix, with `master_process on`, the process ID is XOR'd into
>    the seed but the PID typically has low entropy if nginx is started during
>    boot (~10 bits.)
>    - On Windows, even if the PRNG was well seeded and not predictable, the
>    Microsoft C library implementation of `rand()` limits the entropy of values
>    to 15 bits anyway, not 16 (`MAX_RAND` is `0x7fff`), which makes guessing a
>    txid twice more likely.
> 
> Also, nginx fails to randomize the UDP source port for each DNS query. It
> relies on the OS to pick a random port, but then uses it for all subsequent
> queries indefinitely.
>
> Reloading the config (`nginx -s reload`) does not change the port.

This is not true for the normal case on UNIX when there are
separate worker processes (master_process on).

> An nginx instance will continue to use the same port until
> nginx is stopped and restarted.

True.  The socket() and connect() syscalls are made only
once for each worker process for each name server.

> Combined together, the predictable txids and invariable UDP source ports
> make DNS cache poisoning attacks possible. All nginx versions released in
> the last 9 years are affected (since 0.6.18 implemented the "resolver"
> directive).

The only reason why nginx implements its own resolver is
that libc resolver is not async.

The general rule of thumb is that the resolver that you
configure in nginx should be trusted.  The same holds
true for upstream servers (for many reasons, including
X-Accel-Redirect).  The resolver code in nginx is mainly
used to resolve names of upstream servers when proxying.

The suggested use case is to operate a name server on
localhost, or on the (trusted) local network.  We'll
make sure to make it clear in documentation.

> The sections below provide more details.
[...]
> On Windows, nginx generates the txids using `rand()` and the values seen
> above (41, 18467, 6334, 26500, 19169, 15724...) correspond to the values
> generated by the Microsoft C library implementation of `rand()` with the
> [default seed value of 1](http://www.pyaray.com/articles/random.htm). In
> other words: nginx fails to seed the PRNG, therefore all txids are
> trivially predictable.

True, this needs to be fixed.

[...]
> Let me know if you need more information. I am happy to help!

To summarize:

- we're not considering much of it a real problem

- we'll document the requirement of trusted DNS

- we'll fix PRNG seeding on Windows

Email 4

I responded:

Date: Sat, 6 Aug 2016 23:07:47 -0500
Message-ID: <CADH-5r1D7OxLbqD75YQmGc2F5OOt27rgwgutW2tSbXMDsNCT-A@mail.gmail.com>
Subject: Re: nginx DNS cache poisoning
From: Marc Bevand <m.bevand@gmail.com>
To: Ruslan Ermilov <ru@nginx.com>
Cc: security-alert@nginx.org

Hi Ruslan,

Sorry for my delayed reply. I am relocating from San Diego to Saint Louis
and the move is keeping me busy!

Your patch to call srand() on Windows in the worker threads is the right
place to seed. I don't have a Windows compiler environment to test, but the
patch looks correct.

> The "master_process off" mode is not intended for production
> use, so no surprise here: http://nginx.org/r/master_process

I am aware of that. But even Windows with "master_process off" seeds with
pid & time, so why not Linux? For the sake of consistency it should be
seeded the same way:

  1. With master_process off, on Linux:   seed with time only (see ngx_posix_init.c)
  2. With master_process off, on Windows: seed with pid & time
  3. With master_process on,  on Linux:   seed with pid & time
  4. With master_process on,  on Windows: seed with pid & time (as fixed by your patch)

I would not be surprised to learn there are strange/unusual nginx
deployments that decide to run with "master_process off". Such users would
unknowingly expose themselves to poor seeding.

> This is not true for the normal case on UNIX when there are
> separate worker processes (master_process on).

You are correct, I failed to test with master_process on.

> The only reason why nginx implements its own resolver is
> that libc resolver is not async.
> The general rule of thumb is that the resolver that you
> configure in nginx should be trusted.

Side note: libc resolvers do take care of randomizing the port for each
query, and do take care of randomizing the txid without using predictable
RNG like rand()/random(). If the nginx resolver is meant to replace the
libc resolver, it should adopt at least the same security measures. Yes the
attack surface is more limited in nginx because an attacker does not
control which hostnames are resolved (and he may not know them), but the
attack surface does exist.

You say the resolver should be trusted, but then why did you guys publish
this security advisory describing vulnerabilities that can only be
exploited by a malicious *untrusted* resolver (or an attacker spoofing the
resolver)?:
http://mailman.nginx.org/pipermail/nginx-announce/2016/000169.html This
advisory mentions "an attacker who is able to forge UDP packets from the
DNS server", and this is precisely the attack I describe is possible
because the port and txid are not securely randomized, thereby enabling
forging.

I strongly suggest the nginx team to adopt a consistent security policy:
either you always trust the resolver (shouldn't have published the above
advisory), or never trust it (need to securely randomize both the port and
txid).

-Marc

Email 5

I had not received a reply after a few weeks, so I published this blog post and sent another email:

Date: Wed, 24 Aug 2016 14:45:30 -0500
Message-ID: <CADH-5r0Rnj8tRihr0n8gWinTFg-WGEsxoO0_pP-6s6fxxneZdg@mail.gmail.com>
Subject: Re: nginx DNS cache poisoning
From: Marc Bevand <m.bevand@gmail.com>
To: Ruslan Ermilov <ru@nginx.com>
Cc: security-alert@nginx.org

Hello,

You never explained why the nginx team changed their stance about DNS
vulnerabilities. To re-explain your inconsistency: in the past you worried
about *"an attacker who is able to forge UDP **packets"*:
http://mailman.nginx.org/pipermail/nginx-announce/2016/000169.html but
today you claim there is a *"**requirement of trusted DNS"*.

You never replied to my email, so I take it you don't care about the
vulnerabilities I reported. Therefore I decided to publicly disclose them
in order to make nginx users aware of the risks they take:
http://blog.zorinaq.com/nginx-resolver-vulns/

Also, you never updated your documentation as you said you would.

-Marc

Email 6

They replied:

Date: Thu, 25 Aug 2016 14:56:14 +0300
Message-ID: <20160825115614.GC28899@lo0.su>
Subject: Re: nginx DNS cache poisoning
From: Ruslan Ermilov <ru@nginx.com>
To: Marc Bevand <m.bevand@gmail.com>
Cc: security-alert@nginx.org

On Wed, Aug 24, 2016 at 02:45:30PM -0500, Marc Bevand wrote:
> Hello,
> 
> You never explained why the nginx team changed their stance about DNS
> vulnerabilities. To re-explain your inconsistency: in the past you worried
> about *"an attacker who is able to forge UDP **packets"*:
> http://mailman.nginx.org/pipermail/nginx-announce/2016/000169.html but
> today you claim there is a *"**requirement of trusted DNS"*.

The relevant quote from the above message:

- Invalid pointer dereference might occur during DNS server response
  processing, allowing an attacker who is able to forge UDP
  packets from the DNS server to cause worker process crash
  (CVE-2016-0742).

- Use-after-free condition might occur during CNAME response
  processing.  This problem allows an attacker who is able to trigger
  name resolution to cause worker process crash, or might
  have potential other impact (CVE-2016-0746).

- CNAME resolution was insufficiently limited, allowing an attacker who
  is able to trigger arbitrary name resolution to cause excessive resource
  consumption in worker processes (CVE-2016-0747).

Out of these three, the last two allowed to attack nginx
even if trusted name server (such as the one operating on
127.0.0.1) was configured.  That's why we decided to issue
an advisory.  If it was only the first one, we would never
release it.

> You never replied to my email, so I take it you don't care about the
> vulnerabilities I reported.

The position is clearly stated at the end of "Email 3" that
you cited in full in your awesome blog post.

>  Therefore I decided to publicly disclose them
> in order to make nginx users aware of the risks they take:
> http://blog.zorinaq.com/nginx-resolver-vulns/

> Also, you never updated your documentation as you said you would.

This task is scheduled and will be completed soon.

[...]

-- 
Ruslan Ermilov

Email 7

Date: Thu, 25 Aug 2016 14:41:33 -0500
Message-ID: <CADH-5r3oqUCQQFMO7KCi2Q5EypjHt2ADqg8tG5K-kiarBa0xrQ@mail.gmail.com>
Subject: Re: nginx DNS cache poisoning
From: Marc Bevand <m.bevand@gmail.com>
To: Ruslan Ermilov <ru@nginx.com>
Cc: security-alert@nginx.org

Ok, I can see that you genuinely try to follow a certain logic into
classifying what is not and what is a vulnerability (eg. the last 2 of the
3 CVEs I quoted.) I disagree with this logic, but I understand it.

I urge you send the message *quickly and loudly* to your users that the
resolver running on a trusted network is a requirement. It is evident that
many users are unaware of it and use Google Public DNS, OpenDNS, etc. For
example compare this:

https://www.google.com/#q=nginx+%22resolver+8.8.8.8%22+OR+%22resolver+208.67.222.222%22
(30k results)

with that:
https://www.google.com/#q=nginx+%22resolver+127.0.0.1%22
(4k results)

Thanks,
-Marc

$request_id is generated from 4 calls to random() (see ngx_http_variable_request_id()), and this variable is sometimes exposed to HTTP clients through headers. However, if nginx is built with ngx_http_ssl_module, then $request_id is generated from OpenSSL’s RAND_bytes() instead. ↩
If the attacker controls 100 IP addresses, he could split his 1000 port guesses into 100 groups of 10 ports, and build spoofed DNS replies with a different IP address for each group. Then, depending on which IP address nginx proxies requests to, the attacker can determine which group of 10 ports contained the correct guess. Similarly, if the attacker has access to 1000 IP addresses, he can pinpoint exactly which port guess was correct. ↩
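The grouping idea in this footnote can be sketched in a few lines of Python. This is only an illustration: the port range and the 198.51.100.x addresses (a documentation range) are hypothetical values, and the helper name assign_ports_to_ips is mine, not part of any tool.

```python
def assign_ports_to_ips(port_guesses, attacker_ips):
    """Split the port guesses evenly across attacker IPs; return {ip: [ports]}."""
    # Ceiling division, so every guess lands in some group
    group_size = -(-len(port_guesses) // len(attacker_ips))
    return {
        ip: port_guesses[i * group_size:(i + 1) * group_size]
        for i, ip in enumerate(attacker_ips)
    }

# Hypothetical values: 1000 candidate source ports, 100 attacker-controlled IPs
ports = list(range(32768, 33768))
ips = [f"198.51.100.{n}" for n in range(1, 101)]
groups = assign_ports_to_ips(ports, ips)

# Each spoofed reply sent to a port in groups[ip] would carry `ip` as its answer.
# If nginx then starts proxying to 198.51.100.7, the correct port is one of:
candidates = groups["198.51.100.7"]
print(candidates)
```

With 1000 addresses instead of 100, each group shrinks to a single port, which is the "pinpoint" case described above.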
The script prints 2 possible values because 2 bits of the PRNG state are unknown to the attacker. If both are 1 they generate a carry, incrementing the return value by 1. This means the probability that the first predicted value is correct is 75% (at least one of the bits is 0), and the probability that the second predicted value is correct is 25% (both bits are 1). ↩
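The 75%/25% split can be checked by brute force. The sketch below assumes the behavior of glibc's default random(): each new 32-bit state word is the sum of two earlier state words, and the value returned to the caller is that word with its lowest bit discarded. It also assumes the two discarded bits are independent and uniformly distributed. The function names and the sample inputs are illustrative only.

```python
from itertools import product

MASK32 = 0xFFFFFFFF

def output(state_word):
    """Return value seen by the caller: the state word minus its lowest bit."""
    return (state_word & MASK32) >> 1

def carry_outcomes(seen_a, seen_b):
    """Count predicted outputs over the 4 combinations of unknown low bits.

    seen_a and seen_b are two earlier return values observed by the attacker.
    """
    counts = {}
    for low_a, low_b in product((0, 1), repeat=2):
        word_a = (seen_a << 1) | low_a  # rebuild a candidate state word
        word_b = (seen_b << 1) | low_b
        predicted = output(word_a + word_b)
        counts[predicted] = counts.get(predicted, 0) + 1
    return counts

# Two arbitrary observed values, chosen only for illustration
counts = carry_outcomes(41, 18467)
print(counts)
```

Three of the four bit combinations yield the plain sum of the two observed values, and only the combination where both hidden bits are 1 yields the sum plus one.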
One way for the attacker to determine when the backend IP address expires from nginx’s cache is via the following timing attack. The attacker could send an HTTP request (proxied to the backend) to nginx once per second. If he notices a periodic anomaly in nginx’s response time (e.g. the average response time is 100 milliseconds, but once every hour one request takes 200 milliseconds), and if that period corresponds to the backend IP address TTL (1 hour), then he can reasonably assume the increased response time is caused by nginx resolving the backend hostname. ↩
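The analysis step of this timing attack could look like the following minimal sketch. It works on a list of response times sampled once per second; the data here is synthetic, and the threshold factor, the tolerance, and both function names are assumptions chosen for illustration. Real measurements would be much noisier.

```python
from statistics import median

def slow_sample_indices(response_times_ms, factor=1.5):
    """Indices of samples noticeably slower than the median response time."""
    baseline = median(response_times_ms)
    return [i for i, t in enumerate(response_times_ms) if t > factor * baseline]

def matches_ttl(slow_indices, ttl_seconds, tolerance=2):
    """True if consecutive slow samples are spaced about one TTL apart."""
    gaps = [b - a for a, b in zip(slow_indices, slow_indices[1:])]
    return bool(gaps) and all(abs(g - ttl_seconds) <= tolerance for g in gaps)

# Synthetic data: one sample per second for 3 hours, 100 ms baseline,
# with one 200 ms response each time the hypothetical 1-hour TTL expires
ttl = 3600
samples = [100.0] * (3 * ttl + 1)
for i in range(0, len(samples), ttl):
    samples[i] = 200.0

slow = slow_sample_indices(samples)
print(slow, matches_ttl(slow, ttl))
```

If the spacing of the slow responses matches the TTL, the attacker knows roughly when the next resolution will happen and can time his spoofed replies accordingly.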