Sunday, May 22, 2011

PostgreSQL - strange database server errors under Windows

I often use PostgreSQL as a database server for my web applications which are usually deployed under Tomcat. PostgreSQL and Tomcat are registered as system services which starts automatically when Windows starts. 

In one of my application I have about 50 different databases with connection pools for each database, where each pool has about 10 connections opened. So I have about 500 connections opened for requests. Each opened connection is visible as a separate postgres.exe process in the system next to the own PostgreSQL processes also visible as postgres.exe (usually 5 processes). Sometimes under Windows XP Prof. SP 3 I saw PostgreSQL stop working and in the server logs there were errors like:
  •  java.net.SocketException: No buffer space available (maximum connections reached?): recv failed
or:  
  • server process (PID XXXX) was terminated by exception 0xC0000142 HINT:  See C include file "ntstatus.h" for a description of the hexadecimal value.
--------------------------------------------
No buffer space available error:

"Maximum connection reached" may suggest that we set too low value of max_connections in postgresql.conf file. Let's assume that settings for connections amount and required memory are OK for the server, but the problem is still visible from time to time.

The first thing to be checked is amount of short-lived ports (ephemeral). Those ports are used by applications during their normal work. They are opened and live until certain application works, in comparison to server application ports that are typically open for the entire time that the server computer is running. They are assigned automatically from predefined range. Range differs between operating systems:

a) Windows 2000 -> from 1024 to 5000
b) Windows 2003 Server -> from 1024 to 5000
c) Windows XP (all service packs) -> from 1024 to 5000
d) Windows Vista -> from 49152 to 65535
e) Windows 2008 Server -> from 49152 to 65535
f) Windows 7 -> from 49152 to 65535
g) Linux kernels -> from 32768 to 61000 (see /proc/sys/net/ipv4/ip_local_port_range)

If the available pool of ephemeral ports is exahsuted we see Windows WSAENOBUFS error (no buffer space, code 10055). In order to check for exceeded range, we have to see how many processes are running in the system and how many ports they are opening - we can use free TCPView application. We can kill the process that causes exceeding or we can increase the range of ephemeral ports. If we want to increase it, we have to change a registry entry named maxUserPort in:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\maxUserPort
 
If that registry entry does not exist it means that system uses default value (see above) and we have to create this entry. Then we have to set a proper value for that entry (REG_DWORD) from the range 5000 to 65534. 

The second thing to be checked is limit of half-opened TCP/IP connections (connection with state SYN-RCVD). When this limit is exceeded, TCP/IP starts up SYN flooding attack protection (when SynAttackProtect is enabled). This limit differs between operating systems:

a) Windows 2000 Workstation -> 500
b) Windows 2000 Server -> 100
c) Windows Server 2003 -> 500 
d) Windows XP SP 1 -> 100
e) Windows XP SP 2, SP3 -> 10
f) Windows Vista SP 1 -> 10
g) Windows Vista SP 2 -> limit removed (not implmented = unlimited)
h) Windows Server 2008 SP 1 -> 10
i) Windows Server 2008 SP 2 -> limit removed (not implemented = unlimited)
f) Windows 7 -> limit removed (not implemented = unlimited) 

We can use mentioned TCPView application to check for half-opened TCP connections. This limit can be increased by changing the registry entry named TcpMaxHalfOpen in:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\TcpMaxHalfOpen

If that registry entry does not exist it means that system uses default value (see above) and we have to create this entry and then manualy assign new value.

Under Windows Vista SP2, Windows 7 or Linux I did not see that error.

--------------------------------------------
0xC0000142 error:

This problem is not so easy to explain, although I was able to fix it. The error code means that "A dynamic link library (DLL) initialization routine failed." I am not sure what exacly causes this... 

In my case under Windows XP Prof. SP 3, when I started my application and number of postgres.exe processes was about 100 (it should be about 500, see description at the top) the whole database server process was suddenly terminated and I saw second mentioned error (with error code 0xC0000142) in the logs. The most important part here is that PostgreSQL service, which spawned many postgres.exe child processes, was registered under LOCAL SYSTEM account, without "Allow Service to Interact with Desktop" option set.

It seems that this problem is not only connected with PostgreSQL. It is general problem with services and their subprocesses (see this and that article). From the information found the problem my be connected with so-called desktop heaps (see more here and here). Some people having similar problems with many subprocesses did the workaround by setting "Allow Service to Interact with Desktop" to true for their LOCAL SYSTEM services. I was able to fix it under Windows XP Prof. SP 3, by registering PostgreSQL service for regular user account instead of LOCAL SYSTEM (see "Long Version" from my article about manual installation of PostgreSQL how to do this). I also checked this under Windows Vista SP2 and Windows 7 - it works OK there, even under LOCAL SYSTEM account. Under Linux I did not see that error.

I did not try to assign more desktop heap from default for Windows XP SharedSection=1024,3072,512 (2nd or 3rd value needs to be changed) and then try it again to run under LOCAL SYSTEM without "Allow Service to Interact with Desktop". Perhaps this might also help.