LDAP cache tuning

From Messaging Server Technical Reference Wiki

7/24/09: This document has been moved to the Communications Suite Information wiki at LDAP Cache Tuning. Refer to that page from now on.

NOTE: This document is a work-in-progress

Background

The topic of how to tune service.authcachesize and service.authcachettl came up recently. For various historical reasons, a customer had the size set far too large -- so large that the behavior looked like a memory leak, which took days or weeks to build up to a system-crippling level. The Comm6 configutil doc now has a comment which better describes how these parameters work. That doc will be officially corrected eventually.

But this still leaves us with the question of how to tune these cache parameters, which leads to the question of how to observe the effects of their settings and to a general discussion of the many different caches that exist in Messaging Server. Even if we limit that to just caches of LDAP information, it is a very big topic.

Store auth cache

The auth cache (controlled by the service.authcachesize and service.authcachettl configutil parameters) is a per-process cache used to reduce LDAP requests for repeated logins by the same user. The following dtrace script can be used to monitor the usage of this cache in an individual imapd, popd, or mshttpd process on a 32-bit Solaris 10 system.

#!/usr/sbin/dtrace -qs

/*
 * arg0 is used as the address of the cache being searched. The offsets
 * below locate its statistics counters on a 32-bit build only.
 */
pid$target::digestcache_find:entry
{
/*  @cache_users[ustack(), arg0] = count(); */
  @cache_misses[arg0]   = max((int)*(int *)copyin(arg0+12,4));
  @cache_hits[arg0]     = max((int)*(int *)copyin(arg0+16,4));
  /* The entry count can rise and fall, so track both extremes. */
  @cache_MinEnts[arg0]  = min((int)*(int *)copyin(arg0+20,4));
  @cache_MaxEnts[arg0]  = max((int)*(int *)copyin(arg0+20,4));
  @cache_expires[arg0]  = max((int)*(int *)copyin(arg0+24,4));
  @cache_overflow[arg0] = max((int)*(int *)copyin(arg0+28,4));
}

profile:::tick-20s
{
  exit(0);
}

END
{
/*  printa("%k 0x%x %@d\n", @cache_users); */
 printf("\n   cache %8s %8s %8s %8s %8s %8s\n", "misses", "hits", "minEnts", "maxEnts",
    "expires", "overflow");
 printa("0x%x %@8d %@8d %@8d %@8d %@8d %@8d\n", @cache_misses, @cache_hits, @cache_MinEnts,
    @cache_MaxEnts, @cache_expires, @cache_overflow);
}

As coded above, it waits 20 seconds and then ends with the report. You can change "tick-20s" to change the timeout, or press Ctrl-C to end it sooner. If there is no activity (i.e., on an idle lab system), it produces nothing, because it reads the cache statistics structures only when the cache is being used. So if there is no use, we get nothing:

# ./authcache_stats.d -p 8043

   cache   misses     hits  minEnts  maxEnts  expires overflow
#

If you log in to IMAP while the dtrace script is running, it may look like:

   cache   misses     hits  minEnts  maxEnts  expires overflow
0x292058       11       20        3        3        8        0
0x2f3b70       13        8        1        1        5        0

If you uncomment the two references to @cache_users, the report will also show what routines use the cache and from that you can more exactly determine which cache is which.

misses, hits, expires, and overflows can only go up. But the number of entries can go up and down, so we record both the min and max. Actually, if nEntries goes down, it is only for a brief time until the entry is reused - either because of an expire or an overflow. So we could say that entries will only go up, but we are more interested to see how much it moves during the monitoring period.

misses is incremented both when we simply do not find what we were looking for in the cache and when we found it, but its TTL had expired. In both cases, we have to go to LDAP to do the lookup.

hits is obvious - we found what we were looking for and used it.

The cache hit rate (i.e., hits/(hits+misses)) depends on the number of repeat logins during the TTL and on how logins are distributed across the processes for each service. In other words, fewer processes (e.g. 64-bit) means a better cache hit ratio.
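As a quick illustration of the formula, here is a minimal Python sketch that computes the hit rate from one row of the script's report. The function names and the row parsing are illustrative assumptions; they are not part of Messaging Server or of the dtrace script.

```python
# Column order of the report printed by the dtrace script, after the address.
COLUMNS = ["misses", "hits", "minEnts", "maxEnts", "expires", "overflow"]


def parse_report_row(row):
    """Split one report row into a dict keyed by the report's column names."""
    fields = row.split()
    stats = dict(zip(COLUMNS, (int(f) for f in fields[1:])))
    stats["cache"] = fields[0]  # the cache address, kept as text
    return stats


def hit_rate(hits, misses):
    """Return hits / (hits + misses), or None if the cache saw no lookups."""
    total = hits + misses
    if total == 0:
        return None
    return hits / total


if __name__ == "__main__":
    row = parse_report_row(
        "0x292058       11       20        3        3        8        0")
    print(f"{row['cache']}: hit rate {hit_rate(row['hits'], row['misses']):.0%}")
```

Returning None for an idle cache avoids a division by zero and keeps "no lookups" distinct from "0% hits".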

minEnts and maxEnts are the lowest and highest values we saw for the number of entries in the cache. These are entries that are not free. They may be under or over their TTL - we don't know that until we happen to look for the entry later. So this just means we put this item in the cache and it is still there - not whether its TTL is valid or not.

expires is when we found what we were looking for in the cache, but the TTL had expired, so we had to do the LDAP lookup again anyway.

overflows is the number of times we wanted to add an entry to the cache but the cache already held the maximum number of entries and none were free, so we had to free the least frequently used entry.
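The counter definitions above can be sketched as a toy model in Python. This is only an illustration of when each counter is incremented, under the assumption that an expired entry is refreshed in place. The class and its names are hypothetical and do not reflect the real implementation behind digestcache_find.

```python
import time


class ToyAuthCache:
    """Toy model of the four counters only; not the real auth cache code."""

    def __init__(self, max_entries, ttl_seconds, clock=time.monotonic):
        self.max_entries = max_entries
        self.ttl = ttl_seconds
        self.clock = clock
        self.entries = {}  # key -> [value, stored_at, use_count]
        self.hits = self.misses = self.expires = self.overflow = 0

    def lookup(self, key, ldap_lookup):
        now = self.clock()
        entry = self.entries.get(key)
        if entry is not None and now - entry[1] < self.ttl:
            self.hits += 1
            entry[2] += 1
            return entry[0]
        # Not found, or found but past its TTL: both count as a miss.
        self.misses += 1
        if entry is not None:
            self.expires += 1  # found, but the TTL had expired
        elif len(self.entries) >= self.max_entries:
            # Cache is full: reclaim the least frequently used entry.
            victim = min(self.entries, key=lambda k: self.entries[k][2])
            del self.entries[victim]
            self.overflow += 1
        value = ldap_lookup(key)  # every miss costs one LDAP lookup
        self.entries[key] = [value, now, 1]
        return value
```

Passing the clock in as a parameter makes it easy to step time forward by hand and watch expires and overflow change.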

The size of an entry depends on whether the process is 32-bit or 64-bit. The main body of this cache is preallocated when the process starts up. However, a non-free entry in the first cache also contains an LDAP result structure. So the simplest, although not perfectly accurate, way to describe this is to say roughly 3 KB per entry.
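To get a feel for what the per-entry figure means, the following Python sketch multiplies it out. It assumes the approximate 3 KB figure applies to every entry and that every entry is in use, so treat the result as a rough upper bound rather than a measurement. The function name, its parameters, and the 1,000,000 value are hypothetical.

```python
BYTES_PER_ENTRY = 3 * 1024  # rough per-entry figure from the text, not exact


def auth_cache_bytes(authcachesize, processes=1, caches_per_process=1):
    """Rough worst case: every entry of every cache in every process is in use."""
    return authcachesize * BYTES_PER_ENTRY * processes * caches_per_process


if __name__ == "__main__":
    # 10,000 is the default size; 1,000,000 is a made-up oversized setting.
    for size in (10_000, 1_000_000):
        mib = auth_cache_bytes(size) / (1024 * 1024)
        print(f"authcachesize={size}: about {mib:.0f} MiB per cache")
```

With the default of 10,000 entries this comes to about 29 MiB per cache. The processes and caches_per_process parameters scale that up for a whole service.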

An overflow is not a bad thing. It just means cache entries are being reclaimed from the least-frequently-used list.

You want to balance memory use vs load on the LDAP server. If you wanted to try to completely avoid the LDAP server, you would set the cache large enough to contain all the users on this system and the TTL to a very high value. Then you would never see any expires or overflows. Of course this could use a lot of memory and will cause problems when you try to change passwords or user status. So you also need to balance the TTL against your need for updates in the LDAP service to be noticed.

On the heavily loaded systems at the customer who had service.authcachesize set way too high, we found that the default value of 10,000 was adequate. So it seems the short version of this long story is that you don't need to tune this. But it would be interesting to see some cache usage statistics from various customers if you want to paste yours here.

More recently, we saw the following stats from a customer with a completely different load profile. This imapd had been running about 20 hours:

   cache   misses     hits  minEnts  maxEnts  expires overflow
0x255b70    60200    10960    10000    10000     8132    42068
0x1f4058    65096     6064    10000    10000     5350    49746

The hit rate ( hits/(hits+misses) ) is not good: about 15.4% and 8.5%, respectively. About 18 minutes later, the same process had:

0x255b70    61957    11246    10000    10000     8406    43551
0x1f4058    66990     6213    10000    10000     5535    51455

The delta over ~18 minutes showed a hit rate of about 14.0% and 7.3%:

             1757      286                        274     1483
             1894      149                        185     1709
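Because the counters are cumulative, the hit rate for a monitoring interval has to be computed from the difference between two snapshots. A minimal Python sketch of that arithmetic follows, using the first cache's figures from the two reports above. The function name is an illustrative assumption.

```python
def delta_hit_rate(earlier, later):
    """Hit rate for only the lookups made between two cumulative snapshots."""
    hits = later["hits"] - earlier["hits"]
    misses = later["misses"] - earlier["misses"]
    total = hits + misses
    return hits / total if total else None


if __name__ == "__main__":
    first = {"misses": 60200, "hits": 10960}   # cache 0x255b70, first report
    second = {"misses": 61957, "hits": 11246}  # same cache, ~18 minutes later
    print(f"interval hit rate: {delta_hit_rate(first, second):.1%}")
```

For this cache the interval saw 286 hits and 1757 misses, which works out to 286/2043.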

This customer was experiencing delays logging in to IMAP. So maybe it is appropriate for them to increase either the size or the TTL (or both) depending on how often the same users log in again within a relatively short time. Or perhaps having 8 imapd processes defeats the auth cache. Examining imap log files showed, for example, one user logged in 11 times in 7 minutes and the LDAP access log showed 8 lookups to authenticate that user. During that time, there were only ~3500 different users logging in. Why was that user looked up more than once? Because he was unlucky enough to hit all 8 different imapd processes in which his entry, if it existed, had expired.

So another consideration for tuning the auth cache is the number of processes. Setting the numprocesses too high will defeat the auth cache.
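A simple probability model shows why more processes hurt. The Python sketch below assumes each login lands on a uniformly random process and that nothing expires or is evicted during the window. Both are simplifications, and how connections are really distributed across imapd processes is not something this page establishes. The function name is hypothetical.

```python
def expected_ldap_lookups(logins, processes):
    """Expected LDAP lookups for one user's logins within a single TTL window.

    Each process must look the user up once, the first time it sees them,
    so this is the expected number of distinct processes the logins reach.
    Assumes uniform random placement and no expiry or eviction.
    """
    return processes * (1 - (1 - 1 / processes) ** logins)


if __name__ == "__main__":
    for procs in (1, 2, 4, 8):
        lookups = expected_ldap_lookups(11, procs)
        print(f"{procs} processes: about {lookups:.1f} lookups for 11 logins")
```

Under these assumptions, 11 logins spread over 8 processes already cost about 6 lookups where a single process would need 1. The 8 lookups observed above are consistent with expiry adding to that.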

MTA LDAP caches

  • MTA: dispatcher.cnf LDAP domain, users (groups?) -- positive and negative
  • domain map -- in MTA, store access daemons, and MMP

Cache information can be collected on a per-tcp_smtp_server process basis by running the xsta command, e.g.

bash-2.05$ telnet server 25
Trying 129.158.87.191...
Connected to server.aus.sun.com.
Escape character is '^]'.
220 server.aus.sun.com -- Server ESMTP (Sun Java(tm) System Messaging Server 6.3-6.03 (built Mar 14 2008; 32bit))
xsta
<snip>
250-2.3.0 Alias cache statistics:
250-2.3.0   Hits                0
250-2.3.0   Misses              1
250-2.3.0   Adds                1
250-2.3.0   Deletes             0
250-2.3.0   Timeouts            0
250-2.3.0   Entries             1
250-2.3.0   Percent used        0.100100
250-2.3.0   Percent chains used 0.390625
250-2.3.0   Ave chain length    1.000000
250-2.3.0   Max chain length    1
250-2.3.0
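If you capture the session output to a file, the alias cache counters can be pulled out with a short script. The following Python sketch assumes the lines look exactly like the sample above (a "250-2.3.0" prefix, a label, then a number). Other releases may format the xsta output differently, and the function name is hypothetical.

```python
import re

# A statistic line: response prefix, a label, then a numeric value.
STAT_LINE = re.compile(r"^250-2\.3\.0\s+(.+?)\s+([0-9.]+)\s*$")


def parse_alias_cache_stats(text):
    """Return the alias cache statistics from captured xsta output as a dict."""
    stats = {}
    in_section = False
    for line in text.splitlines():
        if "Alias cache statistics:" in line:
            in_section = True
            continue
        if in_section:
            match = STAT_LINE.match(line)
            if not match:
                break  # first non-statistic line ends the section
            stats[match.group(1)] = float(match.group(2))
    return stats


def alias_hit_rate(stats):
    """hits / (hits + misses), or None if there were no lookups."""
    total = stats.get("Hits", 0) + stats.get("Misses", 0)
    return stats.get("Hits", 0) / total if total else None
```

For the sample session above this yields a hit rate of 0, since the process had served one miss and no hits.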

MMP auth cache

The Messaging Server 6.3 and 7.0 Messaging Multiplexor (MMP) uses the High performance User Lookup and Authentication (HULA) libraries for user authentication. The following dtrace script can be used on 32-bit Solaris 10 installations to determine various statistics about the cache performance.

<mmpcache_stats.d>
#!/usr/sbin/dtrace -qs

pid$target::user_incache:entry
{  
       self->hula_cache = (int)*(int *)copyin(arg0+24,4);
       this->hent = (int)*(int *)copyin((self->hula_cache)+12,4);

       @cache_minEnt[self->hula_cache]  = min(this->hent);
       @cache_maxEnt[self->hula_cache]  = max(this->hent);
       @cache_avgEnt[self->hula_cache]  = avg(this->hent);
}

pid$target::user_incache:return
/arg1 == 0/
{
        @cache_miss[self->hula_cache] = count(); 
}

pid$target::user_incache:return
/arg1 != 0/
{
        @cache_hit[self->hula_cache] = count();
}

profile:::tick-20s
{
        printf ("    cache %8s %8s %8s %8s %8s\n", "minimum", "maximum", "average", "hits", "misses");
        printa ("0x%x %@8d %@8d %@8d %@8d %@8d\n", @cache_minEnt, @cache_maxEnt, @cache_avgEnt, @cache_hit, @cache_miss);
        printf ("\n");
}
</mmpcache_stats.d>

To run this script, provide the PID of the AService process to the script e.g.

./mmpcache_stats.d -p 26741
    cache  minimum  maximum  average     hits   misses
0x839a24c        0        0        0        0        3
0x83968bc        4        4        4        3        0

Statistics are reported once every 20 seconds until you press Ctrl-C. The reporting interval is controlled by the "profile:::tick-20s" line.

In the example above there are two cache addresses: one for the IMAP service and the other for the POP service. The "hits" and "misses" fields are cumulative from the start of the script. A higher hit-to-miss ratio is better; it is determined by the pattern of traffic, the size of the cache, and the TTL (time-to-live) of the cache entries.

The "minimum", "maximum" and "average" fields describe the number of entries in the cache since the startup of the script.

The HULA cache mechanism operates differently from the store auth cache. When a cache lookup is performed, expired entries in the same hash "bucket" are removed, which helps to minimise the overall number of entries in the cache. Therefore, if you find the number of entries approaching the maximum cache size, increase the cache size to maintain a high hit-to-miss ratio.
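The purge-on-lookup behaviour described above can be sketched in Python as follows. This is a toy illustration of the idea only, not the HULA implementation. The class, its methods, and the bucket layout are all hypothetical.

```python
import time


class BucketCache:
    """Toy hash-bucket cache that purges expired entries during lookups."""

    def __init__(self, n_buckets, ttl_seconds, clock=time.monotonic):
        # Each bucket is a list of (key, value, stored_at) tuples.
        self.buckets = [[] for _ in range(n_buckets)]
        self.ttl = ttl_seconds
        self.clock = clock

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def lookup(self, key):
        now = self.clock()
        bucket = self._bucket(key)
        # Drop every expired entry in this bucket, not just the one we want.
        bucket[:] = [e for e in bucket if now - e[2] < self.ttl]
        for k, value, _ in bucket:
            if k == key:
                return value
        return None

    def store(self, key, value):
        bucket = self._bucket(key)
        bucket[:] = [e for e in bucket if e[0] != key]  # replace any old copy
        bucket.append((key, value, self.clock()))

    def __len__(self):
        return sum(len(b) for b in self.buckets)
```

Because stale entries are cleared as a side effect of normal traffic, the entry count in a model like this tracks recently active users. That is why a count near the maximum suggests the cache is too small.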

UWC LDAP cache

To be added.

Convergence LDAP cache

To be added.

Other Performance Tuning Resource