I have just, maybe, apparently (inch’Allah!), just about (I am definitely unsure…) got rid of (“solved” would probably be too strong a word for that haphazard process) a very vexing problem that had kept me wondering for weeks.

On an otherwise very healthy Debian Linux host I saw idle /USR/SBIN/CRON processes accumulate by the hundreds, at a rate of a few every few minutes. After some time they induced significant load, although some of them eventually died. Killing them was only a temporary remedy, as they kept reappearing. I could link their appearance neither to specific cron jobs nor to a specific command. Hours of sifting through forums and mailing lists yielded nothing conclusive: /USR/SBIN/CRON processes that fail to terminate are not unheard of, but their causes seem to be varied and most often quite mysterious.

Liberal use of strace with various combinations of ‘-p’, ‘-f’, ‘-F’ and ‘-ff’, attaching to the running cron daemon process and following vforks, showed that the undead processes were left listening on an open connection. I also observed that an attached strace inhibited the spawning of the stray /USR/SBIN/CRON processes: in the presence of strace the children did receive their missing SIGSTOP. Sometimes days went by with no manifestation of the dreaded processes, but as soon as I thought the problem was solved they began to reappear…

Anyway, finding that the undead processes were left listening on an open connection gave me the smelly trail I was looking for. ‘netstat -p | grep tcp | grep CRON’ soon showed me that each one of them had an open connection to the local LDAP server. Then ‘lsof | grep cron | grep ldap’ hinted that it was not the cron process itself that was connecting directly to the LDAP server but an underlying library involved in our PAM LDAP user management system.
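For anyone who would rather count the stuck processes than squint at grep output, here is a rough Python sketch of the same netstat check. It is only an illustration, not what I actually ran: the function name cron_ldap_pids is made up, it assumes the numeric output of ‘netstat -tnp’ (run as root so that the PID/Program column is filled in), and it assumes slapd listens on the default LDAP port 389.

```python
LDAP_PORT = "389"  # assumption: slapd listens on the default LDAP port


def cron_ldap_pids(netstat_output, port=LDAP_PORT):
    """Return the sorted PIDs of CRON processes holding a TCP connection to `port`.

    `netstat_output` is expected to be the text printed by `netstat -tnp`.
    """
    pids = set()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected columns: proto, recv-q, send-q, local, foreign, state, pid/program
        if len(fields) < 7 or not fields[0].startswith("tcp"):
            continue  # skips the header lines and anything that is not TCP
        foreign, owner = fields[4], fields[6]
        pid, _, program = owner.partition("/")
        connected_to_ldap = foreign.rsplit(":", 1)[-1] == port
        is_cron = program.upper().startswith("CRON")
        if connected_to_ldap and is_cron and pid.isdigit():
            pids.add(int(pid))
    return sorted(pids)
```

Feeding it the captured netstat output every few minutes and logging the length of the returned list should show whether the undead processes are piling up again.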

Armed with those new results I went hunting for some wild data and found a discussion between Robert Rakowicz and Jerome Reinert about a somewhat similar problem. But the maintenance operations Jerome Reinert suggested on slapd’s Berkeley DB database did not solve the problem.

I have now read another post mentioning that version mismatches and assorted maintenance issues in slapd’s Berkeley DB database can cause a similar problem. I can’t find the address anymore, but if I do I’ll post it here. We found that a simple slapd restart rid us of the undead /USR/SBIN/CRON processes. It has been a few days and I have not seen one again… We keep our fingers crossed. Maybe an upgrade silently fixed the problem…

Meanwhile I posted this to debian-user just in case someone there recognizes this problem as something familiar…

Since then I have seen the problem appear again, and restarting slapd temporarily fixed it. I am using slapd 2.2.26 from Debian. Maybe I should upgrade to 2.3.23: although it is available through Debian Unstable, it was released upstream a year ago, so maybe I should trust it…