By default, Geneweb asks robots to abstain from indexing the pages it generates. I wanted to:

  • Make the content of my genealogy database indexable by search engines.
  • Avoid putting the host under excessive CPU load from spider visits.
  • Keep the spiders from getting lost in the infinite navigation that Geneweb produces.

It is the special “non-person” pages (such as an ascendant tree) that are the most computationally intensive. It is also these pages that make the navigation infinite. So the functional constraints condense into one technical requirement: let spiders index the person pages, and keep them away from every other kind of page.

The first step was therefore to bypass the robots.txt generated by Geneweb. I use gwd in ‘server mode’ behind an Apache vhost (mod_rewrite and mod_proxy), so all I had to do was add a ProxyPass directive that masks Geneweb's robots.txt with my own:

RewriteEngine On
# Apache applies ProxyPass directives in order and the first match wins,
# so the more specific /robots.txt rule must come before the catch-all.
ProxyPass /robots.txt http://www.bensaude.org/robots.txt
ProxyPass / http://kivu.grabeuh.com:2317/
ProxyPassReverse / http://kivu.grabeuh.com:2317/
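
The replacement robots.txt itself is not reproduced here; assuming its only job is to lift Geneweb's blanket ban and let the per-page meta tags (next step) do the fine-grained filtering, a minimal sketch would be:

# Allow all crawlers to fetch every URL (an empty Disallow permits everything)
User-agent: *
Disallow:

Spiders that honour the Crawl-delay directive can additionally be slowed down with it if CPU load becomes a concern.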

But that was not enough, because Geneweb embeds a <meta name="ROBOTS" content="NONE"/> tag in each page it generates. Geneweb provides a separate template for each class of page. I guessed that etc/perso.txt is the template for what I call the “person page” and removed the <meta name="ROBOTS" content="NONE"/> line from it.
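
If you are unsure which templates carry the tag, a quick search over Geneweb's template directory narrows it down. This is only a sketch: the path below is illustrative and depends on where Geneweb is installed.

# List the Geneweb templates that contain the robots meta tag
# (adjust the path to your installation's etc/ directory).
grep -ril 'name="ROBOTS"' /usr/share/geneweb/etc/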

And that was it: the person pages no longer repel friendly spiders, while the other pages remain off limits.

I love Geneweb!