November 2005

Systems and Travels · 29 Nov 2005 at 14:23 by Jean-Marc Liotier

My real-time earth view used to feature only a view centered on Europe, the Middle East, Africa and the Atlantic. I now also provide an Asia-centered one and another centered on the Americas.

As before, views are recalculated every few minutes, cloud cover is updated eight times a day, and the daylight background map is NASA’s Blue Marble’s monthly map, automatically rotated into place on the first day of each month.
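Such a schedule is typically driven from cron ; a hypothetical crontab matching the cadence described above (all script names and paths are invented for illustration) could read :

```crontab
# Every ten minutes : recompute the views ("every few minutes")
*/10 * * * *  /usr/local/bin/render-earth-views
# Every three hours : refresh the cloud cover (eight times a day)
0 */3 * * *   /usr/local/bin/fetch-cloud-cover
# First day of each month : swap in the new Blue Marble background
15 0 1 * *    /usr/local/bin/rotate-blue-marble
```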

The available views and resolutions :

Systems · 05 Nov 2005 at 17:24 by Jean-Marc Liotier

While looking for a way to remove the <meta name="ROBOTS" content="NONE"/> meta tag from some of the pages produced by Geneweb, I stumbled upon a relatively new tool with interesting potential – mod_publisher :

Mod_publisher turns the URL mapping of mod_proxy_html into a general-purpose text search and replace. Whereas mod_proxy_html applies rewrites to HTML URLs, and in version 2 extends that to other contexts where a link might occur, mod_publisher extends it further to allow parsing of text wherever it can occur.

Unlike mod_proxy_html there is no presumption of the rewrites serving any particular purpose – this is entirely up to the user. This means we are potentially parsing all text in a document, which is a significantly higher overhead than mod_proxy_html. To deal with this, we provide fine-grained control over what is or isn’t parsed, replacing the simple ProxyHTMLExtended with a more general MLRewriteOptions directive.

My feeling is that the authors are considerably understating how much CPU this thing is going to cost. Production-minded people were certainly cringing at that thought while reading the description, but I foresee immense power for hacks of last resort.

Systems · 05 Nov 2005 at 17:05 by Jean-Marc Liotier

By default Geneweb asks robots to abstain from indexing the pages it generates. I wanted to :

  • Make the content of my genealogy database indexable by search engines.
  • Avoid putting the host under too much CPU load resulting from visits by spiders.
  • Keep the spiders from getting lost in the infinite navigation that Geneweb produces.

It is the special “non-person” pages (such as an ascendant tree) that are the most computationally intensive. It is also these pages that make the navigation infinite. So the functional constraints can be condensed into the following technical ones :

  • Let the spiders index the “person” pages.
  • Keep them away from every other class of page.

The first step was therefore to bypass the robots.txt generated by Geneweb. I use gwd in ‘server mode’ behind an Apache vhost with mod_rewrite, so all I had to do was add a directive to shadow Geneweb’s robots.txt with mine :

RewriteEngine On
ProxyPass /
ProxyPassReverse /
ProxyPass /robots.txt
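Spelled out in full – and assuming gwd listens in server mode on localhost with its default port 2317, which is a guess to adjust to your own setup – the vhost fragment would look something like this :

```apache
# Assumes gwd in server mode on localhost:2317 (its default port) -
# adjust host and port to your own installation.
RewriteEngine On
# The exclusion must come before the catch-all ProxyPass, since
# mod_proxy applies the first matching rule ; the '!' tells it not
# to forward robots.txt, so Apache serves the local file instead.
ProxyPass /robots.txt !
ProxyPass / http://localhost:2317/
ProxyPassReverse / http://localhost:2317/
```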

But that was not enough, because Geneweb embeds a <meta name="ROBOTS" content="NONE"/> tag into each page it generates. Geneweb provides a separate template for each page class. I guessed that etc/perso.txt is the template for what I call the “person page” and removed the <meta name="ROBOTS" content="NONE"/> line from it.
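For those who prefer not to edit the template by hand, a one-line sed does the trick ; a minimal sketch, demonstrated on a stand-in file since the template path varies between Geneweb installations :

```shell
# Demonstrate on a temporary stand-in for etc/perso.txt, since the
# real template path depends on the Geneweb installation.
TEMPLATE=$(mktemp)
cat > "$TEMPLATE" <<'EOF'
<head>
<meta name="ROBOTS" content="NONE"/>
<title>Person page</title>
</head>
EOF
# Delete the robots meta tag line from the template.
sed -i '/<meta name="ROBOTS" content="NONE"/d' "$TEMPLATE"
# Show the cleaned template.
cat "$TEMPLATE"
```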

And that was it : the person pages no longer repel friendly spiders, while the other pages remain off limits.

I love Geneweb !

Systems · 02 Nov 2005 at 2:24 by Jean-Marc Liotier

The latest addition to the collection of lame scripts I wrote and put online completely automates the trivial yet tedious task of producing batch Awstats reports from Apache logs, with full history, even where multiple vhosts coexist.

And the icing on the cake is that it does it quite efficiently : it always updates the reports for the current month and the current year, but only produces other reports if they do not already exist. To force the regeneration of a report, simply erase it.
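That update policy can be sketched as follows ; this is not the actual script, just an illustration of the logic, with build_report standing in for the real awstats.pl invocation and all names and paths invented (past years would be handled analogously) :

```shell
#!/bin/sh
# Sketch : always rebuild the current month's report, rebuild other
# months only when their report is missing.
OUTDIR=${OUTDIR:-$(mktemp -d)}    # where the HTML reports land
VHOSTS=${VHOSTS:-example.org}     # one Awstats config per vhost
CUR_YEAR=$(date +%Y)
CUR_MONTH=$(date +%m)

build_report() {                  # args: vhost year month
    echo "building report for $1, $2-$3"
    # Stand-in for : awstats.pl -config=$1 -month=$3 -year=$2 -output
    touch "$OUTDIR/awstats.$1.$2-$3.html"
}

for vhost in $VHOSTS; do
    # The current month is always refreshed...
    build_report "$vhost" "$CUR_YEAR" "$CUR_MONTH"
    # ...while any other report is produced only if it does not exist
    # yet, so erasing one forces its regeneration on the next run.
    for month in 01 02 03 04 05 06 07 08 09 10 11 12; do
        report="$OUTDIR/awstats.$vhost.$CUR_YEAR-$month.html"
        [ -e "$report" ] || build_report "$vhost" "$CUR_YEAR" "$month"
    done
done
```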

If a user wishes to control access to a report for a vhost, he must create a .htaccess file named /etc/awstats/
This file will be automatically detected and used. This is dead simple, and it just works.

Grab the code ! It is in production on this very server, as the sample output testifies.