Tuesday, January 30, 2007

Google, phpBB and session ids

Okay, an update on that hungry Googlebot sucking bandwidth. I ended up emailing Google to ask them to slow the bot down and got an interesting reply. According to Google, it's the session ids that cause 'problems for our robots'. Google refers you to a posting about removing session ids from phpBB.

The only problem is that the information they refer to is from 2004. If you look over at the phpBB forum, there is a knowledge base article, "Why doesn't google spider my forum?" It again refers to session ids and has the same code changes, but this one is from 2002.

It's now 2007 and phpBB has changed a bit since 2002 and 2004, so it's not such a simple change to make. Here is a quick summary of what I found and what I did.

First, the change they suggest to sessions.php will not work as expected any more. The line of code they change no longer exists; it has been broken up. Further down, after the point of their change, the function updates or sets the session data, and if it can't do that it sets the session id to an md5 hash of a random number. The code for the pages is also a bit naughty: it references both the global $SID and the session_id in the $userdata hash.

My solution to all of this was to declare global $HTTP_SERVER_VARS; at the start of the function session_begin, and then, right at the end of the function, just before it assigns all of the data to the $userdata hash, to overwrite the session id there.

The end of the function now looks like this. The if block at the top is the new code.


// Googlebot gets an empty session id, so the URLs it sees carry no real sid.
if (strstr($HTTP_SERVER_VARS['HTTP_USER_AGENT'], 'Googlebot')) {
    $session_id = '';
}

$userdata['session_id'] = $session_id;
$userdata['session_ip'] = $user_ip;
$userdata['session_user_id'] = $user_id;
$userdata['session_logged_in'] = $login;
$userdata['session_page'] = $page_id;
$userdata['session_start'] = $current_time;
$userdata['session_time'] = $current_time;
$userdata['session_admin'] = $admin;
$userdata['session_key'] = $sessiondata['autologinid'];

setcookie($cookiename . '_data', serialize($sessiondata), $current_time + 31536000, $cookiepath, $cookiedomain, $cookiesecure);
setcookie($cookiename . '_sid', $session_id, 0, $cookiepath, $cookiedomain, $cookiesecure);

$SID = 'sid=' . $session_id;

return $userdata;

The only problem I can think of is that the function will already have inserted the real session keys into the database earlier on, and they might not get cleaned up. I doubt it, but it's something to check. In a day or two I should be able to see how it goes.

I tested with wget. Without forcing the user agent to Googlebot, I got ?sid= with real ids in just about every URL in a page. With the user agent specified, the same GET of a page had an empty ?sid= value. So it looks like a go.
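
The same behaviour can also be checked away from the forum. What follows is only a minimal sketch, not phpBB code: the helper names is_googlebot and build_sid, the two sample user agent strings and the placeholder session id are all made up for illustration. It mirrors the check added to session_begin and shows how the resulting sid= string differs between an ordinary browser and Googlebot.

```php
<?php
// Hypothetical helper (not part of phpBB): true when the User-Agent
// string contains "Googlebot". The match is case-sensitive, like strstr().
function is_googlebot($user_agent)
{
    // strpos() returns 0 for a match at position 0, so compare against false.
    return strpos($user_agent, 'Googlebot') !== false;
}

// Hypothetical helper: builds the string that ends up in $SID,
// blanking the session id for Googlebot as in the change above.
function build_sid($session_id, $user_agent)
{
    if (is_googlebot($user_agent)) {
        $session_id = '';
    }
    return 'sid=' . $session_id;
}

// Sample values, assumed for the demonstration only.
$browser_ua = 'Mozilla/5.0 (X11; Linux i686)';
$google_ua  = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';
$session_id = md5('example'); // placeholder id, not how phpBB generates one

echo build_sid($session_id, $browser_ua), "\n"; // sid= followed by the 32-character id
echo build_sid($session_id, $google_ua), "\n";  // sid= with nothing after it
```

Because the match is a plain substring test, any client that puts Googlebot in its user agent gets the empty id, which is exactly what makes the wget test possible.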
