
January 2007

Google, phpBB and session ids

Tuesday, January 30, 2007

Okay, an update on that hungry Googlebot sucking bandwidth. I ended up emailing Google to ask them to slow the bot and got an interesting reply. According to Google it's the session ids that cause 'problems for our robots'. Google refers you to a posting about removing session ids from phpBB.

The only problem is that the information they refer to is from 2004. If you look over at the phpBB forum there is a knowledge base article, "Why doesn't google spider my forum?" This again refers to session ids and has the same code changes, but this one is from 2002.

As it's now 2007, phpBB has changed a bit since 2002 and 2004, so it's not so simple a change to make. Here is a quick summary of what I found and what I did.

First, the change they suggest to sessions.php will not work as expected any more. The line of code they change does not exist any more; it's been broken up. Also, further down the code after their change, the function updates or sets the session data. If it can't do that, it sets the session id to an md5 hash of a random number. Also, the code for the pages is a bit naughty and references both the global $SID and the session_id in the $userdata hash.

My solution to all of this was to declare global $HTTP_SERVER_VARS; at the start of the function session_begin, and then right at the end of the function, before it assigns all of the data to the $userdata hash, overwrite the session_id there.

Therefore the end of the function now looks like this. The first few lines, the Googlebot check, are the new code.



if (strstr($HTTP_SERVER_VARS['HTTP_USER_AGENT'], 'Googlebot')) {
    $session_id = '';
}

$userdata['session_id'] = $session_id;
$userdata['session_ip'] = $user_ip;
$userdata['session_user_id'] = $user_id;
$userdata['session_logged_in'] = $login;
$userdata['session_page'] = $page_id;
$userdata['session_start'] = $current_time;
$userdata['session_time'] = $current_time;
$userdata['session_admin'] = $admin;
$userdata['session_key'] = $sessiondata['autologinid'];

setcookie($cookiename . '_data', serialize($sessiondata), $current_time + 31536000, $cookiepath, $cookiedomain, $cookiesecure);
setcookie($cookiename . '_sid', $session_id, 0, $cookiepath, $cookiedomain, $cookiesecure);

$SID = 'sid=' . $session_id;

return $userdata;

The only problem I can foresee is that the function will already have inserted the session keys into the database earlier on, and they might not get cleaned up. I doubt it, but it's something to check. In a day or two I should be able to see how it goes.

I tested with wget. Without forcing the user agent to Googlebot I got ?sid= with real ids in just about every URL in a page. With the user agent specified, the same GET of a page had an empty ?sid= value. So it looks like a go.
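To show why the bot's links come out with an empty ?sid= rather than no parameter at all, here is a minimal standalone sketch. The helper names pick_session_id and add_sid are made up for illustration and are not phpBB functions; add_sid only imitates the general idea of tacking the sid onto a URL, and the browser user agent string is just an example.

```php
<?php
// Hypothetical helpers for illustration only; these are not phpBB functions.

// Return an empty session id for Googlebot, otherwise keep the real one.
function pick_session_id($user_agent, $session_id)
{
    // A plain substring test, same idea as the strstr() check in session_begin.
    if (strpos($user_agent, 'Googlebot') !== false) {
        return '';
    }
    return $session_id;
}

// Append "sid=<id>" to a URL, using "?" or "&" depending on what is already there.
function add_sid($url, $session_id)
{
    $separator = (strpos($url, '?') === false) ? '?' : '&';
    return $url . $separator . 'sid=' . $session_id;
}

$bot_ua = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';
$browser_ua = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) Firefox/2.0';
$real_id = md5('example'); // stand-in for a real session id

// The bot gets a link ending in an empty "sid=", the browser gets the full id.
echo add_sid('viewtopic.php?t=1', pick_session_id($bot_ua, $real_id)), "\n";
echo add_sid('viewtopic.php?t=1', pick_session_id($browser_ua, $real_id)), "\n";
```

Because the sid parameter is still appended, the URL is not perfectly clean, but every visit by the bot now produces the same URL instead of a fresh one per session.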

Hungry Googlebot

Tuesday, January 09, 2007

Wow, that Googlebot can get hungry. I admin a site and forum for my 4WD club and the Google crawler has been sucking up the bandwidth big time. Over 60% of the bandwidth for the month is from Google. If you do a search for "excessive bandwidth usage by googlebot" it turns out there are a few people who are having the problem.

The forum is phpBB, so there is a lot of cruft around the messages which has to get sent to Google with every page that they request, even though the content is only small. Some clever person should write some code that determines if the user agent is a bot and returns only a very simple version of the page, with the key content and some simple links for it to follow.
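For what it's worth, a rough sketch of that idea might look something like the following. Everything here is an assumption for illustration: the function names is_bot and simple_page are invented, the list of crawler markers is far from complete, and none of it is wired into phpBB.

```php
<?php
// Hypothetical sketch: none of these names come from phpBB.

// Substrings a few well-known crawlers put in their user agent.
// This list is an assumption and far from complete.
$bot_markers = array('Googlebot', 'Slurp', 'msnbot');

// True if the user agent contains any of the markers (case-insensitive).
function is_bot($user_agent, $markers)
{
    foreach ($markers as $marker) {
        if (stripos($user_agent, $marker) !== false) {
            return true;
        }
    }
    return false;
}

// Build a bare-bones page: title, post text, and plain links to follow.
function simple_page($title, $posts, $links)
{
    $html = '<html><head><title>' . htmlspecialchars($title) . "</title></head><body>\n";
    foreach ($posts as $post) {
        $html .= '<p>' . htmlspecialchars($post) . "</p>\n";
    }
    foreach ($links as $url => $text) {
        $html .= '<a href="' . htmlspecialchars($url) . '">' . htmlspecialchars($text) . "</a>\n";
    }
    return $html . "</body></html>\n";
}

// Made-up sample content; a real forum would pull this from the database.
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (is_bot($ua, $bot_markers)) {
    echo simple_page('Trip report', array('We left at 6am.'), array('viewtopic.php?t=2' => 'Next topic'));
}
```

Normal visitors would fall through to the usual full page; only requests whose user agent matches a marker get the stripped-down version.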

Well hopefully it will finish indexing soon ...
