Google, phpBB and session ids

Tuesday, January 30, 2007

Okay, an update on that hungry Googlebot sucking bandwidth. I ended up emailing Google to ask them to slow the bot and got an interesting reply. According to Google it's the session ids that cause 'problems for our robots'. Google refers you to a posting about removing session ids from phpBB.

Only problem is that the information they refer to is from 2004. If you look over at the phpBB forum, there is a knowledge base article, "Why doesn't google spider my forum?" This again refers to session ids and has the same code changes, but this one is from 2002.

As it's now 2007, phpBB has changed a bit since 2002 and 2004, so it's not so simple a change to make. Here is a quick summary of what I found and what I did.

First, the change they suggest to sessions.php will not work as expected any more. The line of code they change no longer exists; it's been broken up. Also, further down in the code after their change, the function updates or sets the session data. If it can't do that, it sets the session id to an md5 hash of a random number. The code for the pages is also a bit naughty: it references both the global $SID and the session_id in the $userdata hash.

My solution to all of this was to declare global $HTTP_SERVER_VARS; at the start of the function session_begin, and then, right at the end of the function, just before it assigns all of the data to the $userdata hash, overwrite the session id there.

Therefore the end of the function now looks like this; the if block at the top is the new code.



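// New: hand Googlebot an empty session id so its URLs carry no real sid.
// The isset() guard avoids a notice when no User-Agent header is sent.
if (isset($HTTP_SERVER_VARS['HTTP_USER_AGENT']) && strstr($HTTP_SERVER_VARS['HTTP_USER_AGENT'], 'Googlebot')) {
    $session_id = '';
}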

$userdata['session_id'] = $session_id;
$userdata['session_ip'] = $user_ip;
$userdata['session_user_id'] = $user_id;
$userdata['session_logged_in'] = $login;
$userdata['session_page'] = $page_id;
$userdata['session_start'] = $current_time;
$userdata['session_time'] = $current_time;
$userdata['session_admin'] = $admin;
$userdata['session_key'] = $sessiondata['autologinid'];

setcookie($cookiename . '_data', serialize($sessiondata), $current_time + 31536000, $cookiepath, $cookiedomain, $cookiesecure);
setcookie($cookiename . '_sid', $session_id, 0, $cookiepath, $cookiedomain, $cookiesecure);

$SID = 'sid=' . $session_id;

return $userdata;

The only problem I can conceive of is that the function will already have inserted the session keys into the database earlier on, and they might not get cleaned up. I doubt it, but it's something to check. In a day or two I should be able to see how it goes.

I tested with wget. Without forcing the user agent to Googlebot I got ?sid= with real ids in just about every URL in a page. With the user agent specified, the same GET of a page had an empty ?sid= value. So it looks like a go.
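For anyone who wants to repeat that check without eyeballing the HTML, here is a rough sketch that counts the links in a saved page that still carry a real session id. The function name count_session_ids is made up for this example, and the pattern assumes a session id is always a 32-character md5-style hex string, which I have not confirmed for every phpBB version.

```php
<?php
// Hypothetical helper (not part of phpBB): count the URLs in a chunk of HTML
// that carry a non-empty sid value.
// Assumption: a real session id is a 32-character hex string (md5 style).
function count_session_ids($html)
{
    // Matches ?sid=..., &sid=... and the HTML-escaped &amp;sid=... forms.
    $pattern = '/[?&](?:amp;)?sid=[0-9a-f]{32}/i';

    // preg_match_all() returns the number of full pattern matches.
    return preg_match_all($pattern, $html, $matches);
}
```

Feed it the page that wget saved for each user agent: the normal fetch should give a count well above zero, and the Googlebot fetch should give zero, since an empty ?sid= does not match the pattern.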

Hungry Googlebot

Tuesday, January 09, 2007

Wow, that Googlebot can get hungry. I admin a site and forum for my 4WD club and the Google crawler has been sucking up the bandwidth big time. Over 60% of the bandwidth for the month is from Google. If you do a search for "excessive bandwidth usage by googlebot" it turns out there are a few people who are having the problem.

The forum is phpBB, so there is a lot of cruft around the messages which has to get sent to Google with every page that they request, even though the content itself is only small. Some clever person should write some code that determines if the browser agent is a bot and returns only a very simple version of the page, with the key content and some simple links for it to follow.
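To give a feel for what that might look like, here is a minimal sketch. Both function names and the list of user agent substrings are my own assumptions for illustration; nothing here comes from phpBB, and a real list of crawlers would be much longer.

```php
<?php
// Hypothetical helper: guess whether a User-Agent string belongs to a crawler.
// The substring list is illustrative only, not a complete registry of bots.
function is_probable_bot($userAgent)
{
    $needles = array('googlebot', 'msnbot', 'slurp', 'crawler', 'spider');
    $ua = strtolower((string) $userAgent);

    foreach ($needles as $needle) {
        if (strpos($ua, $needle) !== false) {
            return true;
        }
    }
    return false;
}

// Hypothetical renderer: a stripped-down page with just a title, the message
// text and plain links for the crawler to follow. $links maps URL => label.
function render_minimal_page($title, $body, array $links)
{
    $html  = '<html><head><title>' . htmlspecialchars($title) . '</title></head><body>';
    $html .= '<p>' . htmlspecialchars($body) . '</p>';

    foreach ($links as $url => $label) {
        $html .= '<a href="' . htmlspecialchars($url) . '">' . htmlspecialchars($label) . '</a> ';
    }

    return $html . '</body></html>';
}
```

The idea would be to call is_probable_bot() with the request's user agent early on and, if it returns true, send the output of render_minimal_page() instead of the full forum template. Matching on substrings is crude and easy to fool, but the worst case is only that a visitor gets the plain page.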

Well, hopefully it will finish indexing soon ...

DNS mystery

Wednesday, November 29, 2006

DNS has always been an interest of mine. I read all of O'Reilly's DNS book and joined AUDA when it was founded in Australia. Still, a lot of people just don't get it ...

I stumbled across a good DNS checker site, http://www.dnsreport.com/. You type in your domain name and it does a full test of everything, especially the MX records and mail servers. In these times of increasing spam, the setup of your MX is becoming really important, as many providers are becoming so pedantic about everything being just right.

It's because of MX problems that I found this site. My provider, godaddy.com, has some nasty email requirements. Sometimes people get a "553 Bogus helo" error when emailing people hosted at GoDaddy. It can be hard to get this fixed.

Blast from the past

Sunday, November 26, 2006

Well, it's amazing the things that pop up. I stumbled across a cache of an old page which I had created against Outlook. It was linked to off Bugzilla and got a heap of hits as a result. I think I have a copy sitting around on disk somewhere, so I might dig out the version with the images to keep it up for posterity!

Wow, where have those last 5 years gone!

Getting Television Right


I am not a big fan of television. It can waste a lot of time when one could be doing something so much better. Yet there are some good things to watch. Years ago I saw the TiVo in America. It's a VCR on steroids. Essentially it records to hard disk rather than tape and manages your recording schedule. It took a while for such technology to make it to Australia, but eventually it did.

Introducing the Topfield set-top box, which is a hard disk TV recorder that works in Australia. The good thing is that it's a digital tuner, so you get great picture quality. There are two tuners inside, so you can watch something whilst recording something else, or record two programs at the same time.

Now the big thing that made the TiVo such a popular product was its integrated TV guide. The Toppy (the common name for Topfield units) does not have a program guide service included. Two solutions are available. The first is IceTV, which is a commercial solution; that means you get reliability and accuracy. The second is TEDS, which is a free product (you can get some extra features at a cost) and works just as well.

With an up-to-date program guide, two tuners and a massive amount of record time, a Toppy is a great way to watch TV. Record the regular shows you want to watch. Watch them when it's convenient and skip the ads.

Does a Toppy mean you watch more TV? I have not found so. What it does do is allow you to watch better shows, rather than just watching whatever rubbish is on when you want to watch something.
