Climbing Out of Supplemental Result Hell
Tuesday, 29 May 2007
Apparently, Joomla and Mambo aren't good at staying out of Google's supplemental index (appropriately nicknamed "Google Hell"), especially if an unaware webmaster enables the PDF and print icon options. I was one of those webmasters: of my 700-800 links in Google's index, 699 were supplemental, meaning they are displayed at the very back of the search results and are rarely scanned by the Google spider. The usual causes are large amounts of duplicate content, lots of "nofollow" links, and links that appear to have been purchased for PageRank purposes. Google sees all of these as "spammy," a characteristic of the recently banned "Made For Adsense" sites. There are probably other reasons a site may be cast into this terrible place, but they are unknown, as Google likes to keep its secrets.

The Joomla/Mambo-specific problem is mainly the print and PDF functions. As Google's spiders crawl the printable and PDF versions of your pages, they see them as duplicate "spammy" content. Thus, both the PDF and its original content page are sent to Google Hell. To fix this problem, you need to do a few things:

First, and most simply, hide the print and PDF links: go into your Global Configuration area, select the Content tab, and choose "Hide" for the PDF and Print icons. From now on, as Google scans your articles, it won't see those links.



The problem now is that a lot of your content is still in Google's index as "duplicate content" in PDF form. To fix this, you have to give Google an error the next time it scans the PDF's location. To do so, add the following to your main .htaccess file:

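# These rules need mod_rewrite, with "RewriteEngine On" earlier in the file.
# The [G] flag answers every matching request with "410 Gone".

# Getting Rid of PDF Files
RewriteCond %{QUERY_STRING} ^option=com_content&do_pdf=
RewriteRule ^index2\.php$ - [G]

# Getting Rid of print preview Files
RewriteCond %{QUERY_STRING} ^option=content&task=view&id=
RewriteRule ^index2\.php$ - [G]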

Note that this completely disables the ability to view PDF and print versions of your pages.
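Before uploading the rules, it can help to check which requests they would actually catch. The following is only a rough Python sketch that reuses the two query-string patterns and the index2.php pattern from the rules above; the function name status_for and the sample query strings are made up for illustration, and Python's re module is standing in for Apache's own regex engine.

```python
import re

# Patterns copied from the .htaccess rules above.
PDF_COND = re.compile(r"^option=com_content&do_pdf=")
PRINT_COND = re.compile(r"^option=content&task=view&id=")
# In .htaccess context the leading slash is stripped before matching.
RULE_PATTERN = re.compile(r"^index2\.php$")


def status_for(path, query):
    """Return 410 if the rules would answer "Gone", else None.

    Hypothetical helper: it only imitates the two rule blocks above,
    it does not run Apache.
    """
    if not RULE_PATTERN.search(path):
        return None
    if PDF_COND.search(query) or PRINT_COND.search(query):
        return 410
    return None


if __name__ == "__main__":
    # Sample requests (made up for illustration).
    samples = [
        ("index2.php", "option=com_content&do_pdf=1&id=12"),
        ("index2.php", "option=content&task=view&id=12&pop=1"),
        ("index2.php", "option=com_content&task=view&id=12"),
        ("index.php", "option=com_content&do_pdf=1&id=12"),
    ]
    for path, query in samples:
        print(path, query, status_for(path, query))
```

Note that a query string starting with option=com_content&task=view does not match the second condition as written, so it is worth checking which form your own print links use.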

It is also a good idea to have a robots.txt file to tell the Googlebot which directories and pages it can and cannot index. Create a file named robots.txt, paste the rules below inside, then save it and upload it to your site's root directory.

User-agent: googlebot
Disallow: /index.php
Disallow: /index2.php?option=com_content&task=view
Disallow: /administrator/
Disallow: /cache/
Disallow: /components/
Disallow: /editor/
Disallow: /help/
Disallow: /images/
Disallow: /includes/
Disallow: /language/
Disallow: /mambots/
Disallow: /media/
Disallow: /modules/
Disallow: /templates/
Disallow: /installation/
Disallow: /login/

This is a good security measure too, as it keeps Google from indexing your core files, as well as your administration files. Note: Line 2 of the above example (Disallow: /index.php) assumes you use a Search Engine Friendly URL component. The third line is a preventative measure to keep Google from even looking for PDF versions of your content in the future.
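If you want to sanity-check rules like these before uploading them, Python's standard urllib.robotparser module can read them. This is just a sketch: it uses a shortened copy of the rules above, and the sample paths (including the SEF-style one) are invented. Python's parser does simple prefix matching, so Google's own handling may differ in the details.

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of the rules above.
ROBOTS_TXT = """\
User-agent: googlebot
Disallow: /index.php
Disallow: /index2.php?option=com_content&task=view
Disallow: /administrator/
Disallow: /cache/
"""


def make_parser(text):
    """Build a parser from robots.txt text instead of fetching a URL."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = make_parser(ROBOTS_TXT)

if __name__ == "__main__":
    # Sample paths are hypothetical.
    for path in [
        "/content/view/12/34/",
        "/index.php?option=com_content&task=view&id=5",
        "/administrator/index.php",
    ]:
        print(path, parser.can_fetch("Googlebot", path))
```

Since the rules are listed under "User-agent: googlebot", other crawlers are not affected by them at all.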

Some of the other significant steps I took include the following:

Keep your meta description accurate, and don't stuff it with information that isn't relevant to your site. Always write a separate meta description and keywords for each content item; both can be found on a tab in the editor. Keep your global site keywords short (10 or fewer), and only use keywords related to your site.

Have a good sitemap (I use Joomap, which automatically generates a Google sitemap). You can also enable automatic sitemap discovery by adding the following as the first line of your robots.txt file:

Sitemap: http://[URL TO SITEMAP PROVIDED BY JOOMAP OR OTHER MEANS]

Enable dynamic page titles, as this makes your site look even less "spammy."

If you don't use a Search Engine Friendly extension, you should consider one, as short URLs with keywords inside of them are more attractive to Google and human readers alike.

If you have added extra directories to your site's root, you should disallow them in robots.txt to prevent more supplemental result problems.
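For the sitemap step above, here is a small Python sketch that puts a Sitemap line at the top of some robots.txt text and then reads it back with the standard urllib.robotparser module (site_maps() needs Python 3.8 or newer). The sitemap URL and the helper name add_sitemap_line are placeholders, not something Joomap provides.

```python
from urllib.robotparser import RobotFileParser

# Placeholder URL; use the one your sitemap component gives you.
SITEMAP_URL = "http://example.com/sitemap.xml"


def add_sitemap_line(robots_text, sitemap_url):
    """Return robots.txt text with the Sitemap directive as its first line."""
    return "Sitemap: " + sitemap_url + "\n" + robots_text


robots = add_sitemap_line("User-agent: googlebot\nDisallow: /cache/\n", SITEMAP_URL)

parser = RobotFileParser()
parser.parse(robots.splitlines())

if __name__ == "__main__":
    print(robots)
    print(parser.site_maps())
```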

One strange thing I had to do comes from my situation being different from that of most Mambo and Joomla admins. The SEF functions in Mambo's core are changing between versions 4.5.5 and 4.6, so I am currently stuck on version 4.5.5 because I use the antiquated SEF404 component. This extension writes my SEF URLs, and it took some hacking to get it to work with my current version of Mambo. The major problem is that it isn't compatible with the sef_ext.php file provided for Remository, so I have SEF URLs for Remository turned off. Google will index this type of long URL, but such URLs are blocked by the above example unless you add the following lines:


Allow: /index.php?option=com_remository
Disallow: /index.php?option=com_remository&itemid=*&func=startdown


The first line allows Google to spider index.php pages that belong to the Remository component, so that my download pages and information are indexed. To prevent supplemental result Hell, the second line keeps the spiders away from pages that use the "startdown" function, which would send the spider straight to the file to be downloaded.
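To see how these two lines interact with the earlier "Disallow: /index.php" rule, here is a rough Python sketch of the matching as I understand Google describes it: the longest matching rule wins, ties go to Allow, and * matches any run of characters. The function names and sample URLs are invented, and this is an approximation rather than Google's actual code. Matching is case-sensitive here, so the samples use the same lowercase itemid as the rule.

```python
import re


def pattern_to_regex(path):
    """Turn a robots.txt path into a regex ('*' is a wildcard, '$' anchors the end)."""
    anchored = path.endswith("$")
    if anchored:
        path = path[:-1]
    body = ".*".join(re.escape(part) for part in path.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_allowed(rules, url_path):
    """Longest matching rule wins; on a tie, Allow wins; no match means allowed."""
    best_len = -1
    allowed = True
    for directive, path in rules:
        if pattern_to_regex(path).match(url_path):
            is_allow = directive.lower() == "allow"
            if len(path) > best_len or (len(path) == best_len and is_allow):
                best_len = len(path)
                allowed = is_allow
    return allowed


# The general index.php rule plus the two Remository lines above.
RULES = [
    ("Disallow", "/index.php"),
    ("Allow", "/index.php?option=com_remository"),
    ("Disallow", "/index.php?option=com_remository&itemid=*&func=startdown"),
]

if __name__ == "__main__":
    # Sample URLs are hypothetical.
    for url in [
        "/index.php?option=com_content&task=view&id=5",
        "/index.php?option=com_remository&itemid=5&func=fileinfo&id=9",
        "/index.php?option=com_remository&itemid=5&func=startdown&id=9",
    ]:
        print(url, is_allowed(RULES, url))
```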

That's it! You may want to look at my .htaccess example now, as it has some good security and antispam measures to save your site, as well as your bandwidth!



© Matt Parnell's Brain: Plugged In!