Robots.txt Http and Https
by admin on Mar.16, 2008, under SEO, Software, Systems
One of the big things a lot of people are doing these days is SEO, or search engine optimization. I am not planning on writing a comprehensive guide to SEO right now, but as I run into SEO tasks in my own job I will post them here.
One of the more recent tasks I had was setting up a different robots.txt for our http site vs our https site. Both serve the same content, but Google indexes them as two separate copies and can then penalize your site for duplicate content. You can set up your robots.txt file to exclude the files and directories you want from the bots, but that doesn’t help with http vs https: both protocols use the same robots.txt file, so you still end up with duplicate content in Google’s index.
Here is how to solve this problem when you are running IIS 6.0 on a Windows 2003 server. You will also need to be running ASP .Net 2.0 for this solution to work. You might be able to get this to work on other platforms, but I have not tested them. What we will be doing here is creating a dynamic robots.txt file. It is only one file, but it returns different results depending on whether you connect with http or https.
1) Create your robots.txt file:
<%@ WebHandler Language="C#" Class="MyNamespace.robotshandler" %>
using System;
using System.Web;

namespace MyNamespace {
    // Serves robots.txt dynamically so http and https requests get different rules.
    public class robotshandler : IHttpHandler {
        public void ProcessRequest(HttpContext context) {
            context.Response.ContentType = "text/plain";
            context.Response.Write("User-agent: *\n");
            // IIS sets the HTTPS server variable to "on" or "off".
            if (context.Request.ServerVariables["Https"] == "off") {
                // HTTP: allow the site, minus the directories you want hidden
                context.Response.Write("Allow: /\n");
                context.Response.Write("Disallow: /MyDisallowedDirectory/\n");
            } else {
                // HTTPS: block everything so only the http copy is indexed
                context.Response.Write("Disallow: /\n");
            }
        }

        public bool IsReusable {
            get { return false; }
        }
    }
}
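If you want to check the rules without going through IIS, one option is to pull the text building out into a small helper. This is only a sketch: the RobotsRules class and its Build method are names I made up for illustration, and MyDisallowedDirectory is the same placeholder used in the handler above.

```csharp
using System;
using System.Text;

// Hypothetical helper (not part of the handler above): builds the
// robots.txt body for a given protocol.
public static class RobotsRules
{
    public static string Build(bool isSecure)
    {
        StringBuilder sb = new StringBuilder();
        sb.Append("User-agent: *\n");
        if (isSecure)
        {
            // HTTPS: keep bots out of the whole secure copy
            sb.Append("Disallow: /\n");
        }
        else
        {
            // HTTP: allow the site, minus the placeholder directory
            sb.Append("Allow: /\n");
            sb.Append("Disallow: /MyDisallowedDirectory/\n");
        }
        return sb.ToString();
    }
}
```

Inside ProcessRequest the call could then be context.Response.Write(RobotsRules.Build(context.Request.IsSecureConnection)); where IsSecureConnection is the standard HttpRequest property, used here in place of the ServerVariables check. I have only tested the ServerVariables version on the setup described in this post, so treat the swap as an assumption.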
2) IIS needs to pass .txt files through ASP .Net:
- Open IIS, right click on your website and bring up the Properties screen.
- Go to Home Directory > Configuration. You will be on the Mappings tab.
- Locate the ASPX item and click Edit.
- Copy the path in the Executable field, then Cancel out of that window.
- Click “Add”.
- Populate the Executable path with the value you just copied.
- Enter “.txt” as the Extension.
- Enter “GET” in the “Limit To” field.
- Save and exit.
3) Modify web.config so the .txt request is processed correctly:
- Add the following within system.web. Look for these sections in your web.config first: the httpHandlers tags are most likely already there, in which case you only need to add the lines inside them. (buildProviders normally sits inside the compilation element.)
<httpHandlers>
    <add path="/robots.txt" verb="GET" type="System.Web.UI.SimpleHandlerFactory" />
    <add path="*.txt" verb="GET" type="System.Web.StaticFileHandler" />
</httpHandlers>
<buildProviders>
    <add extension=".txt" type="System.Web.Compilation.WebHandlerBuildProvider" />
</buildProviders>
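For orientation, here is roughly where those two sections sit in a complete web.config. This is a sketch, not a drop-in file: nesting buildProviders inside compilation follows my understanding of the ASP .Net 2.0 configuration schema, and your existing compilation element will probably carry attributes and child elements that you should keep.

```xml
<configuration>
  <system.web>
    <!-- Keep any attributes and children your compilation element already has -->
    <compilation>
      <buildProviders>
        <!-- Compile .txt files the same way .ashx handlers are compiled -->
        <add extension=".txt" type="System.Web.Compilation.WebHandlerBuildProvider" />
      </buildProviders>
    </compilation>
    <httpHandlers>
      <!-- robots.txt runs as a handler; every other .txt is served as a static file -->
      <add path="/robots.txt" verb="GET" type="System.Web.UI.SimpleHandlerFactory" />
      <add path="*.txt" verb="GET" type="System.Web.StaticFileHandler" />
    </httpHandlers>
  </system.web>
</configuration>
```

The robots.txt entry comes before the *.txt entry so that the more specific mapping is listed first, matching the order used above.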
Make sure your new robots.txt file is in the root folder of your website and everything should be good to go. Request robots.txt over http and then over https and you should get two different results. You will have to modify the robots.txt file to exclude the folders and files you want.
1 Comment for this entry
March 17th, 2008 on 8:47 am
One thing to be aware of is that this should be done when you are first setting up your site. You can run into problems if you put this into place on an existing site: your robots.txt tells Google not to crawl your https pages again, but Google keeps the old https pages it has already indexed.
You cannot use the URL removal tool to get rid of the https version of your site unless you want to get rid of your http content as well. What you can do is add some meta tags to your https content to tell Google to drop the old content. I will try to post on how to do this at a later time.