Tag: HTTPS
Robots.txt Http and Https - Part II
by admin on Mar.21, 2008, under SEO, Systems
So I posted earlier on serving a different robots.txt depending on whether the bot connects over http or https here. I noted at the end that if this is an existing site and Google has already indexed your https content, you need to be careful. Once you block Google from your https content with the robots.txt file, it stops crawling those pages, so it never sees any changes to them, but it keeps the old pages in its index.
You cannot use the URL removal tool because you can't tell it to remove only https URLs. The way around this is to remove the robots.txt restrictions from https and add meta tags, on your https content only, that tell Google to remove it.
The tags you are going to want to use are:
<meta name="ROBOTS" content="NONE">
and
<meta name="GOOGLEBOT" content="NOARCHIVE">
This tells Google not to index the pages (NONE is shorthand for NOINDEX, NOFOLLOW) and to remove any old cached copies and index entries. It could take some time for Google to recrawl all of your https content. Once everything you want out of the index is gone, you can go ahead and switch back to the easier https robots.txt restrictions.
I do not have the code that lets you set different meta tag info based on http or https. If someone has that, please comment on this post. This should help out the SEO on your site. Thanks
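One possible way to vary the meta tags by protocol, offered only as a rough sketch and not tested on the IIS 6.0 setup described in these posts: keep the decision in a small helper and call it from the page. The class name RobotsMeta and the method For are hypothetical names made up for this example. In an ASP.NET page you would pass Request.IsSecureConnection as the argument and write the result into the page head, for example through a Literal control.

```csharp
using System;

// Hypothetical helper: RobotsMeta and For are illustrative names, not part of ASP.NET.
public static class RobotsMeta
{
    // Returns the removal meta tags for https requests and nothing for http.
    // In a page, the argument would come from Request.IsSecureConnection.
    public static string For(bool isSecureConnection)
    {
        if (isSecureConnection)
        {
            return "<meta name=\"ROBOTS\" content=\"NONE\">\n" +
                   "<meta name=\"GOOGLEBOT\" content=\"NOARCHIVE\">";
        }
        // Plain http pages stay indexable, so no tags are emitted.
        return string.Empty;
    }

    public static void Main()
    {
        // Shows the tags an https page would get.
        Console.WriteLine(For(true));
        // An http page gets an empty string.
        Console.WriteLine(For(false).Length);
    }
}
```

Keeping the logic in one helper means the tags can be dropped from a single place once Google has cleared the https pages from its index.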
Robots.txt Http and Https
by admin on Mar.16, 2008, under SEO, Software, Systems
One of the big things a lot of people are doing these days is SEO, or search engine optimization. I am not planning on writing a comprehensive guide to SEO right now, but as I run into SEO tasks in my own job I will post them here.
One of the more recent tasks I had was setting up a different robots.txt for our http site vs our https site. It is the same content, but Google indexes the two as separate copies and then penalizes your site for duplicate content. You can set up your robots.txt file to exclude the files and directories you want from the bots, but that doesn't help with http vs https. Both protocols use the same robots.txt file, so you still end up with duplicate content in Google's index.
Here is how to solve this problem when you are running IIS 6.0 on a Windows 2003 server. You will also need to be running ASP.NET 2.0 for this solution to work. You might be able to get this to work on other platforms, but I have not tested them. What we will be doing here is creating a dynamic robots.txt file. It is only one file, but it displays different results depending on whether you connect with http or https.
1) Create your robots.txt file:
<%@ WebHandler Language="C#" Class="MyNamespace.robotshandler" %>
using System;
using System.Web;

namespace MyNamespace {
    public class robotshandler : IHttpHandler {
        public void ProcessRequest(HttpContext context) {
            context.Response.ContentType = "text/plain";
            context.Response.Write("User-agent: *\n");
            // The HTTPS server variable is "off" for plain http requests.
            if (context.Request.ServerVariables["Https"] == "off") {
                // HTTP: allow crawling, except for the listed directories
                context.Response.Write("Allow: /\n");
                context.Response.Write("Disallow: /MyDisallowedDirectory/\n");
            } else {
                // HTTPS: block the whole site
                context.Response.Write("Disallow: /\n");
            }
        }

        public bool IsReusable {
            get { return false; }
        }
    }
}
2) Set up IIS to pass .txt files through ASP.NET:
- Open IIS, right-click your website and bring up the Properties screen
- Go to Home Directory > Configuration. You will be on the Mappings tab.
- Locate the ASPX item and click Edit
- Copy the path in the Executable field and cancel out of that window
- Click "Add"
- Populate the Executable path with the value you just copied
- Enter ".txt" in the Extension field
- Enter "GET" in the "Limit To" field
- Save and exit
3) Modify web.config so the .txt file is processed correctly:
- Add the following under system.web. Look for these sections in your web.config first: the httpHandlers tags are most likely in there already, in which case you just add the lines inside them. Note that the buildProviders section belongs inside the compilation element of system.web.
<httpHandlers>
    <add path="/robots.txt" verb="GET" type="System.Web.UI.SimpleHandlerFactory" />
    <add path="*.txt" verb="GET" type="System.Web.StaticFileHandler" />
</httpHandlers>
<buildProviders>
    <add extension=".txt" type="System.Web.Compilation.WebHandlerBuildProvider" />
</buildProviders>
Make sure your new robots.txt file is in the root folder of your website and everything should be good to go. Request /robots.txt over http and then over https, and the two responses should be different. You will have to modify the robots.txt file to exclude the folders and files you want.
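If you want to check the two responses with code instead of by eye, here is a minimal sketch. RobotsCheck and DisallowsEverything are hypothetical names made up for this example, and the two sample bodies are simply what the handler from step 1 writes for each protocol. To test a live site you could load the real bodies with System.Net.WebClient.DownloadString instead. The check is deliberately naive: it only looks for a "Disallow: /" line and ignores User-agent grouping.

```csharp
using System;

// Hypothetical helper: RobotsCheck is an illustrative name, not part of ASP.NET.
public static class RobotsCheck
{
    // Returns true if any line of the robots.txt body is exactly "Disallow: /".
    public static bool DisallowsEverything(string robotsBody)
    {
        foreach (string rawLine in robotsBody.Split('\n'))
        {
            // Trim also strips a trailing '\r' from Windows line endings.
            string line = rawLine.Trim();
            if (string.Equals(line, "Disallow: /", StringComparison.OrdinalIgnoreCase))
            {
                return true;
            }
        }
        return false;
    }

    public static void Main()
    {
        // Sample bodies matching what the handler in step 1 writes.
        string httpBody = "User-agent: *\nAllow: /\nDisallow: /MyDisallowedDirectory/\n";
        string httpsBody = "User-agent: *\nDisallow: /\n";

        Console.WriteLine(DisallowsEverything(httpBody));   // prints False
        Console.WriteLine(DisallowsEverything(httpsBody));  // prints True
    }
}
```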