SEO Finds In Your Server Log

Posted by timresnik

I am a huge Portland Trail Blazers fan, and in the early 2000s my favorite player was Rasheed Wallace. He was a lightning rod of a player, and fans either loved or hated him. He led the league in technical fouls nearly every year he was a Blazer, mostly because he never thought he committed any sort of foul. Many of those technicals came when the opposing player missed a free-throw attempt and ‘Sheed’ passionately screamed his mantra: “BALL DON’T LIE.”

‘Sheed’ asserts that a basketball has metaphysical powers that act as a system of checks and balances for the integrity of the game. While this is debatable (ok, probably not true), there is a parallel to technical SEO: marketers and developers often commit SEO fouls when architecting a site or creating content, but implicitly deny that anything is wrong.

As SEOs, we use all sorts of tools to glean insight into technical issues that may be hurting us: web analytics, crawl diagnostics, and Google and Bing Webmaster Tools. All of these tools are useful, but there are undoubtedly holes in the data. There is only one true record of how search engine crawlers, such as Googlebot, process your website: your web server logs. As I am sure Rasheed Wallace would agree, logs are a powerful, oft-underutilized source of data that helps keep the integrity of your site’s crawl by search engines in check.

A server log is a detailed record of every action performed by a particular server. In the case of a web server, you can get a lot of useful information. In fact, back in the day before free analytics (like Google Analytics) existed, it was common to just parse and review your web logs with software like AWStats.

I initially planned on writing a single post on this subject, but as I got going I realized that there was a lot of ground to cover.
Instead, I will break it into two parts, each highlighting different problems that can be found in your web server logs:

This post: how to retrieve and parse a log file, and how to identify problems based on your server’s response codes (404, 302, 500, etc.).
The next post: identifying duplicate content, encouraging efficient crawling, reviewing trends, and looking for patterns, plus a few bonus non-SEO tips.

Step #1: Fetching a log file

Web server logs come in many different formats, and the retrieval method depends on the type of server your site runs on. Apache and Microsoft IIS are two of the most common. The examples in this post will be based on an Apache log file from SEOmoz.

If you work in a company with a Sys Admin, be really nice and ask him/her for a log file with a day’s worth of data and the fields that are listed below. I’d recommend keeping the size of the file below 1 GB, as the log file parser you’re using might choke. If you have to generate the file on your own, the method for doing so depends on how your site is hosted. Some hosting services store logs in your home directory in a folder called /logs and will drop a compressed log file in that folder on a daily basis. You’ll want to make sure that it includes the following columns:

Host: you will use this to filter out internal traffic. In SEOmoz’s case, RogerBot spends a lot of time crawling the site and needed to be removed for our analysis.
Date: if you are analyzing multiple days, this will allow you to analyze search engine crawl rate trends by day.
Page/File: this will tell you which directory and file is being crawled and can help pinpoint endemic issues in certain sections or with types of content.
Response code: knowing the response of the server, whether the page loaded fine (200), was not found (404), or the server was unavailable (503), provides invaluable insight into inefficiencies that the crawlers may be running into.
Referrers: while this isn’t necessarily useful for analyzing search bots, it is very valuable for other traffic analysis.
User Agent: this field will tell you which search engine made the request; without it, a crawl analysis cannot be performed.

Apache log files by default are returned without User Agent or Referrer; this is known as a “common log file.” You will need to request a “combined log file.” Make your Sys Admin’s job a little easier (and maybe even impress) and request the following format:

LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\"" combined

For Apache 1.3 you just need the “combined” nickname:

CustomLog log/access_log combined

For those who need to pull the logs manually, you will need to create a directive in the httpd.conf file with one of the above. There is a lot more detail on this subject here.

Step #2: Parsing a log file

You probably now have a compressed log file like ‘mylogfile.gz’ and it’s time to start digging in. There are myriad software products, free and paid, to analyze and/or parse log files. My main criteria for picking one include the ability to view the raw data, the ability to filter prior to parsing, and the ability to export to CSV. I landed on Web Log Explorer (http://www.exacttrend.com/WebLogExplorer/) and it has worked for me for several years. I will use it along with Excel for this demonstration. I’ve used AWStats for basic analysis, but found that it does not offer the level of control and flexibility that I need. I’m sure there are several more out there that will get the job done.

The first step is to import your file into your parsing software. Most web log parsers will accept various formats and have a simple wizard to guide you through the import. With the first pass of the analysis, I like to see all the data and do not apply any filters. At this point, you can do one of two things: prep the data in the parser and export it for analysis in Excel, or do the majority of the analysis in the parser itself.
I like doing the analysis in Excel in order to create a model for trending (I’ll get into this in the follow-up post). If you want to do a quick analysis of your logs, using the parser software is a good option.

Import Wizard: make sure to include the parameters in the URL string. As I will demonstrate in later posts, this will help us find problematic crawl paths and potential sources of duplicate content.

You can choose to filter the data using some basic regex before it is parsed. For example, you could restrict the analysis to traffic for one particular section of your site.

Once you have your data loaded into the log parser, export all spider requests and include all response codes.

Once you have exported the file to CSV and opened it in Excel, here are some steps and examples to get the data ready for pivoting into analysis and action:

1. Page/File: in our analysis we will try to expose directories that could be problematic, so we want to isolate the directory from the file. The formula I use to do this in Excel looks something like this.

Formula: =IF(ISNUMBER(SEARCH("/",C29,2)),MID(C29,(SEARCH("/",C29)),(SEARCH("/",C29,(SEARCH("/",C29)+1)))-(SEARCH("/",C29))),"no directory")

2. User Agent: in order to limit our analysis to the search engines we care about, we need to search this field for specific bots. In this example, I’m including Googlebot, Googlebot-Images, BingBot, Yahoo, Yandex and Baidu.

Formula (yeah, it’s U-G-L-Y):

=IF(ISNUMBER(SEARCH("googlebot-image",H29)),"GoogleBot-Image",IF(ISNUMBER(SEARCH("googlebot",H29)),"GoogleBot",IF(ISNUMBER(SEARCH("bing",H29)),"BingBot",IF(ISNUMBER(SEARCH("Yahoo",H29)),"Yahoo",IF(ISNUMBER(SEARCH("yandex",H29)),"yandex",IF(ISNUMBER(SEARCH("baidu",H29)),"Baidu","other"))))))

Your log file is now ready for some analysis.

Let’s take a breather, shall we?
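If you would rather script the prep work than build it in Excel, the two formulas above can be approximated in a few lines of Python. This is only a sketch under assumptions: it expects lines in the Apache “combined” format requested in Step #1, the function names are made up for illustration, and unlike the Excel formula it drops the query string before looking for the directory.

```python
import re
from typing import Optional

# Apache "combined" format:
# host ident user [time] "request" status bytes "referer" "user-agent"
COMBINED_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Order matters: "googlebot-image" has to be tested before "googlebot",
# exactly as in the nested IF formula.
BOTS = [
    ("googlebot-image", "GoogleBot-Image"),
    ("googlebot", "GoogleBot"),
    ("bing", "BingBot"),
    ("yahoo", "Yahoo"),
    ("yandex", "yandex"),
    ("baidu", "Baidu"),
]


def classify_agent(agent: str) -> str:
    """Map a raw User Agent string to one of the bot labels, or 'other'."""
    lowered = agent.lower()
    for needle, label in BOTS:
        if needle in lowered:
            return label
    return "other"


def top_directory(path: str) -> str:
    """Return the first path segment, e.g. '/blog' for '/blog/some-post'."""
    parts = path.split("?", 1)[0].split("/")
    # '/blog/post' -> ['', 'blog', 'post']; a bare '/page' has no directory.
    if len(parts) >= 3 and parts[1]:
        return "/" + parts[1]
    return "no directory"


def parse_line(line: str) -> Optional[dict]:
    """Parse one combined-format log line into a dict, or None if it does not match."""
    match = COMBINED_RE.match(line)
    if match is None:
        return None
    row = match.groupdict()
    request_parts = row["request"].split()
    path = request_parts[1] if len(request_parts) >= 2 else ""
    row["path"] = path
    row["directory"] = top_directory(path)
    row["bot"] = classify_agent(row["agent"])
    return row
```

Feeding every line of the uncompressed log through `parse_line` and keeping the rows where `bot` is not "other" gives roughly the same table as the Excel steps.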
Step #3: Uncover server and response code errors

The quickest way to suss out issues that search engines are having with the crawl of your site is to look at the server response codes that are being served. Too many 404s (page not found) can mean that precious crawl resources are being wasted. Massive numbers of 302 redirects can point to link equity dead-ends in your site architecture. While Google Webmaster Tools provides some information on such errors, it does not provide a complete picture: LOGS DON’T LIE.

The first step in the analysis is to generate a pivot table from your log data. Our goal here is to isolate the spiders along with the response codes that are being served. Select all of your data and go to ‘Data > Pivot Table.’

On the most basic level, let’s see who is crawling SEOmoz on this particular day.

There are no definitive conclusions that we can make from this data, but there are a few things that should be noted for further analysis. First, BingBot is crawling the site at about an 80% higher clip than GoogleBot. Why? Second, ‘other’ bots account for nearly half of the crawls. Did we miss something in our search of the User Agent field? As for the latter, we can see from a quick glance that most of what is counted as ‘other’ is RogerBot, so we’ll exclude it.

Next, let’s have a look at server codes for the engines that we care most about.

I’ve highlighted the areas that we will want to take a closer look at. Overall, the ratio of good to bad looks healthy, but since we live by the mantra that “every little bit helps,” let’s try to figure out what’s going on.

1. Why is Bing crawling the site at nearly 2x the rate of Google? We should investigate whether Bing is crawling inefficiently and whether there is anything we can do to help it along, or whether Google is not crawling as deep as Bing and whether there is anything we can do to encourage a deeper crawl.

By isolating the pages that were successfully served (200s) to BingBot, the potential culprit is immediately apparent.
Nearly 60,000 of the 100,000 pages that BingBot crawled successfully were user login redirects from a comment link.

The problem: SEOmoz is architected in such a way that if a comment link is requested and JavaScript is not enabled, it will serve a redirect (served as a 200 by the server) to an error page. With nearly 60% of Bing’s crawl being wasted on such dead-ends, it is important that SEOmoz block the engines from crawling them.

The solution: add rel=’nofollow’ to all comment and reply-to-comment links. Typically, the ideal method for telling an engine not to crawl something is a directive in the robots.txt file. Unfortunately, that won’t work in this scenario because the URL is being served via JavaScript after the click.

GoogleBot is dealing with the comment links better than Bing and avoiding them altogether. However, Google is successfully crawling a handful of links that are login redirects. Take a quick look at the robots.txt and you will see that this directory should probably be blocked.

2. The number of 302s being served to Google and Bing is acceptable, but it doesn’t hurt to review them in case there are better ways of dealing with some of the edge cases. For the most part, SEOmoz is using 302s for defunct blog category architecture that redirects the user to the main blog page. They are also being used for private message pages (/message), and a robots.txt directive should exclude these pages from being crawled at all.

3. Some of the most valuable data that you can get from your server logs are links being crawled that resolve to a 404. SEOmoz has done a good job managing these errors and does not have an alarming level of 404s. A quick way to identify potential problems is to isolate 404s by directory. This can be done by running a pivot table with “Directory” as your row label and count of “Directory” in your value field.
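The same directory pivot can be reproduced outside Excel. The Python sketch below is just an illustration of the counting logic: the `pivot_counts` helper and the sample rows are hypothetical, and the rows are assumed to carry the same "bot", "status" and "directory" columns as the exported CSV.

```python
from collections import Counter


def pivot_counts(rows, row_key, status=None):
    """Count rows per value of row_key, optionally for a single status code."""
    counts = Counter()
    for row in rows:
        if status is not None and row["status"] != status:
            continue
        counts[row[row_key]] += 1
    return counts


# Hypothetical rows, shaped like the exported CSV columns (not real SEOmoz data).
rows = [
    {"bot": "BingBot", "status": "404", "directory": "/comments"},
    {"bot": "BingBot", "status": "404", "directory": "/comments"},
    {"bot": "GoogleBot", "status": "404", "directory": "/blog"},
    {"bot": "GoogleBot", "status": "200", "directory": "/blog"},
]

# 404s by directory, largest first: the equivalent of the pivot described above.
not_found = pivot_counts(rows, "directory", status="404")
for directory, count in not_found.most_common():
    print(directory, count)
```

Calling `pivot_counts(rows, "bot")` without a status gives the earlier "who is crawling" view from the same helper.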
You’ll get a table of 404 counts by directory.

The problem: the main issue that’s popping here is that 90% of the 404s are in one directory, /comments. Given the issues with BingBot and the JavaScript-driven redirect mentioned above, this doesn’t really come as a surprise.

The solution: the good news is that since we are already using rel=’nofollow’ on the comment links, these 404s should also be taken care of.

Conclusion

Google and Bing Webmaster Tools provide you with information on crawl errors, but in many cases they limit the data. As SEOs we should use every source of data that is available and, after all, there is only one source of data that you can truly rely on: your own.

LOGS DON’T LIE!

And for your viewing pleasure, here’s a bonus clip for reading the whole post.

Sign up for The Moz Top 10, a semimonthly mailer updating you on the top ten hottest pieces of SEO news, tips, and rad links uncovered by the Moz team. Think of it as your exclusive digest of stuff you don’t have time to hunt down but want to read!


PHP Errors as a Means of Getting Links

Posted by Eugene Krall

This post was originally in YouMoz, and was promoted to the main blog because it provides great value and interest to our community. The author’s views are entirely his or her own and may not reflect the views of SEOmoz, Inc.

1. Using a Search Engine for Finding Faulty Sites

I was reading the article about “Broken Link Building” the other day when I realized that there might be a possible extension to the idea of helping webmasters keep their sites together. Since there is a lot of stuff that can go wrong with a website, I started probing possibilities. Here is what I came up with.

There are certainly many problems with a site that an ordinary user might notice. But in order to take a well-structured and organized approach, I had to find sites with a certain clear and present problem, and be able to find these sites in bulk. While thinking about this, I was going about my usual everyday routine when suddenly a PHP error popped up on the site I was browsing at the time. You have no doubt encountered something like that lots of times. I knew for sure that I had seen this type of error A LOT. Only all of those times I was in an absolutely different frame of mind and had no idea how I could use it to my benefit.

The site was related to my own (that’s why I was browsing it in the first place), and of pretty decent quality and value, so it made perfect sense for me to ask a responsible person for a link from it. Only all of us know that you do not simply contact a person and ask them to link to you right away. You wouldn’t, would you? From experience we all know that requests that generate little or no value for the requestee are better based on a relationship, even if it is established by one simple sentence which says “Hi man, it seems your PHP is getting out of control, you had better do something about it:”.
Try a more subtle approach: show them that you sent the message only because you felt that it would be appropriate to let them know about the problem, not because you wanted to use it to your benefit. “By the way, I was wondering if there is any chance that you could link to my site. It seems your visitors might be interested in this sort of thing. Anyway, size it up for yourself and, if you are of the same opinion, kindly add the link and let me know.” But let me explain everything step by step.

2. PHP Notification Explained

A lot of you probably have quite a perfunctory understanding of PHP. So do I. The beauty of it is that you do not have to be a programmer to help webmasters with their PHP problems. I will try to explain in short what you should know in order to be ready to write a PHP error message.

Let’s assume that you already know that PHP is a server-side scripting language. If something goes wrong and a PHP command/function cannot be executed properly for a page loading in a browser, the PHP engine throws up a notification on the page (like the one displayed in the picture above). Sometimes they are not displayed, though, if the webmaster has turned off the display of notifications in the PHP settings. There are several major types of notifications, but all of them are uniform, which makes it possible for us to find them on Google. Let’s take a look at a couple of examples:

Warning: include(../inc_header.php) [function.include]: failed to open stream: No such file or directory in /home/actualad/public_html/hotel_soaltee_crowne_plaza.php on line 21
Notice: Uninitialized string offset: 0 in /var/www/odkryjpolske.pl/op3/functions/functions.php on line 1952
Fatal error: Call to undefined function tweetmeme() in /home/content/40/8396940/html/blog/wp-content/themes/magilas/single.php on line 54
Deprecated: Function ereg_replace() is deprecated in /var/www/virtual/sleepingparis.com/htdocs/admin/filemappa.php on line 18
Strict Standards: Non-static method DB::connect() should not be called statically in C:wwwAmautaphpdosmanosperuconnectiongateway.inc.php on line 28

I was able to spot five types of error notifications and then made an attempt to figure out a way to find them on Google so that the results were as relevant as possible. After taking a closer look I figured out that the part “php on line” was present in all the types of notifications. The only other parts that seemingly remained the same were the words “Warning/Notice/Fatal Error/Deprecated/Strict Standards.”

3. Seek…

So, in order to get results containing pages with PHP error notifications, you should form queries:

"warning:" "[function." "php on line"
"notice:" "php on line"
"fatal error:" "php on line"
"deprecated:" "php on line"
"strict standards:" "php on line"

But that might not be enough. Instead of getting search results with actual error notifications on faulty pages, you might stumble on a discussion of that error on some forum, or even the official PHP site. The solution is as follows: add some relevant keywords to your query (defining the type of site you want to deal with). Let's assume I have an online hotel reservation site and I want to get in touch with tour and hotel sites all over the world. I do the following:

"warning:" "[function." "php on line" intitle:tours
"notice:" "php on line" intitle:tours
"fatal error:" "php on line" intitle:tours
"deprecated:" "php on line" intitle:tours
"strict standards:" "php on line" intitle:tours

The result is more than satisfactory. There are 1,380,000 results for the query "warning:" "[function." "php on line" intitle:tours and even the last hundred results out of the 1000 displayed on Google are at least 50% relevant to what I was searching for. I mean the pages displayed indeed have a PHP error notification on them and offer tour services.
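If you plan to run these searches for more than one niche, typing the five queries by hand gets old quickly. Here is a small Python sketch that assembles them for any keyword. The phrase list mirrors the queries above; the function name and the choice to always use the intitle: operator are my own assumptions for illustration, and the script only builds the strings rather than sending anything to a search engine.

```python
from typing import List

# One entry per notification type; each phrase is wrapped in quotes in the query.
ERROR_PHRASES = [
    ["warning:", "[function.", "php on line"],
    ["notice:", "php on line"],
    ["fatal error:", "php on line"],
    ["deprecated:", "php on line"],
    ["strict standards:", "php on line"],
]


def build_queries(keyword: str) -> List[str]:
    """Return one exact-phrase query per error type, narrowed with intitle:keyword."""
    queries = []
    for phrases in ERROR_PHRASES:
        quoted = " ".join('"{}"'.format(phrase) for phrase in phrases)
        queries.append("{} intitle:{}".format(quoted, keyword))
    return queries


for query in build_queries("tours"):
    print(query)
```

Swapping "tours" for another keyword gives the same set of queries for a different type of site.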
But if you somehow feel that the results aren't relevant enough, you can always extend your search query by adding additional keywords.

There is also a more thorough way to go. You may further break down the types of PHP errors by the contents of a notification. Let's assume you have stumbled upon a warning notification which looks like:

Warning: include_once(language/mn.php) [function.include-once]: failed to open stream: No such file or directory in /hermes/waloraweb061/b490/pow.sndmn/htdocs/destination/index.php on line 34

It is the easiest one to solve since it clearly states that a file or directory is missing. So all you have to write to the webmaster is “check your files and their names carefully.” Let’s find the constant part of the message and delete all the information which changes (like the names of the files and folders, paths to them, etc.). And do not forget to attach your keywords! Here is what we get after some tweaking:

"Warning:" "[function.include-once]: failed to open stream: No such file or directory in" "php on line" intitle:tours

The query produces 132 results, which is quite something to start with since you already have the solution to the problem in your pocket. Now all you have to do is send your message to the respective parties in the results and wait for them to reply!

4. … and destroy!

Usually, if you direct the attention of webmasters to a problem with their site, they should know what to do about it, since they have probably earned being called “webmasters.” Still, there is a considerable percentage of people who take care of their site to some extent without a deep understanding of its mechanics. The site might have been created by a company for a client who does not know much about this sort of stuff. In this case, your ability to search for information on the Internet will enable you to warm the hearts of a mighty number of people who can add your link to their sites.
What I am talking about here is trying to get to the bottom of the problem before contacting the webmaster, so that you can not only point out the problem but also help with figuring things out. It’s nice if you are into PHP and can crack any related problem without referring to the World Wide Web. If, however, you are not that sort of person, you might want to read some specialized PHP forums: PHP Freaks, SitePoint PHP Forum, PHP Help, Codewalkers PHP-Related Forum.

If the search comes up with a page which contains the following notification:

Warning: Cannot modify header information - headers already sent by (output started at /home/zungahto/public_html/includes/joomla.php:836) in /home/zungahto/public_html/includes/joomla.php on line 697

you have enough information to be able to find a solution. Let’s try to perform the following searches:

site:forums.phpfreaks.com "Warning: Cannot modify header information - headers already sent by"
site:sitepoint.com "Warning: Cannot modify header information - headers already sent by"

Google comes up with as many as 4,350 and 4,700 results for forums.phpfreaks.com and sitepoint.com respectively, which is one damn mighty pile to browse through. You might want to look through the top 10 and include links to the discussions you deem appropriate in your message. Another way to go is simply to send the URL of the forum so that the recipient can start a new thread to address their problem exclusively. Of course it’s not an exhaustive solution, but it will give the webmaster something to lean on. If you want to go hardcore, though, you can bring in one of your company’s programmers to give advice to the recipient (only I think it should be one hell of a good site to go to such extremes for a link).

Finally, check out this list of common PHP error messages. If you want to proceed with the idea of searching for a specific sort of error, you might want to read this one and then continue on with the search.
Let’s assume I have chosen to proceed with the “use of undefined constant” error. I have some sort of solution already, poor as it is (it’s in the document provided above). So all you need to do is search for this sort of message:

"Notice:" "Use of undefined constant" "php on line" intitle:tours

And heeeere we go! 59 results.

5. Checking the Activities of the Website

Furthermore, the fact that an error notification has been around long enough for a search engine to index it may mean that the site is getting seriously out of hand and might have been completely abandoned by the crew, so before sending a message, ensure in some way that the site is still in business. I do it by performing these steps:

Enter a query consisting of the advanced “site” operator along with the domain of the site in question (site:phperrorsite.com).
Expand the list of Google Search Tools under the “Show Search Tools” link in the left column of the Google search results page.
In the time selection section, choose the “past year” option.
The options “sort by relevance” and “sort by date” will appear; choose the latter.

Now pay attention to the date in the first search result. The more recent it is, the better. This, along with the number of documents in the search results pertaining to the domain, gives you an approximate idea of what is going on with the site. Remember, this is far from precise and I would be glad to hear any other ideas about how one can discover whether a site has been abandoned or not.

6. Taking a Deeper Look into the Problems of a Website

You may also want to check whether there are any other PHP errors of one sort or another by performing the respective search query modified by the “site” search operator:

site:phperrorsite.com "warning:" "[function." "php on line"

The higher the number of documents found, the higher the likelihood that the site owners/editors just do not give a rat’s ass about their website and probably won’t respond to your message of good intention.

Suppose you have now established that the site is still closely watched: there is only one document with a PHP error, which in turn happens to be in some secluded corner of the site and might simply have been missed by the webmaster. There is no excuse, though, for a PHP error notification right above the header on the main page of a site. I think twice before contacting sites of that sort. Like, how could they have missed that and even given a search engine time to index it!?

7. Composing a PHP error notification email

Well, I guess I have fed you all I had on the subject. Now for the pivotal point: composing the email. I guess it’s one of the most popular subjects among SEOmoz blog writers, so I do not want to write something that has been written thousands of times before me. Try to check out the most recent article on the subject (at the time of this writing). Nevertheless, here is my way of doing business. I hope this example might be of use to someone.

Subject: I was browsing through your site phperrorsite.com out of professional curiosity (I maintain an online hotel reservation site myself) and came upon a page that was obviously getting out of hand:

I understand how tiresome it might be to keep everything in line, so I have attached a document with a possible solution to this problem.
There are also quite a few discussions on the internet pertaining to this problem; check them out if you will:

However, I would recommend that you start a new thread on one of the forums, laying down all the details. Please drop me a line at your convenience; I would like to know how things worked out for you. Best of luck with your work!

You could omit some of the stuff, like the URLs of forum threads. To keep it simple, you could just mention the page with the error and ask them to get back to you. After the reply, you can move on to asking for a little favor. Hopefully, you will be granted one!

Hi again,


8 Ways to Find Old URLs After a Failed Site Migration – Whiteboard Friday

Posted by iPullRank

In this week’s Whiteboard Friday, we are going to be going through some different ways you can track down old URLs after a site migration. These tactics can be incredibly useful for new clients that have just performed a redesign with less than ideal preparation. I’ll be presenting eight ways for you to track down these old URLs, but I would love to see some of your own methods in the comments below. Happy Friday everyone!

Video Transcription

Greetings and salutations SEOmoz fans. My name is Michael King. I’m the Director of Inbound Marketing at iAcquire. I’m also iPullRank on the SEOmoz boards and on Twitter. So today what we’re going to talk about is eight ways to figure out old URLs after a failed site migration.

I know you have this problem. You get a new client, they just redesigned, and you have no idea what the old URLs are. They didn’t do 301 redirects. They have no idea what the social numbers are anymore, and you have no idea where to start. Well, I’m going to show you how.

Now one of the first tactics you want to use is the Wayback Machine. You just put the site in there, the URL, the domain, what have you, and see what it has in that index. Once you get that, you can easily just pull off those URLs on the site through the links using Scraper for Chrome or whatever tool you want to use. You can actually pull down the code and pull them out using Find and Replace, whatever you want to do. That’s just one of the tactics that we’re using.

A lot of times people will also not change or update their XML sitemap. So you can just download that XML sitemap and then open it in Excel, and it puts it in these tables. You can just take that first column and copy and paste it into a text file, open it in Screaming Frog, and then crawl in list mode to see if those URLs still exist. Anything that’s a 404, that’s a URL that you can use, and you can easily map those ultimately to the new URLs on that site.

You also want to use your backlink profile.
When I say that, I don’t want you to essentially use one tool, I want you to use as many tools as possible. So definitely start from Open Site Explorer. Also use Majestic, Ahrefs, whatever you want to use, and collect as much link data as possible. Also Webmaster Tools has your links, so use those as well. Then crawl all those links, all the targets of those links, and make sure those pages are still in existence. All the 404s, again, you know these are old URLs that you can then redirect to new pages.

Then you also want to check the 404s from Google Webmaster Tools and map those pages to new pages as well.

Then you can also use analytics. So pull your historic analytics from before the site redesign and find all those URLs and see which ones are still in existence. Again, go back to Screaming Frog with list mode and check whether they’re 404ing or 200ing. The ones that are 200, you don’t have to worry about. The ones that are 404s are the ones that you need to remap.

Then you can also use the CMS change log. So, for example, when you make a change in WordPress to a URL, there’s a record of that, and you can actually pull those URLs out and use those again for mapping.

Then, for those of you that are a little more adventurous, you can go into your log files and see what URLs were driving traffic before the redesign. Same thing as what you would do with the analytics, but just from a server-side standpoint rather than just your click path stuff.

And also social media. So people share these URLs. Any shared URL has equity beyond just link equity. So you definitely want to make sure that you’re pushing those social share numbers to the right URLs that you’re mapping towards, and I wrote a post on Search Engine Watch about how you can do that. But you can use the Facebook recommendations tool. So it’s not really a tool. It’s a demo for a widget that goes on your site.
But essentially, you can go through this tool and put in the domain name, and it’s going to give you all the shared URLs, all the shared content. The way it comes in the box is it’s 300 pixels tall, but if you expand that to 1,000 pixels, you’ll see the top 20 pieces of content that were shared. So you can real easily identify a popular URL that you can then redirect.

Also you can use Topsy the same way. If people have tweeted these URLs, you can just put that domain name in there. It’s going to search for them. It’s going to give you all the URLs that Topsy has indexed. You can also use Social Mention; any social listening tool you can use the same way.

And then also social bookmarks, so things like Digg, Delicious, and such. Look and see what people have actually shared and bookmarked for your site.

So that’s a quick one. Hope you guys found that useful, and I’d love to know how you guys have found this to be worthwhile. So holler at me in the comments down there, and thanks very much. Peace.

Video transcription by Speechpad.com
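For anyone who wants to script the sitemap tactic from the transcript instead of going through Excel, here is a rough Python sketch. It assumes the old file follows the standard sitemaps.org format; the function names, the example.com URLs and the status codes are all hypothetical, and the statuses are expected to come from whatever crawler you use (for instance a list-mode crawl export) rather than being fetched by the script.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files (sitemaps.org protocol).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(xml_text):
    """Return every <loc> URL found in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


def urls_to_remap(statuses):
    """Given {url: HTTP status}, return the URLs that came back 404."""
    return sorted(url for url, code in statuses.items() if code == 404)


# Hypothetical sitemap left over from before the redesign.
old_sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/old-page</loc></url>
  <url><loc>http://www.example.com/about</loc></url>
</urlset>"""

urls = sitemap_urls(old_sitemap)

# Hypothetical crawl results for those URLs.
statuses = {
    "http://www.example.com/old-page": 404,
    "http://www.example.com/about": 200,
}
print(urls_to_remap(statuses))
```

The list returned by `urls_to_remap` is the set of old URLs that still needs to be mapped to new pages; the 200s can be left alone.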
