Spying on Google: How to use log file analysis for SEO


Log analysis should be part of every SEO professional’s toolkit, but most SEOs have never done it. As a result, most are missing out on unique and invaluable insights that conventional crawling tools simply cannot provide.

Let's demystify log file analysis so that it is not so intimidating. If you are interested in the world of log files and what they can bring to your site audits, this guide is for you.

What are log files?

Log files are files that contain detailed records of who makes requests to your website's server and what they request. Each time a bot sends a request to your site, data (such as the time, date, IP address, user agent, etc.) is stored in this log. This valuable data allows any SEO specialist to find out what Googlebot and other crawlers are doing on your site. Unlike conventional crawls, such as those run with the Screaming Frog SEO Spider, this is real-world data, not an estimate of how your site is being crawled. It is an exact record of how your site is being crawled.

Having this accurate data can help you identify crawl budget waste, easily find access errors, understand how your SEO efforts affect crawling, and much more. The best part is that in most cases this can be done with simple spreadsheet software.

In this tutorial, we will focus on Excel to analyze log files, but we will also mention other tools.

How to open log files

Rename .log to .csv

When you get a log file with a .log extension, opening it really is as easy as renaming the file extension to .csv and opening the file in a spreadsheet program. Remember to set your operating system to show file extensions if you want to edit them.

How to open split log files

Log files can come as either one large log or several files, depending on the server configuration of your site. Some servers use load balancing to distribute traffic across a pool or farm of servers, which splits the log files. The good news is that they are really easy to combine, and you can use one of the following methods to merge them and then open them as usual:

1. Use the command line in Windows: Shift + right-click in the folder containing the log files and select "Open PowerShell window here".

Then run the following command:

Get-Content *.log | Set-Content mylogfiles.csv

Now you can open mylogfiles.csv and it will contain all your log data.

Or, if you are a Mac user, first use the cd command to go to the directory of your log files:

cd Documents/MyLogFiles/

Then use the cat (concatenate) command to merge your files:

cat *.log > mylogfiles.csv

2. Use the free Log File Merge tool to merge all the log files, then change the file extension to .csv and open the file as usual.
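If you would rather script the merge, the same idea can be sketched in Python. This is a minimal sketch, not part of the original workflow: the function name merge_logs is made up for illustration, and it assumes the logs are plain text files ending in .log that sit in a single folder.

```python
from pathlib import Path


def merge_logs(log_dir: str, output_name: str = "mylogfiles.csv") -> int:
    """Concatenate every .log file in log_dir into one file.

    Returns the number of lines written. merge_logs is a hypothetical
    helper; the output name mirrors the one used in the commands above.
    """
    folder = Path(log_dir)
    lines_written = 0
    with open(folder / output_name, "w", encoding="utf-8", newline="") as out:
        # sorted() keeps the merge order predictable (alphabetical by name)
        for log_path in sorted(folder.glob("*.log")):
            with open(log_path, encoding="utf-8", errors="replace") as src:
                for line in src:
                    # guarantee a newline so two files never share a line
                    out.write(line if line.endswith("\n") else line + "\n")
                    lines_written += 1
    return lines_written


if __name__ == "__main__":
    # Assumed location; replace with the folder that holds your logs
    print(merge_logs("Documents/MyLogFiles"))
```

Sorting the file names usually keeps rotated logs in chronological order, but that depends on how your server names them.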

Line splitting

After you open the log file, you will need to split the bulky text in each cell into columns to make later sorting easier.

Excel's Text to Columns function comes in handy here: select all the filled cells (Ctrl / Cmd + A), go to Data > Text to Columns, and choose the Delimited option, with the delimiter set to whatever separates the fields in your log (typically a space).

Once the data is split into columns, you can also sort by time and date.
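For readers who prefer a script to a spreadsheet, the split-and-sort step might look like the sketch below. It assumes a W3C extended format log, where a "#Fields:" line names the space-separated columns. The helper name parse_w3c_log and the sample lines are invented for illustration.

```python
def parse_w3c_log(lines):
    """Split W3C extended log lines into dicts keyed by field name."""
    fields, rows = [], []
    for line in lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            # the directive lists the column names, separated by spaces
            fields = line[len("#Fields:"):].split()
        elif line and not line.startswith("#") and fields:
            values = line.split()
            # skip malformed lines whose column count does not match
            if len(values) == len(fields):
                rows.append(dict(zip(fields, values)))
    return rows


# Invented sample data; real logs usually carry more fields
sample = [
    "#Fields: date time cs-method cs-uri-stem sc-status cs(User-Agent)",
    "2019-01-02 10:00:05 GET /products/ 200 "
    "Mozilla/5.0+(compatible;+Googlebot/2.1;+http://www.google.com/bot.html)",
    "2019-01-01 09:30:00 GET / 200 "
    "Mozilla/5.0+(compatible;+bingbot/2.0;+http://www.bing.com/bingbot.htm)",
]

# ISO-style dates and 24-hour times sort correctly as plain strings
rows = sorted(parse_w3c_log(sample), key=lambda r: (r["date"], r["time"]))
for row in rows:
    print(row["date"], row["time"], row["cs-uri-stem"])
```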

Understanding Log Files

Now that your log files are ready for analysis, we can dive in and begin to understand the data. Log files come in many formats with different data points, but they usually include the following:

1. Server IP

2. Date and time

3. Server request method (for example, GET / POST)

4. Requested URL

5. HTTP status code

6. User Agent
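To make these fields concrete, here is a rough sketch that pulls them out of a single line in the Apache "combined" log format, which is one common format among many. The sample line is invented, the names COMBINED_LINE and parse_combined_line are hypothetical, and note that in this particular format the leading IP address belongs to the client making the request.

```python
import re
from datetime import datetime

# Pattern for the Apache "combined" format (an assumption about your server)
COMBINED_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def parse_combined_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = COMBINED_LINE.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    entry["timestamp"] = datetime.strptime(
        entry["timestamp"], "%d/%b/%Y:%H:%M:%S %z"
    )
    return entry


# Invented sample line for illustration
sample = (
    '66.249.66.1 - - [01/Jan/2019:09:30:00 +0000] "GET /products/ HTTP/1.1" '
    '200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; '
    '+http://www.google.com/bot.html)"'
)
entry = parse_combined_line(sample)
print(entry["ip"], entry["method"], entry["url"], entry["status"])
```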

How to quickly identify crawl budget waste

The crawl budget is the number of pages that a search engine crawls each time it visits your site. Many factors affect the crawl budget, including link equity or domain authority, site speed, and more. By analyzing the log files, we can see what kind of crawl budget your website has and where problems arise that cause crawl budget to be wasted.

Ideally, we want to give crawlers the most efficient crawling experience possible. Crawling should not be wasted on low-value pages and URLs, and priority pages (for example, product pages) should not suffer from slower indexing and crawl rates. Remember that good crawl budget conservation translates into better organic search performance.

View crawled URLs by user agent

Seeing how often the site's URLs are crawled lets you quickly determine where search engines are spending their crawl time.

If you are interested in the behavior of a particular user agent, it is as simple as filtering the corresponding column in Excel. In this case, using a W3C format log file, we filter the cs(User-Agent) column by Googlebot.

And then filter the URI column to show how many times Googlebot crawled the home page of this sample site.

This is a quick way to find out if there are any problem areas by URI for an individual user agent.
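The same filter can be sketched in a few lines of Python. This assumes the log has already been parsed into rows with "url" and "agent" keys; those key names, the function crawl_counts, and the sample rows are all made up for illustration.

```python
from collections import Counter


def crawl_counts(rows, bot="Googlebot"):
    """Count requests per URL for user agents containing `bot` (case-insensitive)."""
    needle = bot.lower()
    return Counter(
        row["url"] for row in rows if needle in row["agent"].lower()
    )


# Invented sample rows; in practice these come from your parsed log file
rows = [
    {"url": "/", "agent": "Mozilla/5.0 (compatible; Googlebot/2.1)"},
    {"url": "/", "agent": "Mozilla/5.0 (compatible; Googlebot/2.1)"},
    {"url": "/products/", "agent": "Mozilla/5.0 (compatible; Googlebot/2.1)"},
    {"url": "/", "agent": "Mozilla/5.0 (compatible; bingbot/2.0)"},
]

googlebot_hits = crawl_counts(rows)
for url, hits in googlebot_hits.most_common():
    print(hits, url)
```

Passing a different value for bot, such as "bingbot", gives the same breakdown for another crawler.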

From the filter menu of the URI column, we can see which URLs, including resource files, are being crawled, and quickly identify any problem URLs (for example, parameterized URLs that should not be crawled).

Understanding which bots are crawling, how mobile bots crawl differently from desktop bots, and where most of the crawling happens will help you see immediately where crawl budget is being wasted and which areas of the site need improvement.
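As a rough illustration of such a bot breakdown, the sketch below buckets user agent strings with simple substring checks. The labels and the classify_agent helper are assumptions, and the check is deliberately crude: for example, it files every non-mobile Googlebot variant under "Googlebot desktop".

```python
from collections import Counter


def classify_agent(agent):
    """Bucket a user agent string with crude substring checks (a heuristic)."""
    ua = agent.lower()
    if "googlebot" in ua:
        # Googlebot's smartphone crawler advertises a mobile browser string
        return "Googlebot smartphone" if "mobile" in ua else "Googlebot desktop"
    if "bingbot" in ua:
        return "Bingbot"
    return "other"


# Invented sample user agents for illustration
agents = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0 Mobile Safari/537.36 "
    "(compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
]

breakdown = Counter(classify_agent(agent) for agent in agents)
for label, hits in breakdown.most_common():
    print(hits, label)
```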

Find Low Value URLs

Crawl budget should not be wasted on low-value URLs.

Go back to your log file and filter the URL column (containing the URL stem) for URLs that contain a "?" (question mark). To do this in Excel, remember to search for "~?" (tilde, question mark), because a bare "?" is treated by Excel as a wildcard that matches any single character.
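Outside Excel, the question mark needs no escaping. The sketch below keeps only the URLs that carry a query string and counts which parameter names appear most often, which can hint at the parameters eating up crawl budget. The function names and sample URLs are invented for illustration.

```python
from collections import Counter
from urllib.parse import parse_qsl, urlsplit


def parameterized(urls):
    """Return only the URLs that carry a query string."""
    return [url for url in urls if urlsplit(url).query]


def parameter_counts(urls):
    """Count how often each query parameter name appears across the URLs."""
    counts = Counter()
    for url in urls:
        query = urlsplit(url).query
        for name, _value in parse_qsl(query, keep_blank_values=True):
            counts[name] += 1
    return counts


# Invented sample of crawled URLs
crawled = ["/shoes?color=red&size=9", "/shoes?color=blue", "/about"]

print(parameterized(crawled))
print(parameter_counts(crawled).most_common())
```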

Find duplicate URLs

Duplicate URLs can waste crawl budget and be a big SEO problem, but finding them can be tricky. URLs sometimes differ only by small variations (for example, with and without a trailing slash).

Ultimately, the best way to find duplicate URLs is to sort the site's URLs alphabetically and review them manually.
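A script can at least shorten the manual review by grouping URLs that differ only in trivial ways. The sketch below treats letter case and a trailing slash as the "small variations"; both rules are assumptions chosen for illustration, and your site may need different ones. The helper names are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def normalize(url):
    """Lowercase the path and drop a trailing slash; keep any query string."""
    parts = urlsplit(url)
    path = parts.path.lower().rstrip("/") or "/"
    return path + ("?" + parts.query if parts.query else "")


def duplicate_groups(urls):
    """Map each normalized URL to its variants, keeping only real duplicates."""
    groups = defaultdict(set)
    for url in urls:
        groups[normalize(url)].add(url)
    return {key: sorted(variants) for key, variants in groups.items()
            if len(variants) > 1}


# Invented sample of crawled URLs
crawled = ["/Blog/", "/blog", "/blog/", "/about"]

for key, variants in duplicate_groups(crawled).items():
    print(key, "<-", variants)
```

The groups it reports are only candidates, so they still deserve a manual look before you redirect or canonicalize anything.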

View crawl frequency by subdirectory

Figuring out which subdirectories are crawled most often is another quick way to reveal crawl budget waste. But keep in mind that just because a client's blog has never earned a single backlink and only gets three views a year from the business owner's grandmother, that does not mean you should consider it crawl budget waste. The internal link structure should be consistently good throughout the site.
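One way to get this view without a pivot table is to count crawled URLs by their first path segment, as in the sketch below. The helper top_level_directory is hypothetical, and it makes a simplifying assumption: a single segment without a trailing slash (such as /about.html) is counted as a file in the site root.

```python
from collections import Counter
from urllib.parse import urlsplit


def top_level_directory(url):
    """Return the first-level directory of a URL, or "/" for root-level pages."""
    path = urlsplit(url).path
    segments = [segment for segment in path.split("/") if segment]
    if not segments:
        return "/"
    if len(segments) == 1 and not path.endswith("/"):
        # assumption: a lone segment such as /about.html is a file in the root
        return "/"
    return "/" + segments[0] + "/"


# Invented sample of crawled URLs
crawled = ["/blog/post-1", "/blog/post-2", "/products/shoes", "/", "/about.html"]

directory_hits = Counter(top_level_directory(url) for url in crawled)
for directory, hits in directory_hits.most_common():
    print(hits, directory)
```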

View crawl frequency by content type

Finding out what content is being crawled, and whether any content types are hogging crawl budget, is an excellent check for spotting crawl budget waste. With this tactic, you can easily detect frequent crawling of unnecessary or low-priority CSS and JS files, or see how images are being crawled if you are trying to optimize for image search.
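If the log does not record a content type, the file extension of each crawled URL is a workable stand-in. The sketch below groups URLs that way; the extension-to-label mapping and the function name are assumptions for illustration, and anything without a known extension falls into a catch-all bucket.

```python
import posixpath
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical mapping; extend it to match the file types on your site
EXTENSION_GROUPS = {
    ".css": "css",
    ".js": "js",
    ".jpg": "image", ".jpeg": "image", ".png": "image", ".gif": "image",
    ".pdf": "pdf",
}


def content_group(url):
    """Guess a content type from the URL's file extension."""
    extension = posixpath.splitext(urlsplit(url).path)[1].lower()
    return EXTENSION_GROUPS.get(extension, "html/other")


# Invented sample of crawled URLs
crawled = ["/static/app.js?v=3", "/static/site.css", "/img/hero.PNG",
           "/products/", "/static/vendor.js"]

type_hits = Counter(content_group(url) for url in crawled)
for group, hits in type_hits.most_common():
    print(hits, group)
```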

There are many more ways to analyze how your site is crawled and to spy on Google's bots. We have shared only a few of them here.

Conclusion: analyzing log files is not as scary as it sounds

With a few simple tools at your disposal, you can dive deep into how Googlebot behaves. Once you understand how a website handles crawling, you can diagnose more problems. The real power of log file analysis is that you can test your own theories about Googlebot and extend the methods described above to gather your own insights and revelations.

Futureinapps provides SEO website promotion for businesses.