User agent is a value that a bot or browser sends to your web server each time it opens one of your pages. Traffic analysis tools rely on it, but you can also put it to work for other website maintenance tasks. This article gives you some tips for user agent logging and reporting.
With the growth of search engines and traffic analytics, user agent has become more important to website owners: it is the main way to detect who or what is accessing a site. For instance, Google's crawler identifies itself with a Googlebot user agent when it fetches content for the search index, and Firefox sends a user agent string that starts with Mozilla. User agent has several uses that you can leverage for your online business.
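For reference, the user agent arrives as an HTTP request header. A Firefox browser on Windows sends something along these lines (exact version numbers vary):

    User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:124.0) Gecko/20100101 Firefox/124.0

while Google's main crawler sends its published string:

    User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)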
Tell Bots Apart from Real Users
Web browsers and bots send a user agent value to your web server with every page request. You can read this value in your server-side code and use it to log sessions in your database. For instance, if you want to build customized reports from your own data and server logs, you would take the user agent value from each visitor, insert it into a table, and then query that table to analyze your site's traffic.
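As a minimal sketch of this kind of logging, assuming a Flask application and a SQLite database (the traffic.db file and the visits table are illustrative names, not part of any standard):

    import sqlite3
    from flask import Flask, request

    app = Flask(__name__)

    def log_visit(user_agent, ip_address):
        # One row per request; "visits" is a hypothetical table name.
        conn = sqlite3.connect("traffic.db")
        conn.execute(
            "CREATE TABLE IF NOT EXISTS visits ("
            "user_agent TEXT, ip TEXT, ts TIMESTAMP DEFAULT CURRENT_TIMESTAMP)"
        )
        conn.execute(
            "INSERT INTO visits (user_agent, ip) VALUES (?, ?)",
            (user_agent, ip_address),
        )
        conn.commit()
        conn.close()

    @app.before_request
    def record_request():
        # The bot or browser identifies itself in the User-Agent header.
        log_visit(request.headers.get("User-Agent", ""), request.remote_addr)

A query such as SELECT user_agent, COUNT(*) FROM visits GROUP BY user_agent then gives you a quick breakdown of your traffic by user agent.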
When you create reports from user agent data, you should also log the visitor's IP address. With the IP address, you can verify that the user agent is not spoofed.
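Google documents one such check for Googlebot: a reverse DNS lookup on the IP should resolve to a googlebot.com or google.com hostname, and a forward lookup on that hostname should return the same IP. A sketch of that check in Python:

    import socket

    def is_real_googlebot(ip_address):
        # Check that an IP claiming to be Googlebot really belongs to Google.
        try:
            # Reverse lookup: the hostname should be under googlebot.com or google.com.
            hostname = socket.gethostbyaddr(ip_address)[0]
            if not hostname.endswith((".googlebot.com", ".google.com")):
                return False
            # Forward lookup: the hostname must resolve back to the same IP,
            # or the reverse DNS record itself may be faked.
            return ip_address in socket.gethostbyname_ex(hostname)[2]
        except (socket.herror, socket.gaierror):
            return False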
Block Search Engines from Crawling Certain Content
You probably have some content that you don't want bots to crawl. In fact, letting search engines index certain content, such as tag clouds, thin pages, and executable scripts, can hurt your search ranking. To keep search engines away from that content, you name the bots' user agents in a robots.txt file and block them there.
Robots.txt is a small text file that sits in the root directory of your website. It has a specific format that tells bots what to crawl and what to skip. Note that some bots ignore the robots.txt directives; in that case, your content still gets scraped and republished elsewhere. For this reason, never rely on robots.txt to protect sensitive data. Always place sensitive data in a folder with permissions that block public access.
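For example, a robots.txt that keeps all crawlers out of a tag directory and a script folder, and shuts one badly behaved bot out entirely, might look like this (the paths and the BadBot name are placeholders for your own):

    User-agent: *
    Disallow: /tags/
    Disallow: /cgi-bin/

    User-agent: BadBot
    Disallow: /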
View Your Site as Google and Other Search Engines See It
While it isn't recommended, some site owners detect user agent and vary the content they serve based on the detected value. If you do this, you can write test scripts that request specific pages while presenting a specific user agent; the response shows you the HTML a bot would see when it crawls those pages. Test with each user agent you serve distinct content to.
This type of testing is useful when you render content based on user agent and need to verify that content before releasing it to your live servers.
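A simple way to run such a test is a script that requests a page while presenting a bot's user agent. A sketch using Python's requests library (the URL is a placeholder):

    import requests

    # Googlebot's published user agent string; substitute any bot you serve content to.
    GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

    def fetch_as(url, user_agent):
        # Request the page while presenting the given user agent.
        response = requests.get(url, headers={"User-Agent": user_agent}, timeout=10)
        return response.text

    # Compare what the bot would see against what a normal browser sees.
    print(fetch_as("https://example.com/some-page", GOOGLEBOT_UA))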
One problem with user agent is that it can be spoofed. Many scrapers are malicious scripts, though some send a unique user agent that identifies their crawling. To verify that a user agent isn't spoofed, you should also log the crawler's IP address: some malicious scrapers spoof Google's user agent, and without the IP you would miss the scraper's activity.
For a scraper that does use its own unique user agent, that value tells you when your website is being scraped. You can then block that user agent from accessing your site, or block the scraper by its IP address.
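As a self-contained Flask sketch of that kind of blocking (the blocked names and the 203.0.113.42 address are illustrative):

    from flask import Flask, abort, request

    app = Flask(__name__)

    # Values pulled from your own traffic logs; both lists are examples.
    BLOCKED_USER_AGENTS = {"BadBot/1.0"}
    BLOCKED_IPS = {"203.0.113.42"}

    @app.before_request
    def block_known_scrapers():
        ua = request.headers.get("User-Agent", "")
        # Reject the request with 403 Forbidden before any page logic runs.
        if any(bad in ua for bad in BLOCKED_USER_AGENTS) or request.remote_addr in BLOCKED_IPS:
            abort(403)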
You can't fully rely on user agent to know who is browsing or crawling your site, but it's still a useful reporting tool. Be aware that user agent can be spoofed and that some bots check for cloaking by sending no user agent value at all; Google, for instance, sends no user agent when it scans your site for malware. Overall, user agent is a great way to build customized reports on your website traffic.