We often need to hide, or noindex, certain pages or groups of pages in WordPress so that they do not show up in search engines. Tag archives (like this one) are good examples of pages that should be noindexed because they can contain what search engines consider “duplicate content”. Other good candidates for noindexing include thank you pages and author archive pages on single-author blogs. We usually handle these with an SEO plugin.
What about files? Sometimes you might host a PDF or Excel file on your website, and for a number of different reasons, you may not want that file to be indexed by search engines.
For example, if you use a PDF whitepaper to generate leads, you may not want to lock it behind a paywall. You want it to be readily downloadable by anyone who has a link to the file. At the same time, you wouldn’t want people to find your whitepaper via a Google search. Instead, you only want the file to be accessible to people who have shown interest by sending you their email address (or, for some other reason, you may only want the file to be accessible via the page on your site that links to it).
Since this is not a regular web page, the regular methods for noindexing pages may not work for your file and the question becomes: How do we hide (noindex) files in WordPress from search engines?
I explain a solution to this problem below and I demonstrate with live example files how you can hide (noindex) other types of media like PDF and Excel files in WordPress.
But before we get too technical, you may want to note that…
You May Be Able To Solve This Problem With The Yoast SEO Plugin
WordPress treats all uploaded files as “media”, and the popular Yoast SEO plugin lets you define how search engines should treat each media file on your WordPress website. So with just a few clicks, you may be able to add noindex, noarchive, and nosnippet meta tags for your media files.
From the edit page of the media file in question, the Yoast SEO settings you need to configure will look like this:
Easy eh? Well, maybe not.
The Yoast SEO media settings shown in the above screenshot are ONLY available if you have NOT already configured Yoast SEO to “Redirect attachment URLs to the attachment itself”.
Notice that this is the recommended setting. You need this setting for all the other image files contained on your site. And with this setting, you will not see any Yoast SEO settings box on your media edit pages.
So, if you’re using the recommended Yoast SEO settings, you will need another way to specify a noindex tag on your PDF or document file. The same applies if you’re not using Yoast SEO at all, or if you’re on an Apache web host but not using WordPress.
My solution below uses the .htaccess file and does not depend on Yoast SEO or even WordPress. It only needs an Apache host with the mod_headers module enabled, since that module provides the Header directive used below.
Hide (NoIndex) Files Using .htaccess
Here are two sample PDF files. The first one has been noindexed in my .htaccess file while the second one has not.
And here is the snippet of code added to my .htaccess file to achieve this:
<FilesMatch "EhiTestNoIndexed.pdf">
  Header set X-Robots-Tag "noindex, noarchive, nosnippet"
</FilesMatch>
If you are new to working with the .htaccess file, then you might want to check out my detailed article on the subject: Working With The .htaccess File.
If you wanted to noindex both files, you could get a little fancy with regular expressions like this:
<FilesMatch "^(EhiTestNoIndexed|EhiTestIndexed)\.pdf$">
  Header set X-Robots-Tag "noindex, noarchive, nosnippet"
</FilesMatch>
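To see which file names each pattern would catch, here is a rough Python sketch. It assumes that Python's re.search behaves like Apache's regex matching for patterns this simple (Apache uses PCRE, so treat this as an approximation, not a guarantee). The extra file names are hypothetical and exist only for illustration.

```python
import re

# Patterns copied from the two .htaccess snippets above.
SINGLE_FILE = r"EhiTestNoIndexed.pdf"
BOTH_FILES = r"^(EhiTestNoIndexed|EhiTestIndexed)\.pdf$"


def would_match(pattern, filename):
    """Approximate FilesMatch: look for the pattern anywhere in the file name."""
    return re.search(pattern, filename) is not None


# Hypothetical file names, used only to illustrate the matching behavior.
candidates = [
    "EhiTestNoIndexed.pdf",
    "EhiTestIndexed.pdf",
    "EhiTestNoIndexed.pdf.bak",
    "Other.pdf",
]

for name in candidates:
    print(name, would_match(SINGLE_FILE, name), would_match(BOTH_FILES, name))
```

The first pattern is not anchored, so it would also match a name like EhiTestNoIndexed.pdf.bak. The second pattern uses ^ and $ and an escaped dot, so it only matches the two exact file names.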
Why Not Just Disallow The Files Using robots.txt?
Because the instructions defined in your robots.txt file do not prevent search engines from indexing a file or web page.
True, you can stop search engines from crawling a resource using the robots.txt file. But if someone links to your file or page from a third party website, search engines will go ahead and index your file if they do not find an explicitly defined noindex tag on it.
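This is why you need the X-Robots-Tag response header, set in the .htaccess file as described above. Keep in mind that a crawler can only see this header if it is allowed to fetch the file, so do not also disallow the same file in robots.txt.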
How Do We Test This?
Install the Web Developer extension for Chrome if you don’t already have it. Then for the noindexed file, visit this link: EhiTestNoIndexed.pdf
Now click the “Information” tab of the Web Developer Chrome extension and press “View Response Headers”.
The result will be something like this:
date: Sat, 24 Nov 2018 23:11:59 GMT
last-modified: Sat, 24 Nov 2018 23:08:12 GMT
server: Apache/2.4.18
etag: "b881-57b712c8fa5ac"
content-type: application/pdf
accept-ranges: bytes
x-robots-tag: noindex, noarchive, nosnippet
content-length: 47233
200 OK
Notice the line with “x-robots-tag: noindex, noarchive, nosnippet”. This means search engines will not index this file.
If you repeat the same process for the EhiTestIndexed.pdf file, the response header will look like:
date: Sat, 24 Nov 2018 23:13:23 GMT
last-modified: Sat, 24 Nov 2018 23:08:59 GMT
server: Apache/2.4.18
accept-ranges: bytes
etag: "b991-57b712f5e943b"
content-length: 47505
content-type: application/pdf
200 OK
This one will be indexed and will show up in search engines.
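If you would rather check the header from a script than from the browser, something like the following Python sketch could work. The function names are my own and hypothetical. The parser is deliberately simple: it assumes a single X-Robots-Tag header with comma-separated directives and ignores the optional user-agent prefix that the header can carry. The sample header values are taken from the two responses shown above.

```python
from urllib.request import Request, urlopen


def fetch_headers(url):
    """Send a HEAD request and return the response headers with lowercased names."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=10) as response:
        return {name.lower(): value for name, value in response.headers.items()}


def robots_directives(headers):
    """Split an X-Robots-Tag header value into a set of individual directives."""
    value = headers.get("x-robots-tag", "")
    return {part.strip().lower() for part in value.split(",") if part.strip()}


def is_noindexed(headers):
    """Return True if the headers ask search engines not to index the resource."""
    return "noindex" in robots_directives(headers)


# Header values copied from the two sample responses shown above.
noindexed_headers = {
    "content-type": "application/pdf",
    "x-robots-tag": "noindex, noarchive, nosnippet",
}
indexed_headers = {"content-type": "application/pdf"}

print(is_noindexed(noindexed_headers))
print(is_noindexed(indexed_headers))
```

To test a live file, you would pass its address to fetch_headers and hand the result to is_noindexed, for example is_noindexed(fetch_headers("https://example.com/EhiTestNoIndexed.pdf")), where the example.com address is a placeholder for your own file's URL.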