Valid GoogleBot blocked

    #7892
    Schneider
    Participant

    I noticed the following entry in my security log. I checked the IP address and it belongs to Google. As far as I know, this means that access to my homepage was blocked, which would not be good. How can I better understand what was going on, and how can I prevent the block in the future?

    >>>>>>>>>>> 403 GET or Other Request Error Logged - July 22, 2013 - 12:48 <<<<<<<<<<<
    REMOTE_ADDR: 66.249.78.86
    Host Name: crawl-66-249-78-86.googlebot.com
    SERVER_PROTOCOL: HTTP/1.1
    HTTP_CLIENT_IP:
    HTTP_FORWARDED:
    HTTP_X_FORWARDED_FOR:
    HTTP_X_CLUSTER_CLIENT_IP:
    REQUEST_METHOD: GET
    HTTP_REFERER:
    REQUEST_URI: /
    QUERY_STRING:
    HTTP_USER_AGENT: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
    #7894
    AITpro Admin
    Keymaster

    BPS does not block the Googlebot or any good/legitimate bots.  You can confirm this by logging into your Google Webmaster Tools account and using the Fetch as Google tool.  If you do not have a Google Webmaster Tools account, you can create one at the link below.
    https://www.google.com/webmasters/tools/

    This appears to be the Googlebot, but there are no guarantees that it really is the Googlebot.  IP addresses, hostnames and user agents can all be easily faked. If a URL contains dangerous coding characters, such as a single quote/apostrophe, then it will be blocked, but that does not appear to be the case in this logged error. Another thing that happens at random is that a legitimate crawl contains several different elements and one of those elements is blocked, but not the important parts, such as the general crawl itself and, more importantly, the indexing of a post/page/etc.
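
    Just as an illustration (this is not a BPS feature), below is a minimal PHP sketch of the reverse-DNS-plus-forward-confirmation check that Google recommends for verifying the Googlebot. The IP address is taken from the logged entry above; everything else is assumed for the example.

    <?php
    // Verify a claimed Googlebot: reverse-resolve the IP to a hostname,
    // then forward-resolve that hostname and confirm it maps back to the
    // same IP. User agents and hostnames alone can be faked; this cannot.
    $ip   = '66.249.78.86'; // REMOTE_ADDR from the Security Log entry
    $host = gethostbyaddr($ip);
    $looksLikeGoogle = (bool) preg_match('/\.(googlebot|google)\.com$/', (string) $host);
    if ($looksLikeGoogle && gethostbyname($host) === $ip) {
        echo "Verified Googlebot: $host\n";
    } else {
        echo "NOT a verified Googlebot (resolved host: $host)\n";
    }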

    In summary, check your site URL using the Fetch as Google tool in Google Webmaster Tools to confirm that everything is actually fine. Additional Googlebot help, just as general FYI: https://support.google.com/webmasters/answer/182072

    #26215
    alexb
    Participant

    I have a problem: Googlebot is being blocked. I get this in my security log (strangely, on only one of my sites, which is configured exactly like all the others, where I did not get that log entry):

    [403 GET|HEAD Request: November 12, 2015 9:45 am]
    Event Code: PFWR-PSBR-HPR
    Solution: http://forum.ait-pro.com/forums/topic/security-log-event-codes/
    REMOTE_ADDR: 66.249.65.89
    Host Name: crawl-66-249-65-89.googlebot.com
    SERVER_PROTOCOL: HTTP/1.1
    HTTP_CLIENT_IP:
    HTTP_FORWARDED:
    HTTP_X_FORWARDED_FOR:
    HTTP_X_CLUSTER_CLIENT_IP:
    REQUEST_METHOD: GET
    HTTP_REFERER: http://www.mysite.com/
    REQUEST_URI: /wp-content/plugins/contact-form-7/includes/js/jquery.form.min.js?ver=3.51.0-2014.06.20
    QUERY_STRING:
    HTTP_USER_AGENT: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

    It seems that if I add the Googlebot user agent to the security log ignore list, the request just won’t be logged anymore, but it would still get blocked, correct?

    How can I whitelist Googlebot in BPS Pro so that it can access my site?  How can I find out why it is getting blocked in the first place?

    Note: I did the “fetch as google” in Webmaster Tools right before these errors appeared in my log. This is a brand new site not yet indexed in Google (on a domain that has never been used before), so the chance of this being an attack with a fake Googlebot user agent is almost nonexistent. In Webmaster Tools, Google can access the site, but it can’t access certain stylesheets etc., exactly those that show up in the BPS security log!

    Edit: I just discovered this in the logs of older sites, too, so it’s definitely not a one-site problem. I have a few custom code snippets, but only ones from this forum (about half of those proposed in the notices at the top of the WordPress dashboard after installing BPS Pro). Could one of them be the cause?

    Edit 2: Interesting: after some more troubleshooting, all fetches (with rendering) in Webmaster Tools say “error” for exactly those .js files protected by the Plugin Firewall (PF). And indeed, when I turn the PF off, Google can access these resources without issues again. Any way around this?
    ________________
    OK, one more thing after further testing: this doesn’t appear to happen on all sites, just some. Others can be crawled totally fine without the PF causing any Googlebot errors or even security log entries. Like I said, all my sites have the same custom code/settings etc., so I’m not sure why this would happen on only some and not all or none. Any ideas?

    #26229
    AITpro Admin
    Keymaster

    First, it is a common belief that a brand new website will not get crawled or attacked by bots, but that is not true: bots automatically crawl or attack any website they find, at any time.  Bots crawl randomly and automatically, all day, every day.

    Next, this file: /wp-content/plugins/contact-form-7/includes/js/jquery.form.min.js should not be crawled by the Googlebot at all since it does not contain any content for indexing and is instead a frontloading js plugin script for the Contact Form 7 plugin.

    So instead of trying to check this file with Google fetch tools, what is more important to check is that the Plugin Firewall and Plugin Firewall AutoPilot Mode are functioning correctly on this website.  Do the steps below.

    Fix all general Plugin Firewall issues/problems:
    1. Go to the BPS Security Log page and click the Delete Log button to delete your current Security Log file contents.
    2. Go to the Plugin Firewall page.
    3. Click the Plugin Firewall BulletProof Mode Deactivate button.
    4. Delete all of your Plugin Firewall whitelist rules from the Plugins Script|File Whitelist Text Area (or cut them if you want to add your existing whitelist rules back later).
    5. Click the Save Whitelist Options button.
    6. Click the Plugin Firewall Test Mode button.
    7. Check your site pages by clicking on all main website pages: contact form page, home page, login page, etc.
    8. Recheck the Plugins Script|File Whitelist Text Area (after 1 minute) and you should see that new Plugin Firewall whitelist rules have been created (a rough sketch of the general idea appears after these steps).
    9. Change the AutoPilot Mode Cron Check Frequency to 15 minutes or whatever frequency you would like to use.
    10. Click the Plugin Firewall Activate button.
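
    For general FYI only: the Plugin Firewall does its blocking with .htaccess rules, and the sketch below is a generic mod_rewrite illustration of the whitelisting concept, using the Contact Form 7 script from the Security Log entry above. It is an assumption for illustration, not the literal rules BPS Pro generates, which is exactly why AutoPilot Mode should create the real rules instead of writing them by hand.

    # Generic mod_rewrite illustration only - NOT the literal rules BPS Pro writes.
    # Deny requests for plugin scripts unless the script is whitelisted as
    # frontloading or the request comes from your own IP address.
    RewriteEngine On
    # Placeholder IP - your actual IP address would go here
    RewriteCond %{REMOTE_ADDR} !^127\.0\.0\.1$
    # Whitelisted frontloading script (from the Security Log entry above)
    RewriteCond %{REQUEST_URI} !contact-form-7/includes/js/jquery\.form\.min\.js$ [NC]
    RewriteRule \.(js|css)$ - [F]

    An invalid rule or stray whitespace in a file like this is the kind of thing the reset steps above clear out.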

    #26246
    alexb
    Participant

    Fair enough regarding bots and new sites, but if the site is not even indexed in any search engine, and that particular domain name was never used before, the only way for a bot to actually find it is to randomly put letters and domain extensions together and land on it. Winning the lotto jackpot should be more likely than that, since you would only have to get 6 guesses right, not 16 (the number of characters in the domain name). Or are there other ways for bots to find a site? Since the domain was never in use before, there are of course no links pointing to it either.
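
    As a rough back-of-the-envelope illustration of that comparison (the numbers are my assumptions: a 37-character alphabet of a-z, 0-9 and hyphen, and a 6-of-45 lotto):

    <?php
    // Rough odds comparison - illustrative assumptions only.
    $domainSpace = pow(37, 16); // 16 characters from a 37-char alphabet, ~1.2e25
    $lottoOdds   = 8145060;     // 6-of-45 lotto jackpot: C(45,6) combinations
    printf("The domain space is about %.1e times the lotto odds.\n",
           $domainSpace / $lottoOdds);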

    Anyway, the belief that .js and .css files are useless for bots (some even suggest blocking bots from these file types in robots.txt) is quite an old one. Google has stated several times in the past few years that access to css and js files should not be blocked; some people (including me) even got an email in Webmaster Tools saying that blocking these resources may have a negative impact on indexing and can result in “suboptimal rankings”. More info: http://googlewebmastercentral.blogspot.co.at/2014/10/updating-our-technical-webmaster.html
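
    For reference, the explicit way to allow crawlers to fetch these resources in robots.txt looks like the minimal example below (the paths are assumptions; adjust them to your site). Note that robots.txt only controls crawling permission; it cannot fix a 403 that the server itself returns, which is what the Plugin Firewall was doing here.

    User-agent: Googlebot
    Allow: /wp-content/plugins/
    Allow: /*.js$
    Allow: /*.css$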

    I guess the real reason they want to access this stuff is so they can more easily detect blackhat spammers who try to hide links or do other shady stuff in the source code.

    The PF was functioning correctly, and after following your steps, the rules that were re-created appear to be the same as before. However, Google is now able to fetch the site correctly, so it looks like I have to do this manual PF reset on all sites where this issue occurs. Thanks for the fix!

    #26250
    AITpro Admin
    Keymaster

    There are other ways bots find websites, but that would take quite a lot of explanation, so I am not going to attempt it here.  😉  It has nothing to do with blocking or not blocking js or css files. The Googlebot should only be indexing content (post text) and not the code in js or css files, so there really is no reason for the Googlebot to crawl those scripts for indexing in the Google search database. A logical explanation would be that there was an invalid Plugin Firewall whitelist rule or a stray blank/whitespace character somewhere. So yes, if this same problem is occurring on other sites, then doing the manual Plugin Firewall steps and letting Plugin Firewall AutoPilot Mode create 100% valid/correct whitelist rules is the best method for correcting any problems/invalid code/etc.

    #26252
    alexb
    Participant

    Thanks.

    Maybe you should have a quick read here: http://googlewebmastercentral.blogspot.co.at/2014/10/updating-our-technical-webmaster.html  I agree that indexing post text/content has nothing to do with js or css, but at the end of the day none of us wants lower rankings because we blocked the Googlebot from accessing those files, so getting this solved was kind of a big deal to me. 😉

    #26255
    AITpro Admin
    Keymaster

    Yep, I read the Google post.  OK, I’ll try this again: Google should not be blocked from crawling any frontloading js or css scripts that load on the frontend of your website, since those need to be publicly accessible to all search engines and all visitors to your website, but the Googlebot does not index js or css scripts themselves. The Plugin Firewall blocks/restricts access to plugin files in your /plugins/ folder that only load on the backend of your website, blocking everyone (Google, etc.) except your current IP address, since those backend plugin scripts do not need to be publicly accessible to anyone except you/your website.

    #26258
    alexb
    Participant

    Thanks for your patience. Then I don’t understand why the Googlebot fetch would not work, and why it tried to access scripts that apparently only load on the backend. Yet a PF whitelist rule reset fixed the issue, so it clearly had something to do with that.

    I don’t want to get on your nerves, but when I check the source code of any post on my site (not logged into WP, in a different browser), I do see those .js and .css files in the source code (= frontend).
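
    A quick way to double-check which plugin scripts are frontloading is a sketch like the one below (my own example, not a BPS tool; the URL is the placeholder from the log above): fetch a page as an anonymous visitor and list the plugin assets its HTML references.

    <?php
    // Fetch a page the way an anonymous visitor would, then list the
    // plugin js/css files referenced in its HTML (frontloading scripts).
    $html = file_get_contents('http://www.mysite.com/'); // placeholder URL
    preg_match_all('#wp-content/plugins/[^"\']+\.(?:js|css)[^"\']*#', $html, $matches);
    foreach (array_unique($matches[0]) as $asset) {
        echo $asset, "\n";
    }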

    #26260
    AITpro Admin
    Keymaster

    The Contact Form 7 js script in the Security Log entry that you posted is a frontloading js script, not a backend js script, so once you fixed the Plugin Firewall problem, Google was able to access that frontloading js script successfully.  And you would not be getting on my nerves, since I am paid by the hour and this is my job. 🙂

    #26269
    alexb
    Participant

    Thanks. Hopefully the one-time payment for BPS Pro is enough to keep up amazing support like this!

    #26270
    AITpro Admin
    Keymaster

    Yep, definitely no problem with that.  The sales people, on the other hand, work grueling 10-hour shifts.  Glad I’m not doing that stuff.  😉
