Under review

Trouble with bots and crawlers

Mike Stickler 10 years ago in BLOX CMS updated by Patrick O'Lone (Director of Software Development) 10 years ago
Is anyone else having bandwidth overages? When we tried to track ours back, we discovered msnbot was accounting for 24%-33% of total page views. I am loath to block it completely, but changing the crawl delay had no effect.

By way of comparison, googlebot, yahoo (slurp), and bingbot each seem to account for about 6% of our pageviews.

How are you dealing with the issue?
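For anyone trying to quantify this on their own site, here is a rough sketch of measuring a bot's share of requests from a combined-format access log. The log path and the sample log lines are made up purely for illustration; point the script at your real access log instead.

```shell
#!/bin/sh
# Sketch: measure msnbot's share of requests in a combined-format access log.
# The file below is a hypothetical sample, not real traffic.
cat > /tmp/access_sample.log <<'EOF'
1.2.3.4 - - [10/Feb/2015:10:00:00 +0000] "GET / HTTP/1.1" 200 5120 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
5.6.7.8 - - [10/Feb/2015:10:00:01 +0000] "GET /news/ HTTP/1.1" 200 4096 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
1.2.3.4 - - [10/Feb/2015:10:00:02 +0000] "GET /sports/ HTTP/1.1" 200 4096 "-" "msnbot-NewsBlogs/2.0b (+http://search.msn.com/msnbot.htm)"
9.9.9.9 - - [10/Feb/2015:10:00:03 +0000] "GET /weather/ HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (Windows NT 6.1; rv:35.0)"
EOF

log=/tmp/access_sample.log
total=$(wc -l < "$log")
# -c counts matching lines, -i matches msnbot, msnbot-NewsBlogs, etc. case-insensitively
msn=$(grep -ci 'msnbot' "$log")
# Integer percentage of all requests made by msnbot variants
echo "msnbot: $msn of $total requests ($((100 * msn / total))%)"
```

On the sample above this reports msnbot at 2 of 4 requests; run it against a day or a week of real logs to get a meaningful share.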
Have you tried setting up webmaster tools in Bing? The 'crawl delay' feature was something the bots experimented with, but ultimately those kinds of controls have been shuffled into the webmaster tools side of those systems. You can control throttling there in at least Microsoft's and Google's implementations. Frustratingly, all of the providers run different bots against the same content for different purposes.
In fact I did, and I noticed some effect on bingbot's behavior. After I updated the sitemaps and settings in the Bing webmaster tools, I saw an uptick in bingbot's crawl rate, which then dropped off to about half its previous consumption.

Neither the webmaster tools nor the crawl delay seemed to have any impact on msnbot-newsblogs or msnbot-udiscovery. Those two bots were accounting for a third of our total pages served. Your staff did confirm that the IP addresses the activity was coming from were Microsoft's, but in the end I disallowed them. I have not seen any drop in our search results from msn.com/bing, but I did notice a lack of our content on news.msn.com. That concerns me a bit, but I had not seen any referrals from that domain in months anyway.
What did you set the crawl delay to? You might need to explicitly target msnbot. I saw this older post on their blog:

Originally it was not set. I set the crawl delay to 3 and left it for a week. There was no noticeable decrease in msnbot's activity, so I explicitly targeted msnbot and set it to 6. Same result, so I changed it to 10...
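For anyone following along, an explicitly targeted crawl delay in robots.txt looks roughly like this. This is a sketch of the directive syntax, not the site's actual file:

```
# Group that applies only to msnbot; other bots fall through
# to any other groups (e.g. User-agent: *) in the file.
User-agent: msnbot
Crawl-delay: 10
```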

I checked with Dustin at TownNews and he confirmed the robots.txt was formatted correctly. His suspicion was that msnbot was not reading the robots.txt file on every crawl; instead it was caching that info and may have missed some of my changes. There is evidence to support this, as it was several days after I set the disallow before the crawl activity from msnbot actually ceased. I may go back and allow the crawl again but set the delay back to ten. I'll leave it like that until we start seeing activity again.
I still have the two msnbots blocked and they are not crawling our site. However, after talking to contacts at Microsoft, I have specifically re-enabled bingbot. It is the primary bot used for search results and does respect crawl-delay directives. It should be noted that if you block msnbot it will also block bingbot unless you specifically allow it. The two msnbots were combining for 25% of my total hits; bingbot is now consuming less than 5%.
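In robots.txt terms, the arrangement described above looks roughly like the sketch below. The exact group layout is an assumption on my part, not the site's actual file; the point is that bingbot gets its own group so the msnbot block does not sweep it up:

```
# Block the noisy msnbot variants outright.
User-agent: msnbot
Disallow: /

# Give bingbot its own group so it keeps crawling,
# throttled by a crawl delay it is said to respect.
User-agent: bingbot
Crawl-delay: 10
Disallow:
```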
Some template changes were made to add nofollow to links where bots might get into trouble, particularly on paging URLs. Have you noticed any decrease in spidering in the last couple of weeks?
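For context, that change amounts to adding rel="nofollow" on the pagination anchors, roughly like this. The URL and markup are illustrative, not the actual BLOX template:

```
<!-- A paging link that well-behaved crawlers are asked not to follow -->
<a href="/news/?page=2" rel="nofollow">Next page</a>
```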