Crawler side effects of using XHTML entity references

We’re slowly moving towards making GreatSchools XHTML compliant (we have a long way to go, though)! To start, we’ve begun using the proper XHTML entity reference &amp; as the separator in URLs instead of a plain old & in a few places.

What’s interesting about this is that we’re seeing some errors in our web logs from IPs with modern user agents trying to access pages via GET /foo.page?param1=value1&amp;param2=value2 instead of GET /foo.page?param1=value1&param2=value2.

I’ve tested extensively in modern browsers across platforms and haven’t found a single browser that doesn’t translate the entity references properly. I suspect most of these are actually email harvesting crawlers just taking the value of the HREF and throwing it into a GET request without properly translating entity references.
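
To make the difference concrete, here is a minimal Python sketch of the two behaviours I suspect we are seeing. The function names and the sample href are made up for illustration; the only real API used is `html.unescape` from the standard library, which decodes entity references roughly the way a browser does for attribute values.

```python
from html import unescape

# Hypothetical href exactly as it is written in the XHTML source.
raw_href = "/foo.page?param1=value1&amp;param2=value2"


def browser_style_url(href: str) -> str:
    """Decode entity references before requesting, as a browser would."""
    return unescape(href)


def naive_crawler_url(href: str) -> str:
    """Use the attribute text verbatim, which is what I suspect the crawlers do."""
    return href


print(browser_style_url(raw_href))
print(naive_crawler_url(raw_href))
```

The first call yields the URL with a plain & separator; the second keeps the literal &amp; and produces the kind of request showing up in our logs.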

I think we may have found one more way to identify crawlers masquerading behind real looking user agents!
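
As a rough sketch of how this could be put to use, the snippet below counts requests per IP whose query string still contains a literal &amp;. It assumes access logs in the common/combined log format (client IP first, request line in double quotes); the function name and regular expression are my own and are not what we actually run in production. A hit is only a hint, not proof, so treat the output as a list of IPs worth a closer look.

```python
import re
from collections import Counter

# Assumption: common/combined log format, e.g.
# 203.0.113.7 - - [date] "GET /path?query HTTP/1.1" 404 512 "-" "agent"
REQUEST_RE = re.compile(r'^(?P<ip>\S+) .*?"(?:GET|HEAD|POST) (?P<path>\S+)')


def suspicious_ips(log_lines):
    """Count requests per IP whose query string contains an undecoded &amp;."""
    counts = Counter()
    for line in log_lines:
        match = REQUEST_RE.match(line)
        if not match:
            continue  # skip lines that do not look like a request
        _, _, query = match.group("path").partition("?")
        if "&amp;" in query:
            counts[match.group("ip")] += 1
    return counts


sample = [
    '203.0.113.7 - - [10/Oct/2008:13:55:36 -0700] '
    '"GET /foo.page?param1=value1&amp;param2=value2 HTTP/1.1" 404 512 "-" "Mozilla/5.0"',
    '198.51.100.2 - - [10/Oct/2008:13:55:40 -0700] '
    '"GET /foo.page?param1=value1&param2=value2 HTTP/1.1" 200 2048 "-" "Mozilla/5.0"',
]
print(suspicious_ips(sample))
```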
