Crawler side effects of using XHTML entity references

We’re slowly moving towards making GreatSchools XHTML compliant (we have a long way to go, though)! To start, we’ve begun using the proper XHTML entity reference &amp;amp; as the separator in URLs, instead of a plain old &amp;, in a few places.

What’s interesting about this is that we’re seeing errors in our web logs from IPs with modern user agents trying to access pages via GET /foo.page?param1=value1&amp;amp;param2=value2 instead of GET /foo.page?param1=value1&amp;param2=value2.

I’ve tested extensively in modern browsers across platforms and haven’t found a single browser that fails to translate the entity references properly. I suspect most of these requests actually come from email-harvesting crawlers that just take the value of the HREF and throw it into a GET request without decoding the entity references.
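To make the difference concrete, here is a minimal Python sketch of the two behaviours. It is only an illustration: the URL is the placeholder from above, the variable names are made up, and html.unescape stands in for the entity decoding a browser does on attribute values, not for any particular browser's or crawler's actual code.

```python
from html import unescape

# The href attribute value exactly as it appears in the XHTML source.
raw_href = "/foo.page?param1=value1&amp;param2=value2"

# What a browser effectively does: decode entity references in the
# attribute value before building the request.
browser_url = unescape(raw_href)

# What a sloppy crawler seems to do: use the raw attribute text unchanged.
naive_crawler_url = raw_href

print("browser requests:", browser_url)
print("crawler requests:", naive_crawler_url)
```

The first URL has a plain & between the parameters; the second still carries the literal entity reference, which is the pattern showing up in our logs.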

I think we may have found one more way to identify crawlers masquerading behind real-looking user agents!
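As a rough sketch of how that check might look, the Python below flags a request line whose query string still contains an undecoded entity reference. The function name and the sample request lines are hypothetical, and it assumes the log gives you the request line in the usual "METHOD target PROTOCOL" form. Treat a match as a hint rather than proof: a legitimate parameter value containing an ampersand would normally arrive percent-encoded, but that is an assumption about well-behaved clients.

```python
def has_undecoded_entity(request_line: str) -> bool:
    """Return True if the request target's query string contains a literal '&amp;'."""
    parts = request_line.split()
    if len(parts) < 2:
        # Not a recognisable "METHOD target ..." request line.
        return False
    target = parts[1]
    # Only look at the part after the first '?', i.e. the query string.
    _, _, query = target.partition("?")
    return "&amp;" in query


# Hypothetical request lines, modelled on the example URLs above.
suspect = "GET /foo.page?param1=value1&amp;param2=value2 HTTP/1.1"
normal = "GET /foo.page?param1=value1&param2=value2 HTTP/1.1"

print(has_undecoded_entity(suspect))
print(has_undecoded_entity(normal))
```

Combined with the user agent field, this could separate real browsers from crawlers that merely claim to be one.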

