Crawler side effects of using XHTML entity references

We’re slowly moving towards making GreatSchools XHTML compliant (though we have a long way to go)! To start, we’ve begun using the proper XHTML entity reference &amp; as the separator in URLs, instead of a plain old &, in a few places.

What’s interesting about this is that we’re seeing errors in our web logs from IPs with modern user agents trying to access pages via GET /foo.page?param1=value1&amp;param2=value2 instead of GET /foo.page?param1=value1&param2=value2.

I’ve tested extensively in modern browsers across platforms and haven’t found a single browser that doesn’t translate the entity references properly. I suspect most of these are actually email harvesting crawlers just taking the value of the HREF and throwing it into a GET request without properly translating entity references.
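To make the difference concrete, here is a minimal Python sketch of the two behaviours. It is only an illustration under assumptions: the MARKUP string and the function names are hypothetical, and the regex stands in for whatever a sloppy crawler actually does. The standard library's HTMLParser decodes entity references in attribute values, much as a browser does, while a raw pattern match returns the HREF exactly as it was written in the markup.

```python
import re
from html.parser import HTMLParser

# Hypothetical markup, using the example URL from this post.
MARKUP = '<a href="/foo.page?param1=value1&amp;param2=value2">link</a>'


class HrefCollector(HTMLParser):
    """Collects href values from <a> tags; HTMLParser decodes entities in attributes."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def hrefs_like_a_browser(markup):
    """Parse the markup properly, so &amp; in the HREF becomes a plain &."""
    parser = HrefCollector()
    parser.feed(markup)
    parser.close()
    return parser.hrefs


def hrefs_like_a_naive_crawler(markup):
    """Grab the raw attribute text without translating entity references."""
    return re.findall(r'href="([^"]*)"', markup)


if __name__ == "__main__":
    print(hrefs_like_a_browser(MARKUP))
    print(hrefs_like_a_naive_crawler(MARKUP))
```

The first function yields the URL a browser would request; the second yields the one with the literal &amp; that shows up in our logs.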

I think we may have found one more way to identify crawlers masquerading behind real-looking user agents!
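As a rough sketch of how that check might look, the Python below counts requests per IP whose query string still contains a literal &amp;. It assumes access logs in the common/combined log format; the regex, the suspicious_ips function, and the sample lines (with documentation-range IPs) are all hypothetical and not taken from our actual setup. A hit is a hint worth reviewing, not proof that the client is a crawler.

```python
import re
from collections import Counter

# Assumes common/combined log format: IP, two fields, [timestamp], "METHOD target PROTO".
REQUEST_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "(?:GET|HEAD|POST) (\S+)[^"]*"')


def suspicious_ips(log_lines):
    """Count, per client IP, requests whose query string contains a literal &amp;."""
    hits = Counter()
    for line in log_lines:
        match = REQUEST_RE.match(line)
        if not match:
            continue  # skip lines that do not look like the assumed format
        ip, target = match.groups()
        _, _, query = target.partition("?")
        if "&amp;" in query.lower():
            hits[ip] += 1
    return hits


if __name__ == "__main__":
    # Hypothetical sample lines for illustration only.
    sample = [
        '192.0.2.10 - - [01/Jan/2008:00:00:01 +0000] '
        '"GET /foo.page?param1=value1&amp;param2=value2 HTTP/1.1" 404 512 "-" "Mozilla/5.0"',
        '192.0.2.20 - - [01/Jan/2008:00:00:02 +0000] '
        '"GET /foo.page?param1=value1&param2=value2 HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
    ]
    for ip, count in suspicious_ips(sample).items():
        print(ip, count)
```

Only the query string is checked, so a path that happens to contain the text &amp; is not flagged.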
