We’re slowly moving towards making GreatSchools XHTML compliant (we have a long way to go, though)! To start, we’ve begun using the proper XHTML entity reference &amp; as the parameter separator in URLs, instead of a plain old &, in a few places.
What’s interesting about this is that we’re seeing some errors in our web logs from IPs with modern user agents trying to access pages via GET /foo.page?param1=value1&amp;param2=value2 instead of GET /foo.page?param1=value1&param2=value2. In other words, the literal text "&amp;" is showing up in the requested URL.
I’ve tested extensively in modern browsers across platforms and haven’t found a single browser that doesn’t translate the entity references properly. I suspect most of these are actually email harvesting crawlers just taking the value of the HREF and throwing it into a GET request without properly translating entity references.
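To make the difference concrete, here is a minimal Python sketch of the two behaviours. The URL and both function names are made up for illustration; the only real API used is `html.unescape` from the standard library, standing in for the entity decoding a browser's HTML parser performs on attribute values.

```python
from html import unescape

# HREF value exactly as it appears in the page source (hypothetical URL).
raw_href = "/foo.page?param1=value1&amp;param2=value2"


def browser_style_path(href: str) -> str:
    """Decode entity references before requesting, as an HTML parser does."""
    return unescape(href)


def naive_crawler_path(href: str) -> str:
    """Use the attribute text untouched, which is what we suspect these crawlers do."""
    return href


if __name__ == "__main__":
    print("browser:", browser_style_path(raw_href))
    print("crawler:", naive_crawler_path(raw_href))
```

The first function yields the URL with a bare & separator; the second keeps the entity reference in the path, which matches the requests in our logs.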
I think we may have found one more way to identify crawlers masquerading behind real-looking user agents!
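As a rough sketch of how that check could look, the Python below flags access-log lines whose request URL still contains a literal "&amp;" in the query string. It assumes an Apache-style log where the request appears as a quoted "METHOD path HTTP/x.y" field; the regular expression, the function name, and the sample lines are all assumptions for illustration, not our actual tooling. A hit is only a hint, since a legitimate URL could in principle carry that text.

```python
import re

# Assumed log layout: the request is a quoted field such as
# "GET /foo.page?a=1 HTTP/1.1" (Apache common/combined style).
REQUEST_RE = re.compile(r'"(?:GET|HEAD|POST) (\S+) HTTP/[\d.]+"')


def has_undecoded_entity(log_line: str) -> bool:
    """Return True if the request's query string contains a literal '&amp;'."""
    match = REQUEST_RE.search(log_line)
    if match is None:
        return False
    # Only the part after the first '?' is the query string.
    _path, _sep, query = match.group(1).partition("?")
    return "&amp;" in query


if __name__ == "__main__":
    # Made-up sample lines using documentation-range IP addresses.
    samples = [
        '203.0.113.7 - - [01/Jan/2000:00:00:00 +0000] '
        '"GET /foo.page?param1=value1&amp;param2=value2 HTTP/1.1" 404 512',
        '203.0.113.8 - - [01/Jan/2000:00:00:01 +0000] '
        '"GET /foo.page?param1=value1&param2=value2 HTTP/1.1" 200 2048',
    ]
    for line in samples:
        print(has_undecoded_entity(line), line.split('"')[1])
```

Grouping the flagged lines by IP and user agent would then give a list of candidates to look at more closely.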