Egy beteg srác naplója

urlencode

Re: Google bites IDNs

Thanks to all the feedback, it turned out that the Google bug mentioned earlier only comes up when you use Firefox and are logged into your Google Account. When you log out, or simply copy the link from the results, it works fine, ‘cos only the JS-based tracking (Personal Search) escapes it the wrong way.

What’s more, Matt Cutts has already run into another IDN issue, as he wrote on his blog:

Q: “Any results on why IDN Domains don’t show pagerank?”
A: I’ve seen a couple that do, but I’ll check into why most don’t. My guess is that there’s a normalization issue somewhere in the toolbar PageRank pathway.

Google bites IDNs

Poor Google is a bit buggy. Its coders have already faced some character encoding issues before, and now they have problems with domains containing international (non-ASCII) characters. The source of the bug I found recently is that, using some JavaScript magic, Google doesn’t forward you directly to the given search result. It handles the hit itself for search analysis and user tracking, and then redirects you to the real target. Let’s search for gábor.20y.hu. The link Google returns for the result looks something like this:

http://www.google.com/url?sa=t&ct=res&cd=1&url=http%3A//g%E1bor.20y.hu/&ei=TXA1RMPxB7viwQHc-6WuAw&sig2=5dhrtGyojR_GPShMOKCdjg

Of course the guys at Google are smart, so they encoded the url parameter. Did they do it right? Not exactly. In internationalized domain names, special characters are not encoded the way URL parameters are. They have their own encoding scheme (Punycode), e.g. gábor becomes xn--gbor-5na for nameservers. So when you try to open a URL like the one above, Firefox will kindly notify you that “Firefox can’t find the server at g%c3%a1bor.20y.hu.” And it has a point: that server really doesn’t exist. So folks, remember to use URL encoding carefully: do not percent-encode domain names, only their GET parameters.
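
To make the difference concrete, here is a minimal Python sketch of the two encodings. It is only an illustration, not what Google actually runs: the helper names to_ascii_host and build_url are made up for this example, and it assumes Python's built-in "idna" codec (which implements the older IDNA 2003 rules) is good enough for a name like gábor.

```python
from urllib.parse import quote, urlencode


def to_ascii_host(host: str) -> str:
    """Hypothetical helper: convert an IDN host name to its ASCII (Punycode) form."""
    # The built-in "idna" codec encodes each dot-separated label separately.
    return host.encode("idna").decode("ascii")


def build_url(host: str, params: dict) -> str:
    """Hypothetical helper: Punycode for the host, percent-encoding for the query."""
    return "http://" + to_ascii_host(host) + "/?" + urlencode(params)


host = "gábor.20y.hu"

# Wrong for a host name: percent-encoding its UTF-8 bytes.
print(quote(host))                    # g%C3%A1bor.20y.hu

# Right for a host name: the form nameservers understand.
print(to_ascii_host(host))            # xn--gbor-5na.20y.hu

# Percent-encoding still belongs in the GET parameters.
print(build_url(host, {"q": "gábor"}))
```

The last line should give a URL whose host is in Punycode while the q parameter stays percent-encoded, which is exactly the split described above.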