Welcome! If you got here, you probably searched for "Could connect to localhost, but not to its IP address" or some variation of that phrase.
That's how I got here. To writing this post, that is. After many clever variations of that query failed to turn up an answer, we decided to troubleshoot the problem ourselves, ended up fixing it, and decided to blog about it for the benefit of the wider community.
If you're lucky, (a) our solution works for you and (b) this post shows up in the first few pages of your Google results, saving you the time you might otherwise have wasted following dead ends.
Here is the problem. We first discovered it in our Hadoop cluster, which had been set up by a contracted sysadmin who is no longer with us. The namenode could not talk to any of the datanodes. The logs showed "Connection refused". Debugging with "telnet -d datanode 54311" surfaced a misleading error message, with setsockopt complaining about a lack of permission. Funnily enough, when logged on to the datanode directly, we could connect to localhost or 127.0.0.1 and it would work perfectly, but every connection request to datanode, to its fully qualified domain name, or even to its IP address would fail.
Since we weren't running many other services on the datanodes, this problem hadn't shown up in other applications. However, a quick check revealed that the problem was general: we could start up sendmail or an echo server and see the same discrepancy between connecting to localhost and connecting to the machine's IP address.
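If you want to reproduce that comparison without telnet, here is a minimal sketch in Python. It is not what we ran at the time; the function names can_connect and compare_targets are made up for this illustration, and the port and hostnames in the usage comment are just the ones from this post.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def compare_targets(port, targets):
    """Map each target name or address to whether it accepts connections."""
    return {target: can_connect(target, port) for target in targets}


# Example usage on the affected machine (names taken from this post):
#   compare_targets(54311, ["localhost", "127.0.0.1", "datanode.xyz.com"])
```

If the loopback targets report True while the hostname or the machine's IP address reports False, you are looking at the same discrepancy we saw.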
Like most people on the various forums, we tried the obvious thing first: perhaps it was an issue with SELinux or the firewall? We turned both off, to no avail. That's when we decided to look a bit deeper.
A look at the listening sockets showed that a listener was attached to 127.0.0.1:54311, that is, to the loopback interface, but not to datanode.xyz.com:54311 or its IP-address:54311. Hmmm... how could that be? We checked the Hadoop configuration, and it was clearly set to bind to the host's domain name, not to localhost. So what gives? That's when we discovered that /etc/hosts was somehow misconfigured on our machines. The culprit, specifically, was the loopback entry in that file.
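If you don't see the problem, it's the fact that datanode.xyz.com was associated with the loopback IP address in this file. So of course, when a program issues a library call to resolve datanode.xyz.com, the name resolves to the loopback address rather than to the machine's real IP address, since entries in /etc/hosts normally take precedence over DNS queries. A daemon told to bind to datanode.xyz.com therefore ends up listening only on the loopback interface, which is why connections to localhost worked and everything else was refused.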
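To check a hosts file for this kind of entry, something like the following sketch could help. It flags any name other than the usual localhost aliases that is mapped to a loopback address. The function name loopback_aliases and the LOCAL_NAMES set are assumptions made for this illustration, and the sample entry in the usage comment is hypothetical, not a copy of the line from our machines.

```python
import ipaddress

# Names that are expected to point at the loopback address (an assumed list).
LOCAL_NAMES = {
    "localhost", "localhost.localdomain",
    "localhost4", "localhost4.localdomain4",
    "localhost6", "localhost6.localdomain6",
    "ip6-localhost", "ip6-loopback",
}


def loopback_aliases(hosts_text):
    """Return (address, name) pairs where a non-localhost name maps to loopback."""
    suspects = []
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        address, names = fields[0], fields[1:]
        try:
            ip = ipaddress.ip_address(address)
        except ValueError:
            continue  # not a plain IP address; skip the line
        if not ip.is_loopback:
            continue
        for name in names:
            if name.lower() not in LOCAL_NAMES:
                suspects.append((address, name))
    return suspects


# Example usage with a hypothetical entry:
#   loopback_aliases("127.0.0.1 datanode.xyz.com datanode localhost\n")
# On a real machine you would pass in the contents of /etc/hosts:
#   loopback_aliases(open("/etc/hosts").read())
```

An empty result means no hostname other than the localhost aliases is pinned to the loopback address.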
Now the fix was clear. All we had to do was restore the /etc/hosts file on all our datanodes. Simply replacing the offending loopback entry with one that no longer lists datanode.xyz.com did the trick. If it doesn't for you, then let me know and we'll have another think about this. I hope this helps.
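To confirm the fix took effect, you can ask the system resolver what a hostname now resolves to. This is a rough sketch rather than part of our original troubleshooting; resolved_addresses and resolves_to_loopback are names invented here. It relies on socket.getaddrinfo, which goes through the system resolver and so should reflect /etc/hosts on a typical Linux setup.

```python
import ipaddress
import socket


def resolved_addresses(hostname):
    """Return the sorted, de-duplicated addresses the resolver gives for hostname."""
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def resolves_to_loopback(hostname):
    """Return True if any resolved address is a loopback address."""
    for addr in resolved_addresses(hostname):
        # Strip an IPv6 scope suffix such as "%eth0" before parsing.
        if ipaddress.ip_address(addr.split("%", 1)[0]).is_loopback:
            return True
    return False


# Example usage on the datanode (hostname taken from this post):
#   resolves_to_loopback("datanode.xyz.com")
```

Once /etc/hosts is corrected, the call in the usage comment should return False on the datanode. Remember to restart any daemon that bound to the old address, since it will not re-resolve the name on its own.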