6 Effective CLI Tools to Debug Distributed Systems

December 5, 2019

Debugging distributed systems can be a pain. There are so many things that could go wrong, from application failures to container permissions to firewalls and networking issues. In this blog post we will get to know 6 useful CLI tools to help identify the root cause of any unexpected behavior.

Is there a networking issue?

Distributed systems need to communicate, and that communication is a common source of trouble. The first step in debugging networking issues is to determine which component needs to talk to which other component, and over which protocol. To begin, identify the following information:

What is the target hostname / IP address?
What is the target port to connect to?
What transport protocol is used? TCP, UDP?
Is the connection secured with certificates?

To identify issues on the different layers, let's collect information step by step.

Is DNS resolution failing? [dig]

Maybe the connection does not work as expected because DNS cannot resolve the hostname of the target system. This is usually easy to spot, as it is often stated explicitly in log messages of affected components. To check DNS resolution you can use dig:

$ dig mimacom.com

; <<>> DiG 9.10.6 <<>> mimacom.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 55149
;; flags: qr rd ra; QUERY: 1, ANSWER: 4, AUTHORITY: 13, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;mimacom.com.			IN	A

;; ANSWER SECTION:
mimacom.com.		53	IN	A	13.32.166.218       (1)
mimacom.com.		53	IN	A	13.32.166.62
mimacom.com.		53	IN	A	13.32.166.81
mimacom.com.		53	IN	A	13.32.166.232

;; Query time: 27 msec
;; SERVER: 10.10.142.1#53(10.10.142.1)              (2)
;; WHEN: Mon Dec 02 10:15:40 CET 2019
;; MSG SIZE  rcvd: 315

Look at the dig output at (1) to see whether the correct records are being resolved. Sometimes DNS resolution issues are caused by different name servers returning different responses. The name server that was used to resolve the hostname is shown at (2). To see the DNS response of a specific name server, you can tell dig which name server to query:

$ dig mimacom.com @1.1.1.1
...
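
To compare the answers of several name servers side by side, a small loop around dig can help. The following is a minimal sketch: the function name compare_resolvers is made up for this illustration, and +short is a standard dig option that prints only the record data. Keep in mind that CDN-backed hostnames may legitimately resolve to different addresses depending on the resolver.

```shell
# compare_resolvers is a hypothetical helper, not part of dig.
# It prints the sorted A records that each given name server returns.
compare_resolvers() {
  host=$1
  shift
  for ns in "$@"; do
    printf '%s: ' "$ns"
    # +short reduces the output to the record data only
    dig +short "$host" A "@$ns" | sort | tr '\n' ' '
    printf '\n'
  done
}

# Usage (needs network access; 8.8.8.8 is only an example resolver):
#   compare_resolvers mimacom.com 1.1.1.1 8.8.8.8
```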

Can a TCP/UDP connection be established? [nc]

Occasionally, there are firewalls in our data centers :-) Especially with TCP connections, the error messages you find in log files vary: they depend on which packets the firewall filters and on whether it sends any response to a filtered request. A firewall that silently drops packets typically shows up as a timeout, while one that actively rejects them typically shows up as a refused connection.

It is very useful to simply verify that a connection from host A to host B with a specific protocol on a specific port actually works. To do so, you can use netcat (nc), which is available on most Unix-like systems:

$ nc -vz mimacom.com 443
Connection to mimacom.com port 443 [tcp/https] succeeded!

If your firewall filters on source ports (the port from which the request is sent on the sending machine), you can specify the local port to use with the -p flag. To test a UDP connection, add the -u flag to the netcat command. Keep in mind that UDP is connectionless, so a reported success only means that no error came back, not that a service actually answered.
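
When several ports need to be checked, the netcat call can be wrapped in a loop. This is a minimal sketch: check_ports is a made-up helper name, and the -w flag (a timeout in seconds) is assumed to be supported by your netcat variant, which is the case for the common implementations but worth verifying locally.

```shell
# check_ports is a hypothetical helper, not part of netcat.
# It tries a TCP connection to each given port and reports the result.
check_ports() {
  host=$1
  shift
  for port in "$@"; do
    # -z: only test the connection, send no data; -w 3: give up after 3 seconds
    if nc -z -w 3 "$host" "$port" >/dev/null 2>&1; then
      echo "$host:$port open"
    else
      echo "$host:$port unreachable"
    fi
  done
}

# Usage (needs network access):
#   check_ports mimacom.com 80 443
```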

Is the request working as expected? [curl]

Now that we know we can reach the target destination, the next question is whether the request we want to make works as expected. Luckily, we can issue HTTP requests from the command line using curl. Several curl flags are particularly helpful when debugging requests:

-I or --head: Sends a HEAD request and prints only the response headers (no body).
-k or --insecure: Skips TLS certificate validation. Use this for debugging only.
-L or --location: Follows any 3xx redirects automatically.
-v or --verbose: Prints request and response details, including the TLS handshake, in case you are really desperate.

So here is an example curl request to fetch the HTTP headers to see what's going on:

$ curl -I https://mimacom.com
HTTP/2 301
content-length: 0
location: http://www.mimacom.com/
date: Mon, 02 Dec 2019 03:50:49 GMT
server: AmazonS3
x-cache: Hit from cloudfront
via: 1.1 3df8c233328fbbb4fd91eb496d73f2d8.cloudfront.net (CloudFront)
x-amz-cf-pop: FRA54
x-amz-cf-id: OwjDI2mA9Gbf6i1G07vUnVxFfdZjWNcRdPl9YFFCPGDIXbQBp4y8zA==
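
Since the response above is a redirect, it can help to see every hop of the redirect chain at once. The following is a minimal sketch that combines -I and -L with -s (a standard curl flag that silences the progress output) and reduces the result to status lines and Location headers. The helper name redirect_chain is an assumption made for this illustration.

```shell
# redirect_chain is a hypothetical helper, not part of curl.
# It reads HTTP response headers on stdin and keeps only the status line
# and the Location header of each hop.
redirect_chain() {
  # HTTP headers end in CRLF; strip the carriage returns before filtering
  tr -d '\r' | grep -iE '^(HTTP/|location:)'
}

# Usage (needs network access):
#   curl -sIL https://mimacom.com | redirect_chain
```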

Is There an Issue With my Application?

Given we have shell access to a machine or a container running our application, we can leverage several tools to extract useful information from a couple of locations. Generally, we're interested in two things:

Do the log files contain any useful hints?
Is the application process running and performing as expected?

Examining application logs [grep]

If we know where the application stores its log files, we can examine them for useful insights into any misbehavior. To extract the relevant parts of huge log files, grep is our best friend. Given a directory that contains multiple log files for the application (e.g. app stderr and stdout, web server logs, process monitor logs, etc.), we can skim through all of them with grep:

# search for any errors
$ grep -iR error . | less

# search for a regular expression
$ grep -ER "std(err|out)" ./* | less
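
When a directory holds many log files, a per-file count shows where to look first. The following is a minimal sketch using grep's -c flag, which counts matching lines per file. The function name count_matches is made up for this illustration, and the sorting assumes that file paths contain no colons.

```shell
# count_matches is a hypothetical helper, not part of grep.
# It counts the lines matching a pattern (case-insensitive) in every file
# below a directory and lists the files with the most matching lines first.
count_matches() {
  pattern=$1
  dir=${2:-.}
  grep -Ric -- "$pattern" "$dir" |
    sort -t: -k2,2nr |   # sort by the count after the colon, descending
    grep -v ':0$'        # hide files without any matching line
}

# Usage:
#   count_matches error .
```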

If you know what to look for, grep can do a great job of pointing you to the right information. But what if you don't know what to look for?

Streaming application logs [tail]

If we don't know what exactly to look for in application logs, we can "stream" any changes to all log files using tail while we provoke the application misbehavior (e.g. by triggering a failing request with curl). To see additions to all log files, we can attach to them with tail as follows:

$ tail -F ./*.log
==> ./app.stderr.log <==

==> ./app.stdout.log <==

==> ./nginx.stderr.log <==

==> ./nginx.stdout.log <==

==> ./app.stdout.log <==
Starting to process '5' blog posts...

New lines appended to any of the monitored log files will now be printed to the console, so you can easily follow what happens during a specific time window. Of course, that only works well if there is little additional noise, so it won't work if the application is under heavy traffic and producing lots of log output.
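
If the stream is too noisy, it can be narrowed down by piping tail into grep. This is a minimal sketch: filter_stream is a made-up helper name, the pattern is only an example, and --line-buffered is assumed to be available (it is in GNU and BSD grep) so that matches appear immediately instead of being held back in a buffer.

```shell
# filter_stream is a hypothetical helper, not part of tail or grep.
# It keeps the "==> file <==" headers printed by tail, so it stays visible
# which file a line came from, plus any line mentioning an error or exception.
filter_stream() {
  grep --line-buffered -iE '^==> .* <==$|error|exception'
}

# Usage (-n 0 skips the existing content and shows only new lines):
#   tail -n 0 -F ./*.log | filter_stream
```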

Checking which processes listen on which port [netstat]

If an application is supposed to accept incoming requests, we can check which ports are being listened on by which process and whether everything is set up correctly. Using netstat, we can list all processes that are listening on ports:

$ netstat -tulpn
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 0.0.0.0:25250           0.0.0.0:*               LISTEN      5093/nginx.conf
tcp        0      0 127.0.0.1:25923         0.0.0.0:*               LISTEN      5374/ruby
tcp        0      0 127.0.0.1:2822          0.0.0.0:*               LISTEN      4962/monit
tcp        0      0 127.0.0.1:2825          0.0.0.0:*               LISTEN      623/bosh-agent
tcp        0      0 0.0.0.0:8844            0.0.0.0:*               LISTEN      6520/java               (1)
tcp        0      0 0.0.0.0:25555           0.0.0.0:*               LISTEN      5346/nginx.conf
tcp        0      0 127.0.0.1:25556         0.0.0.0:*               LISTEN      5140/127.0.0.1:2555
tcp        0      0 0.0.0.0:6868            0.0.0.0:*               LISTEN      623/bosh-agent
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      823/sshd
tcp        0      0 127.0.0.1:5432          0.0.0.0:*               LISTEN      5042/postgres
tcp        0      0 0.0.0.0:8443            0.0.0.0:*               LISTEN      5397/java
tcp        0      0 0.0.0.0:4222            0.0.0.0:*               LISTEN      4990/gnatsd
udp     3840      0 127.0.0.1:323           0.0.0.0:*                           831/chronyd
udp        0      0 0.0.0.0:60302           0.0.0.0:*                           5397/java

If you don't see the PID and program name, make sure you have root permissions. In the example above, we see at (1) that a Java process with PID 6520 listens on port 8844. Knowing the process ID, we can inspect all information about the process itself, for example the full startup command:

# all info for a process is stored in /proc/<PID>/
# Note: the following output is formatted here for readability
$ cat /proc/6520/cmdline
java -Xmx1024m \
  -Dspring.profiles.active=prod \
  -Dspring.config.location=/var/vcap/jobs/credhub/config/application.yml \
  -Dlog4j.configurationFile=/var/vcap/jobs/credhub/config/log4j2.properties \
  -Djava.security.egd=file:/dev/urandom \
  -Djna.boot.library.path=/var/vcap/packages/credhub/ \
  -Djava.io.tmpdir=/var/vcap/jobs/credhub/tmp \
  -Djdk.tls.ephemeralDHKeySize=4096 \
  -Djdk.tls.namedGroups=secp384r1 \
  -Djavax.net.ssl.trustStore=/var/vcap/jobs/credhub/config/trust_store.jks \
  -Djavax.net.ssl.trustStorePassword=S3cretTru5tSt0re \
  -ea \
  -jar credhub.jar
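
The raw cmdline file separates the arguments with NUL bytes rather than spaces, which is why plain cat prints them run together. The following is a small sketch that prints one argument per line instead; print_cmdline is a made-up helper name for this illustration.

```shell
# print_cmdline is a hypothetical helper.
# It prints a NUL-separated cmdline file with one argument per line.
print_cmdline() {
  tr '\0' '\n' < "$1"
}

# Usage:
#   print_cmdline /proc/6520/cmdline
```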

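If your application cannot be reached from other machines, check the local address it is bound to. A process bound to 127.0.0.1 (such as the ruby process on port 25923 above) only accepts connections from the same machine, whereas a process bound to 0.0.0.0 accepts connections on all network interfaces. For listening sockets the foreign address is always shown as a wildcard, so it does not tell you anything about who may connect.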
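
To focus on a single port, the netstat output can be reduced to the protocol, the local address and the owning process. This is a minimal sketch that assumes the column layout shown above and program names without spaces; listeners_on_port is a made-up helper name.

```shell
# listeners_on_port is a hypothetical helper, not part of netstat.
# It reads `netstat -tulpn` output on stdin and prints
# "proto local_address pid/program" for sockets bound to the given port.
listeners_on_port() {
  awk -v port="$1" '
    $1 ~ /^(tcp|udp)/ {
      # column 4 is the local address; the port follows the last colon
      n = split($4, parts, ":")
      if (parts[n] == port) print $1, $4, $NF
    }'
}

# Usage (root permissions are needed to see the PID column):
#   netstat -tulpn | listeners_on_port 8844
```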

Debugging is Fun (Again)

When faced with application misbehavior, it is at times difficult to determine the exact root cause. The commands shown in this article cover a broad range of topics and help you confirm or reject your suspicions about the root cause. What commands do you use to debug your applications? Let us know in the comments! And stay tuned for more debugging tips during our blogging advent calendar this month.