Friday, November 27, 2015

Logs and troubleshooting in Mesos

One of the most important, and most hated, tasks in systems administration is troubleshooting. Despite all our experience and knowledge, systems fail. That's why it is so important to know where to look when something breaks: without that information it's very difficult to fix the problem.

This tutorial focuses on Red Hat/CentOS 7, so if you are using a different distro be careful: some files may live in different paths and/or have different names.


Log Files

By default, these are the most important log files in Mesos:

Mesos master:

/var/log/mesos/mesos-master.ERROR
/var/log/mesos/mesos-master.INFO
/var/log/mesos/mesos-master.WARNING

Mesos slave:

/var/log/mesos/mesos-slave.ERROR
/var/log/mesos/mesos-slave.INFO
/var/log/mesos/mesos-slave.WARNING

Marathon, Chronos and more logs from the Mesos master and slave:

/var/log/messages

Yes, /var/log/messages contains logs from the Mesos master, the Mesos slave, Marathon and Chronos... This is unmanageable, so the first thing we must do is tell syslog to place every log in a different file. For this we create a new file named /etc/rsyslog.d/mesos.conf with the following content:

if $programname == 'marathon' then {
    action(type="omfile" file="/var/log/mesos/marathon.log")
}
if $programname == 'chronos' then {
    action(type="omfile" file="/var/log/mesos/chronos.log")
}
if $programname == 'mesos-master' then {
    action(type="omfile" file="/var/log/mesos/mesos-master.log")
}
if $programname == 'mesos-slave' then {
    action(type="omfile" file="/var/log/mesos/mesos-slave.log")
}
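Note that these rules only add copies: the messages keep flowing into /var/log/messages as well. If you prefer each service's log to appear only in its own file, a `stop` statement after each action (supported in rsyslog 7 and later, the version shipped with CentOS 7) discards the message after it has been written. A sketch for one of the rules:

```
if $programname == 'marathon' then {
    action(type="omfile" file="/var/log/mesos/marathon.log")
    stop
}
```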

Then restart the rsyslog service:
# systemctl restart rsyslog
After that we will see the new log files in the /var/log/mesos directory, each with messages from its respective service.
# ls -l /var/log/mesos/*.log
-rw-------. 1 root root 108306 Nov 27 07:16 chronos.log
-rw-------. 1 root root 358013 Nov 27 07:16 marathon.log
-rw-------. 1 root root 518488 Nov 27 07:16 mesos-master.log
-rw-------. 1 root root 575739 Nov 27 07:16 mesos-slave.log
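One side effect of writing these files directly from rsyslog is that nothing rotates them, so they grow without bound. A minimal logrotate sketch (the file name /etc/logrotate.d/mesos and the weekly/4-copies policy are assumptions; adjust to your own retention needs). The HUP in postrotate makes rsyslog reopen the files after rotation:

```
/var/log/mesos/*.log {
    weekly
    rotate 4
    compress
    missingok
    notifempty
    postrotate
        /usr/bin/systemctl kill -s HUP rsyslog.service >/dev/null 2>&1 || true
    endscript
}
```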


Finding Errors

Once we have our log files ordered and in place, we can look there for errors whenever we have a problem at Mesos runtime. A good starting point is the marathon.log file and the Mesos GUI.

We will send a failing task to Marathon and follow the log messages to find out what's happening. This is the task we will send:

{
  "id": "example",
  "cmd": "/usr/local/bin/sleep 10",
  "instances": 1,
  "cpus": 0.1,
  "mem": 64
}
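Before POSTing, it's worth checking that the file actually parses as JSON, since Marathon will reject a malformed definition outright. A quick sketch, writing the definition above to example.json and running it through Python's json.tool (which exits non-zero on bad JSON):

```shell
# Write the app definition and sanity-check it before sending it to Marathon.
cat > example.json <<'EOF'
{
  "id": "example",
  "cmd": "/usr/local/bin/sleep 10",
  "instances": 1,
  "cpus": 0.1,
  "mem": 64
}
EOF

# json.tool exits with a non-zero status on invalid JSON.
if python3 -m json.tool example.json > /dev/null; then
  echo "example.json: valid JSON"
else
  echo "example.json: invalid JSON"
fi
# Prints: example.json: valid JSON
```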

The error is pretty clear, but we will pretend we didn't notice it:

# curl -X POST http://mesosmaster:8080/v2/apps -d @"example.json" -H "Content-type: application/json"

Now, in Marathon's GUI, we can see the task stays in Waiting status forever.


The debug information doesn't tell us much, except that the task is failing.



marathon.log shows the attempts to scale the task from 0 to 1 instances over and over, but doesn't give the reason for the failure. And mesos-slave.log tells us the task goes to RUNNING and then FAILED:

# tail -f /var/log/mesos/mesos-slave.log
[..]
Nov 27 12:08:53 ip-172-31-27-67 mesos-slave[754]: I1127 12:08:53.269606 892 slave.cpp:2946] Sending acknowledgement for status update TASK_RUNNING (UUID: f81b406e-bd06-4292-96f4-7551f170ee9a) for task example.8928fb21-9529-11e5-a294-024277bab1b7 of framework 56fe3cff-8792-4023-8f3f-3d2764da2818-0000 to executor(1)@172.31.27.67:40701
Nov 27 12:08:53 ip-172-31-27-67 mesos-slave[754]: I1127 12:08:53.277469 892 status_update_manager.cpp:394] Received status update acknowledgement (UUID: f81b406e-bd06-4292-96f4-7551f170ee9a) for task example.8928fb21-9529-11e5-a294-024277bab1b7 of framework 56fe3cff-8792-4023-8f3f-3d2764da2818-0000
Nov 27 12:08:53 ip-172-31-27-67 mesos-slave[754]: I1127 12:08:53.277513 892 status_update_manager.cpp:826] Checkpointing ACK for status update TASK_RUNNING (UUID: f81b406e-bd06-4292-96f4-7551f170ee9a) for task example.8928fb21-9529-11e5-a294-024277bab1b7 of framework 56fe3cff-8792-4023-8f3f-3d2764da2818-0000
Nov 27 12:08:53 ip-172-31-27-67 mesos-slave[754]: I1127 12:08:53.367154 892 slave.cpp:2717] Handling status update TASK_FAILED (UUID: 966655b9-1cb9-4449-bea0-a88f06b641ef) for task example.8928fb21-9529-11e5-a294-024277bab1b7 of framework 56fe3cff-8792-4023-8f3f-3d2764da2818-0000 from executor(1)@172.31.27.67:40701
[..]
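With log lines this verbose, it helps to strip them down to just the status transitions. A small sketch (the extract_states helper is my own, not part of Mesos); here it is fed a shortened two-line sample, but on a real node you would pipe /var/log/mesos/mesos-slave.log into it instead:

```shell
# Print only the TASK_* states from "Handling status update" lines
# that belong to the task named in $1.
extract_states() {
  grep 'Handling status update' | grep "task $1\." | grep -o 'TASK_[A-Z_]*'
}

# Shortened sample input; on a real slave, pipe the actual log file in.
printf '%s\n' \
  'slave.cpp:2717] Handling status update TASK_RUNNING (UUID: ...) for task example.8928fb21 of framework ...' \
  'slave.cpp:2717] Handling status update TASK_FAILED (UUID: ...) for task example.8928fb21 of framework ...' \
  | extract_states example
# Prints:
# TASK_RUNNING
# TASK_FAILED
```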


Then it's time to go to the Mesos GUI to see which tasks exist and what their status is.


These are the tasks Marathon is trying to deploy, and all of them have failed. By clicking the "Sandbox" link we can access the task's stderr and stdout.


Now we can see why the task is not starting. One caveat, though: this method is awkward if you are not on the Mesos node's network, because it requires direct access from your browser to the node's private IP. So we will do the same from a text console instead of the GUI, and look at what the stderr and stdout files contain.

From the node where Mesos is trying to start the task (in this case "mesosmaster", which also works as a slave), we can access all these files directly from the console. To find them we can use the following command:

# find /tmp -name "*stderr*" | grep example
[..]
/tmp/mesos/slaves/7ab5d196-6ecb-4424-868e-80d309b11e1f-S0/frameworks/56fe3cff-8792-4023-8f3f-3d2764da2818-0000/executors/example.ddc7b076-9528-11e5-a294-024277bab1b7/runs/e344824c-4db6-4451-a172-71a94796bc28/stderr
/tmp/mesos/slaves/7ab5d196-6ecb-4424-868e-80d309b11e1f-S0/frameworks/56fe3cff-8792-4023-8f3f-3d2764da2818-0000/executors/example.e0472f67-9528-11e5-a294-024277bab1b7/runs/0504c1ef-3998-4018-9d71-96acff3e607b/stderr
/tmp/mesos/slaves/7ab5d196-6ecb-4424-868e-80d309b11e1f-S0/frameworks/56fe3cff-8792-4023-8f3f-3d2764da2818-0000/executors/example.e34490e8-9528-11e5-a294-024277bab1b7/runs/77b3546d-7f04-46e4-80c3-2641ab1b2191/stderr
/tmp/mesos/slaves/7ab5d196-6ecb-4424-868e-80d309b11e1f-S0/frameworks/56fe3cff-8792-4023-8f3f-3d2764da2818-0000/executors/example.e6428ea9-9528-11e5-a294-024277bab1b7/runs/b2590417-c195-4d0a-802d-60a84f3f297c/stderr
/tmp/mesos/slaves/7ab5d196-6ecb-4424-868e-80d309b11e1f-S0/frameworks/56fe3cff-8792-4023-8f3f-3d2764da2818-0000/executors/example.e9408c6a-9528-11e5-a294-024277bab1b7/runs/78d4b89d-0861-435c-8398-627d67043950/stderr
[..]


In the grep, replace "example" with your task's name.
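The run directories pile up quickly, and usually we only care about the newest stderr. A sketch of how to pick it out: it rebuilds a tiny, hypothetical sandbox-like layout under a temp dir just to demonstrate; on a real slave you would search /tmp/mesos (the default work_dir) instead of "$work". %T@ is a GNU find format that prints the mtime as a sortable epoch timestamp:

```shell
# Demo layout (hypothetical paths); on a real slave, search /tmp/mesos.
work=$(mktemp -d)
mkdir -p "$work/executors/example.run1/runs/a" "$work/executors/example.run2/runs/b"
echo "old error" > "$work/executors/example.run1/runs/a/stderr"
sleep 1
echo "new error" > "$work/executors/example.run2/runs/b/stderr"

# Sort the stderr files by modification time and keep the newest path.
newest=$(find "$work" -name stderr -path '*example*' -printf '%T@ %p\n' \
  | sort -n | tail -n1 | cut -d' ' -f2-)
cat "$newest"
# Prints: new error
```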

These are the stderr files from all the task runs Mesos has launched with errors. Now we can inspect one of them to see what's happening:

# cat /tmp/mesos/slaves/7ab5d196-6ecb-4424-868e-80d309b11e1f-S0/frameworks/56fe3cff-8792-4023-8f3f-3d2764da2818-0000/executors/example.e9408c6a-9528-11e5-a294-024277bab1b7/runs/78d4b89d-0861-435c-8398-627d67043950/stderr
I1127 12:04:24.979506 7689 exec.cpp:134] Version: 0.25.0
I1127 12:04:24.983109 7690 exec.cpp:208] Executor registered on slave 7ab5d196-6ecb-4424-868e-80d309b11e1f-S0
sh: /usr/local/bin/sleep: No such file or directory


As we already knew, /usr/local/bin/sleep doesn't exist. Now we can fix it and resubmit the task with the correct path.
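A habit that avoids this whole class of failure: before hard-coding an absolute path in "cmd", check where the binary really lives on the slave that will run it. A quick sketch:

```shell
# command -v resolves the binary through PATH, printing its real location.
command -v sleep

# And verify the exact path you were about to use actually exists there.
test -x /usr/local/bin/sleep || echo "/usr/local/bin/sleep: not executable here"
```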

Now we are ready to face the most common problems when working with Mesos and its tasks.
