In this demo we look at interactively analyzing application logs. The source of the application logs is WordPress, a popular blogging engine.
The demo shows how to ingest the application logs into Minio, an object store akin to Amazon S3, and how to query those logs with SQL using Apache Drill, a distributed schema-free query engine.
- Estimated time for completion:
  - Install: 20min
- Target audience: Anyone interested in interactive application log analysis.
Table of Contents:
- Architecture
- Prerequisites
- Install the demo
- Use the demo
Log data is generated in WordPress (WP) by an end user interacting with it. This data is loaded into Minio, and Apache Drill is then used to query it interactively.
- A running DC/OS 1.9.0 or higher cluster with at least 3 private agents and 1 public agent, each with 2 CPUs and 5 GB of RAM available, as well as the DC/OS CLI installed in version 0.14 or higher.
- The dcos/demos Git repo must be available locally; if you haven't cloned it yet, use:
git clone https://github.com/dcos/demos.git
- The JSON query util jq must be installed.
- SSH cluster access must be set up.
Going forward we'll call the directory you cloned the dcos/demos Git repo into $DEMO_HOME.
For Minio and Apache Drill we need to have Marathon-LB installed:
$ dcos package install marathon-lb
To serve the log data for analysis in Drill we use Minio in this demo, just as you would use, say, S3 in AWS.
To set up Minio, find out the IP of the public agent and store it in an environment variable called $PUBLIC_AGENT_IP, for example:
$ export PUBLIC_AGENT_IP=52.24.255.200
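Because the install steps in this demo rely on this variable, a small guard can catch a missing export before a script fails halfway. This is only a sketch; the function name check_public_agent_ip is made up for this example and is not part of the demo repo.

```shell
# Hypothetical helper: succeed only if PUBLIC_AGENT_IP is set and non-empty.
check_public_agent_ip() {
  if [ -z "${PUBLIC_AGENT_IP:-}" ]; then
    echo "PUBLIC_AGENT_IP is not set" >&2
    return 1
  fi
  echo "Using public agent $PUBLIC_AGENT_IP"
}

# Print the address in use, or a hint if the export is missing.
check_public_agent_ip || echo "Run: export PUBLIC_AGENT_IP=<IP of your public agent>"
```

Running the check before each install script confirms that the current shell session still has the variable set.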
Now you can install the Minio package like so:
$ cd $DEMO_HOME/applogs/1.9/
$ ./install-minio.sh
After this, Minio is available on port 9000 of the public agent, so open $PUBLIC_AGENT_IP on port 9000 in your browser and you should see the UI.
The default login is "minio/minio123".
Note that you can learn more about Minio and the credentials in the respective example.
Apache Drill is a distributed SQL query engine, allowing you to interactively explore heterogeneous datasets across data sources (CSV, JSON, HDFS, HBase, MongoDB, S3).
A prerequisite for the Drill install is that the $PUBLIC_AGENT_IP environment variable is set, which you should have done in the previous step.
Now do the following to install Drill:
$ cd $DEMO_HOME/applogs/1.9/
$ ./install-drill.sh
Go to http://$PUBLIC_AGENT_IP:8047/ to access the Drill Web UI.
Next we need to configure the S3 storage plugin in order to access data on Minio.
For this, go to the Storage tab in Drill, enable the s3 plugin, click on the Update button and paste the content of your (local) drill-s3-plugin-config.json into the field, overwriting everything that was there before.
After another click on the Update button the data is stored in ZooKeeper and persists even if you restart Drill.
To check if everything is working fine, go to Minio, create a bucket called test and upload drill/apache.log into it.
Now, go to the Drill UI, change to the Query tab and execute the following query to verify your setup:
select * from s3.`apache.log`
You should see the rows of apache.log in the result table.
Next we install WordPress, acting as the data source for the logs.
Note that the environment variable called $PUBLIC_AGENT_IP must be exported.
$ cd $DEMO_HOME/applogs/1.9/
$ ./install-wp.sh
Discover where WP is available via the HAProxy stats page at http://$PUBLIC_AGENT_IP:9090/haproxy?stats (look for the wordpress_XXXXX frontend).
In my case, WP is available via port 10102 on the public agent, that is, via http://$PUBLIC_AGENT_IP:10102/.
Finally, complete the WP install so that it can be used.
The following sections describe how to use the demo after having installed it.
First interact with WP, that is, create some posts and surf around. Then, to capture the logs, execute the following locally (on your machine):
$ cd $DEMO_HOME/applogs/1.9/
$ echo remote ignore0 ignore1 timestamp request status size origin agent > session.log && dcos task log --lines 1000 wordpress | tail -n +30 | sed 's, \[\(.*\)\] , \"\1\" ,' >> session.log
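To make the sed step in the command above easier to follow, here is a minimal sketch that applies the same substitution to a single log line. The sample line is made up for illustration and assumes that the WordPress container logs in the Apache combined format, which matches the nine column names written to the header.

```shell
# Hypothetical sample line in Apache combined log format (not from a real session).
sample='172.17.0.1 - - [10/Apr/2017:09:15:02 +0000] "GET /wp-login.php HTTP/1.1" 200 1145 "-" "curl/7.47.0"'

# Same substitution as in the capture command: replace the square brackets
# around the timestamp with double quotes.
quote_timestamp() {
  sed 's, \[\(.*\)\] , "\1" ,'
}

# Print the rewritten line.
printf '%s\n' "$sample" | quote_timestamp
```

After the rewrite the timestamp is enclosed in double quotes, just like the request and agent fields, so the space inside it no longer looks like a field separator.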
Next upload session.log into the test bucket in Minio.
Now you can use Drill to understand the usage patterns, for example:
List HTTP requests with more than 1000 bytes payload:
select remote, request from s3.`session.log` where size > 1000
The above query returns the remote address and request of each matching log entry.
List HTTP requests that succeeded (HTTP status code 200):
select remote, request, status from s3.`session.log` where status = 200
The above query returns the remote address, request and status of each matching log entry.
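The same queries can also be submitted without the Web UI. The following is a minimal sketch that assumes Drill's REST endpoint /query.json is reachable on the same port 8047 as the Web UI. The function names drill_payload and run_drill_query are made up for this example, and the query must not contain double quotes or backslashes because no JSON escaping is done.

```shell
# Wrap a SQL string into the JSON body expected by Drill's REST API.
# Assumption: the query contains no double quotes or backslashes.
drill_payload() {
  printf '{"queryType": "SQL", "query": "%s"}\n' "$1"
}

# Hypothetical helper: POST the query to Drill (needs $PUBLIC_AGENT_IP and curl).
run_drill_query() {
  curl -s -X POST \
    -H 'Content-Type: application/json' \
    -d "$(drill_payload "$1")" \
    "http://$PUBLIC_AGENT_IP:8047/query.json"
}

# Show the request body for the status query; this needs no network access.
drill_payload 'select remote, request, status from s3.`session.log` where status = 200'
```

Calling run_drill_query with the same query string sends it to the cluster and prints the JSON response, which can be piped through jq for readability.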
In this demo we ingested application log data from WordPress into Minio and queried it using Apache Drill.
- One area for improvement is the ingestion process, which is currently carried out locally, that is, by manually running the DC/OS CLI on your machine. A more advanced scenario would, for example, use a DC/OS Job to periodically ingest the logs in a timestamped manner into Minio.
- While Drill is set up in distributed mode, currently only a single drillbit is used; by scaling the Drill service, one can query more data, faster.
- The results are currently shown in tabular form (as a result of the SQL queries issued). A more insightful way to render the query results would be to use BI tools such as Tableau or Datameer, connecting them via the JDBC interface.
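As a starting point for the timestamped ingestion mentioned in the first point, the sketch below derives a distinct file name per capture. The naming scheme and the function name timestamped_name are assumptions for illustration; the capture and upload steps are left out.

```shell
# Build a file name such as session-20170410T091502Z.log.
# An explicit timestamp can be passed as $1; otherwise the current UTC time is used.
timestamped_name() {
  ts="${1:-$(date -u +%Y%m%dT%H%M%SZ)}"
  printf 'session-%s.log\n' "$ts"
}

# Example with a fixed timestamp so the result is reproducible.
timestamped_name 20170410T091502Z

# Example with the current time.
timestamped_name
```

A periodic job could write each capture to a file named this way instead of overwriting session.log, so that earlier captures stay available in the bucket.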
Should you have any questions or suggestions concerning the demo, please raise an issue in Jira or let us know via the mailing list.