Honing in on the Homeless – the Splunkish way

Have you noticed Splunk just released a new version, including new data visualizations? I had been eager to start playing with one of the new charts when yesterday I came across a blog post by Bob Rudis, co-author of the Data-Driven Security book and a former member of Verizon’s DBIR team.

In that post, @hrbrmstr presents readers with a dataviz challenge based on data from the U.S. Department of Housing and Urban Development (HUD) related to homeless population estimates. So I’ve decided to give it a go with Splunk.

Even though we can’t really compare Splunk’s Search Processing Language (SPL) with the power of R and other stats/dataviz-focused programming languages, this exercise may serve to demonstrate some of the capabilities of Splunk Enterprise.

Sidenote: In case you are into Machine Learning (ML) and Splunk, it’s also worth checking the new ML stuff just released along with Splunk 6.4, including the awesome ML Toolkit showcase app.

The challenge is basically about asking insightful, relevant questions of the HUD data sets and generating visualizations that help answer those questions.

What can the data sets tell us about the homeless population issue?

The following are the questions I try to answer, based on the one proposed in the challenge post: Which “states” have the worst problem in terms of homeless people?

  1. Which states currently have the largest homeless population per capita?
  2. Which states currently have the largest absolute homeless population?
  3. Which states are being successful or failing on lowering the figures compared to previous years?

I am far from considering myself a data scientist (I was looking up the standard deviation formula the other day), but I love playing with data, like many other Infosec folks in our community. So please take it easy with newbies!

Since we are dealing with data points representing estimates and this is a sort of experiment/lab, take them with a grain of salt and consider adding “according to the data sets…here’s what that Splunk guy verified” to the statements found here.

Which states currently have the largest homeless population per capita?

For this one, it’s pretty straightforward to go with a Column chart for quick results. Another approach would be to gather map data and work on a Choropleth chart.

Basically, after calculating the normalized values (homeless per 100k population), I keep only the US states at the top of the list, limited to 10 values. They are then sorted by their 2015 values and displayed on the chart below:
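
As a rough illustration of that normalization and top-N filtering, here is a minimal JavaScript sketch. The row shape (`state`, `homeless`, `population`), the function names and the numbers are all made up for the example; none of them come from the HUD or population files.

```javascript
// Hypothetical input: one object per state for a single year.
// The figures are placeholders, not HUD or Census values.
const sampleRows = [
  { state: "AA", homeless: 500, population: 1000000 },
  { state: "BB", homeless: 1200, population: 4000000 },
  { state: "CC", homeless: 90, population: 60000 },
];

// Homeless people per 100,000 residents, rounded to the nearest integer.
function perCapita(row) {
  return Math.round((100000 * row.homeless) / row.population);
}

// Rank by ratio (descending) and keep the first `limit` entries.
function topByRatio(data, limit = 10) {
  return data
    .map((row) => ({ state: row.state, ratio: perCapita(row) }))
    .sort((a, b) => b.ratio - a.ratio)
    .slice(0, limit);
}

const top = topByRatio(sampleRows, 2);
console.log(top);
```

Note how the smallest state in absolute terms ends up first once the values are normalized, which is exactly why the per capita view and the absolute view can tell different stories.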

Homeless per 100k of population – Top 10 US states

The District of Columbia clearly stands out, followed by Hawaii and New York. That’s one I would never have guessed. But there seems to be some explanation for it.

Which states currently have the largest absolute homeless population?

In this case, only the homeless figures are considered for extracting the top 10 states. Below are the US states with the largest homeless populations based on the latest numbers (2015).

Homeless by absolute values – Top 10 US states

As many would guess, New York and California are leading here. Those two states, along with Florida and Texas, have clearly been at the top of the list since 2007.

Which states are being successful or failing on lowering the figures compared to previous years?

Here we make use of a new visualization called Horizon chart. In case you are not familiar with this one, I encourage you to check this link where everything you need to know about it is carefully explained.

Basically, it eases the challenge of visualizing multiple (time) series with less space (height) by using layered bands with different color codes to represent relative positive/negative values, and different color shades (intensity) to represent the actual measured values (data points).

After crafting the SPL query, here’s the result (3 bands, smoothed edges) for all 50 states plus DC, present in the data sets:

horizon-chart

So how do you read this visualization? Keep in mind the chart is based on the same prepared data used in the first chart (homeless/100k population).

The red color means the data point is higher when compared to the previous measurement (more homeless/capita), whereas the blue represents a negative difference when comparing current and last measurements (less homeless/capita). This way, the chart also conveys trending, possibly uncovering the change in direction over time.

The more intense the color is, the higher the (absolute) value. You can also picture it as a stacked area chart without needing extra height for rendering.

The numbers listed on the right-hand side represent the difference between adjacent data points in the timeline (current/previous). For instance, last year’s ratio (2015) for Washington decreased by ~96 compared to the previous year (2014).
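
To make those reading rules more concrete, here is a small JavaScript sketch of how I understand them: the sign of the year-over-year change picks the color, and its magnitude picks the band. This is only my interpretation for illustration, not the Horizon chart’s actual rendering code; the function names, the `maxAbs` scale and the sample ratios are assumptions.

```javascript
// Year-over-year change for one state's series of ratios.
// Positive changes would be drawn in red, negative ones in blue.
function yearlyDeltas(series) {
  const deltas = [];
  for (let i = 1; i < series.length; i++) {
    const change = series[i].ratio - series[i - 1].ratio;
    deltas.push({
      year: series[i].year,
      change,
      direction: change > 0 ? "up" : change < 0 ? "down" : "flat",
    });
  }
  return deltas;
}

// Map a change to a band index (1..bands) by magnitude; 0 means no change.
// maxAbs is the largest absolute change the scale should cover.
function bandFor(change, maxAbs, bands = 3) {
  if (change === 0 || maxAbs === 0) return 0;
  return Math.min(bands, Math.ceil((Math.abs(change) / maxAbs) * bands));
}

// Placeholder ratios for a hypothetical state.
const exampleSeries = [
  { year: 2013, ratio: 100 },
  { year: 2014, ratio: 130 },
  { year: 2015, ratio: 70 },
];
const exampleDeltas = yearlyDeltas(exampleSeries);
console.log(exampleDeltas.map((d) => [d.year, d.change, bandFor(d.change, 60)]));
```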

On a Splunk dashboard or from the search query interface (Web GUI), there’s also an interactive line that displays the relative values as the user hovers over a point in the timeline, which is really handy (seen below).

horizon_crop

The original data files are provided below and also referenced from the challenge’s blog and GitHub pages. I used an xlsx2csv one-liner before handling the data in Splunk (there are many other ways to do it, though).

HUD’s homeless population figures (per State)
US Population (per State)

The Splunk query used to generate the data used as input for the Horizon chart is listed below. It seems a bit hacky, but does the job well without too much effort.

| inputlookup 2007-2015-PIT-Counts-by-State.csv
| streamstats last(eval(case(match(Total_Homeless, "Total"), Total_Homeless))) as _time_Homeless
| where NOT State_Homeless="State"
| rex mode=sed field=_time_Homeless "s|(^[^\d]+)(\d+)|\2-01-01|"
| rename *_Homeless AS *
| join max=0 type=inner _time State [
  | inputlookup uspop.csv
  | table iso_3166_2 name
  | map maxsearches=51 search="
    | inputlookup uspop.csv WHERE iso_3166_2=\"$iso_3166_2$\"
    | table X*
    | transpose column_name=\"_time\"
    | rename \"row 1\" AS \"Population\"
    | eval State=\"$iso_3166_2$\"
    | eval Name=\"$name$\"
  "
  | rex mode=sed field=_time "s|(^[^\d]+)(\d+)|\2-01-01|"
]
| eval _time=strptime(_time, "%Y-%m-%d")
| eval ratio=round((100000*Total)/Population)
| chart useother=f limit=51 values(ratio) AS ratio over _time by Name
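
For readers who don’t speak SPL, the reshaping done by the subsearch can be sketched in JavaScript: the population lookup is wide (one `X<year>` column per year), so each state’s row is turned into one record per year and then joined with the homeless counts on state and year. The object shapes, function names and numbers below are assumptions for illustration, not the real lookup contents.

```javascript
// Hypothetical wide population row, mirroring the X<year> columns that the
// query selects with "table X*". The numbers are placeholders.
const popRow = { iso_3166_2: "AA", name: "Example State", X2014: 2000000, X2015: 2100000 };

// Turn one wide row into one record per year (the role of transpose in the query).
function toLong(row) {
  return Object.keys(row)
    .filter((key) => /^X\d{4}$/.test(key))
    .map((key) => ({
      state: row.iso_3166_2,
      name: row.name,
      year: Number(key.slice(1)),
      population: row[key],
    }));
}

// Inner join of homeless counts with population on (state, year),
// computing the per-100k ratio for every matched pair.
function joinCounts(homelessRows, popRows) {
  const index = new Map(popRows.map((p) => [`${p.state}|${p.year}`, p]));
  return homelessRows.flatMap((h) => {
    const p = index.get(`${h.state}|${h.year}`);
    if (!p) return []; // unmatched rows are dropped, as in an inner join
    return [{ name: p.name, year: h.year, ratio: Math.round((100000 * h.total) / p.population) }];
  });
}

const joined = joinCounts(
  [
    { state: "AA", year: 2015, total: 4200 },
    { state: "ZZ", year: 2015, total: 10 }, // no population record, so it is dropped
  ],
  toLong(popRow)
);
console.log(joined);
```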

Want to check out more of those write-ups? I did one in Portuguese related to Brazil’s Federal Budget application (also based on Splunk charts). Perhaps I will update this one soon with new charts and a short English version.

My 1st Splunk app: RAW Charts

After some days playing around with a few interesting apps, I’ve decided to give it a try and learn how to integrate the RAW data visualization project into Splunk.

It turns out that, by reading the right (latest) App Development documentation (thanks IRC!) and checking good examples, it’s quite an easy job, especially if you are already familiar with web development technologies (HTML, JS/jQuery and the likes).

Here’s a bit of motivation to do it:

  • Connecting with the Splunk community;
  • Getting up to speed with the Splunk Web Framework for quickly developing custom content (views, dashboards, apps, etc);
  • Easily visualizing search results in different formats by leveraging the search bar functionality, rather than editing hard-coded dashboard searches;
  • Helping to spread the word about the power of data visualization by demonstrating the incredible D3 library and the RAW project;
  • Having fun! (a must for any learning experience nowadays, right?)

RAW project?

I will not dare to describe it better than the creators of this great project:

“The missing link between spreadsheets and vector graphics.”

A more detailed description can also be found in the project’s README file:

RAW is an open web tool developed at the DensityDesign Research Lab (Politecnico di Milano) to create custom vector-based visualizations on top of the amazing d3.js library by Mike Bostock. Primarily conceived as a tool for designers and vis geeks, RAW aims at providing a missing link between spreadsheet applications (e.g. Microsoft Excel, Apple Numbers, Google Docs, OpenRefine, …) and vector graphics editors (e.g. Adobe Illustrator, Inkscape, …).

What you can do instead is simply browse the project interface here: app.raw.densitydesign.org. Paste your data or just pick a data sample to see how easy it is to create a chart without a single line of code.

And since we are talking about one line of code, let’s get straight to the point. Here’s a quick and dirty hack for automatically copying the search results into RAW’s workflow:

$scope.text = localStorage.getItem('searchresults')

In fact, I’m not sure if that’s the optimal way to accomplish it, but that’s the only change needed within RAW’s code (controllers.js). The wonderful Italian mafia team at Density Design might be reading this now, so guys please advise! (I know you are very busy).

Nevertheless, after a quick read through AngularJS, that change looks like a quick win. What it does is tell the browser to load the data from a local storage into RAW’s textarea. Local storage? Remember Cookies and HotDog editor? That’s history! Actually, not.
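
The handoff itself can be sketched in a few lines. Local storage is scoped per origin, so this only works when both pages are served from the same origin. The helper names below are hypothetical, and the in-memory stand-in exists only so the sketch runs outside a browser; in the app you would pass `window.localStorage` instead.

```javascript
// Minimal in-memory stand-in exposing the same getItem/setItem shape as
// window.localStorage, so the sketch also runs outside a browser.
function createMemoryStorage() {
  const items = new Map();
  return {
    setItem(key, value) {
      items.set(key, String(value));
    },
    getItem(key) {
      return items.has(key) ? items.get(key) : null;
    },
  };
}

const STORAGE_KEY = "searchresults"; // same key as in the one-line RAW change

// Parent page: publish the serialized search results.
function publishResults(storage, csvText) {
  storage.setItem(STORAGE_KEY, csvText);
}

// Page inside the iframe: read them back, falling back to an empty string.
function loadResults(storage) {
  const stored = storage.getItem(STORAGE_KEY);
  return stored === null ? "" : stored;
}

const demoStorage = createMemoryStorage(); // in a browser: window.localStorage
publishResults(demoStorage, "group,count\nqueue,2");
const loadedText = loadResults(demoStorage);
console.log(loadedText);
```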

The Splunk Code

By using the Web Framework Toolkit, creating an app is really easy. Just use the splunkdj createapp <app-name> command and start customizing the default view that is built in, home.html. Here’s the main code piece used for this app (JavaScript block):

{% block js %}
<script>

function createIframe(){
    // reset div contents
    document.getElementById("raw-charts").innerHTML = "";

    // create an iframe
    var rawframe = document.createElement("iframe");
    rawframe.id = "rawframe";
    rawframe.src = "{{STATIC_URL}}{{app_name}}/raw/index.html";
    rawframe.scrolling = "no";
    rawframe.style.border = "none";
    rawframe.width = "100%";
    rawframe.height = "3700px";

    // insert iframe
    document.getElementById("raw-charts").appendChild(rawframe);

};

var deps = [
	"splunkjs/ready!",
	"splunkjs/mvc/searchmanager"
];

require(deps, function(mvc) {

	// this guy handles the search/results
	var SearchManager = require("splunkjs/mvc/searchmanager");

	// initial search definition
	var mainSearch = new SearchManager({
		id: "search1",
		//search: "startminutesago=1 index=_internal | stats c by group | head 2",
		search: "",
		max_count: 999999,
		preview: false,
		cache: false
	});

	// count: 0 needed for avoiding the 100 limit (Thanks IRC #splunk!)
	var myResults = mainSearch.data("results", {count: 0});

	// tested with "on search:done" but unexpected results happened
	myResults.on("data", function() {  

		// field names separated by comma
		var searchresults = myResults.data().fields.join();

		// debug code
		//console.log(myResults.collection());

		// loop through the result set
		for (var i=0; i < myResults.data().rows.length; i++) {
			searchresults = searchresults + '\n' + myResults.data().rows[i];
		}

		// better than cookie!
		localStorage.setItem('searchresults',searchresults);

		// search loaded, triggering iframe creation
		createIframe();

	});

	// keep search bar and manager in sync
	var searchbar1 = mvc.Components.getInstance('searchbar1');
	var search1 = mvc.Components.getInstance('search1');

	searchbar1.on('change', function(){
		search1.settings.unset('search');
		search1.settings.set('search', searchbar1.val());
	});
});

</script>

{% endblock js %}

The initial page for the app loads an empty search bar with a table view component right below it. After running a search, the table displays the search results and also triggers the RAW workflow, by loading the textarea with the table’s content.
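
One caveat with the loop in the code above: concatenating each row array to a string relies on the default array-to-string conversion, so a value that itself contains a comma would shift the columns once RAW parses the text. A hedged sketch of a quoting helper follows; `csvField` and `toCsv` are hypothetical names, and the input shape (an array of field names plus an array of row arrays) is assumed to match what `myResults.data()` returns.

```javascript
// Quote a single value when it contains a comma, a double quote or a line break,
// doubling any embedded double quotes (RFC 4180 style).
function csvField(value) {
  const text = value === null || value === undefined ? "" : String(value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// fields: array of column names; rows: array of arrays in the same column order.
function toCsv(fields, rows) {
  const lines = [fields, ...rows].map((row) => row.map(csvField).join(","));
  return lines.join("\n");
}

const csvText = toCsv(
  ["group", "note"],
  [
    ["queue", "a,b"],
    ["pipeline", 'say "hi"'],
  ]
);
console.log(csvText);
```

With a helper like this, the loop would shrink to a single call such as `toCsv(myResults.data().fields, myResults.data().rows)`.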

Meet the workflow

In a nutshell, the visualization workflow works like Splunk’s default. The user runs a search command, formats the results and finally clicks on the “Visualization” tab. Likewise, with this app the user is also able to customize chart options and export the results in different formats.

First Example

Here’s the first example in action, reachable via the Chart Examples menu. The data comes from the Transport for London data portal; this specific data set (CSV) is a sample of the Rolling Origin & Destination Survey (RODS), available under the “Network Statistics” section of the portal.

Before handling the CSV file, the following command is needed for cleaning up the file header, basically replacing slashes and spaces with a “_” char:

sed -i '1,1s/[[:blank:]]*\/[[:blank:]]*\|\([[:alnum:]]\)[[:blank:]]\+\([[:alnum:]]\)/\1_\2/g;' rods-access-mode-2010-sample.csv
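
If sed is not at hand, the same header cleanup can be approximated in JavaScript. This is a sketch rather than an exact port: it uses lookarounds so that consecutive single-character words are all joined, and the function name and sample column names are made up for the example.

```javascript
// Clean only the first line of a CSV text: a slash (with optional blanks around it)
// or a run of blanks between two alphanumeric characters becomes "_".
function cleanCsvHeader(csvText) {
  const newlineAt = csvText.indexOf("\n");
  const header = newlineAt === -1 ? csvText : csvText.slice(0, newlineAt);
  const rest = newlineAt === -1 ? "" : csvText.slice(newlineAt);
  const cleaned = header.replace(
    /[ \t]*\/[ \t]*|(?<=[A-Za-z0-9])[ \t]+(?=[A-Za-z0-9])/g,
    "_"
  );
  return cleaned + rest; // data rows are left untouched
}

// Hypothetical header and data row, not the real RODS columns.
const cleanedCsv = cleanCsvHeader("Access Mode / Zone,Time Band\nBus / Coach,AM Peak\n");
console.log(cleanedCsv);
```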

After clicking on the example link, the search bar gets preloaded with a specific search command, which triggers the table reload:

The results are synced to RAW’s input component, which is fully editable just in case:

The user is then able to choose one chart type (multiples available). Here, the Alluvial/Sankey diagram is chosen:

There’s also an option for adding your own chart in case you are willing to integrate your D3 code implementation with the project.

The next step is to select which fields (columns) will be part of the diagram/chart, and also how they will relate to the chart’s components (dimensions, steps, hierarchy, etc). For doing so, a nice drag and drop interface eases the job.

Just follow the step-by-step instructions included within the example. The final mapping setup should look like the following:

Finally, here’s the chart generated in the end:

As you can see from this simple example, the chart better conveys the idea of flow & proportionality among the dimensions as compared to other usual charting options out there.

Optionally, the user is able to customize colors, sorting and other stuff, which may differ depending on the chart chosen. Exporting options are also available (SVG/HTML, PNG, etc).

Second Example

The second example leverages data from the World Bank data portal related to Internet subscribers. For this case, I’ve decided to apply a few constraints so that it becomes a bit simpler to render the results:

  • Only a few countries are filtered in;
  • Time period considered is 2000-2009.

By following roughly the same steps described in the previous example, the search bar gets preloaded with a search command and the user is instructed to follow a few steps to generate the graph. In this case, it is a Bump Chart, similar to the one featured in the NYT.

I hope the screenshots speak for themselves. Detailed instructions are available from the app’s documentation and examples.

Here’s a list of currently supported charts/diagrams: Sankey / Alluvial, Bump Chart, Circle Packing, Circular / Cluster Dendrogram, Clustered Force Layout, Convex Hull, Delaunay Triangulation, Hexagonal Binning, Parallel Coordinates, Reingold-Tilford Tree, Streamgraph, Treemap, Voronoi Tessellation.

Comments and suggestions are more than welcome! The app is available at Splunk’s app portal, and I will later upload the code to a common place (GitHub?) so it is easier for everyone to access and modify it.