Big Data Isn’t Sexy (And That’s OK)

By Brett Goldstein • February 22, 2017

Big data isn’t sexy anymore. Though it might be hard to believe, long before anyone had heard of deep learning, “big data” was the belle of the ball for government (and private sector) innovators. All they had to do was say “big data” and people would be impressed. Unfortunately, that often meant just holding a hackathon and releasing some datasets. At the other end of the spectrum, some big, complex problems were solved with data and celebrated as breakthroughs. The problem is, the attention at both extremes ignored the real impact data could have on government’s daily operations, and that everyday work fell by the wayside. Spoiler alert: Big data was never sexy, but that doesn’t mean we should ignore it.

When we wrote Beyond Transparency: Open Data and the Future of Civic Innovation a number of years ago to survey the open data landscape, I felt it was critical that we think far beyond the idea of “open data” to ensure that we realized true value from the nascent efforts we were seeing in both open data and data analytics.

Now that civic data has matured, it is time for us to reinforce this concept of getting value from all of the data we have and leveraging it to do business better. Sometimes we get caught up in making sure a data product is shiny and sexy when instead we should focus on the core principles of how we make internal operations, from big business to government, run better and more efficiently. Major impact can start with small change, and not everything needs to be complicated. In fact, simple analysis can lead to large outcomes.

Let me offer up an example. Every jurisdiction (along with almost every company) maintains a financial ledger. These tend to be somewhat uniform in their presentation — dates, amounts, categories, recipients, etc. It is widely acknowledged that most payment systems carry some error: duplicates, typos, fraud, etc. The exact error rate is subject to much discussion and is not material to my point.

This data is available but not frequently scrutinized, and most people assume even basic analysis of it would be too complex. I would suggest that we are able to do quite a bit on our own. A fun and easy example is to apply something like Benford’s law to the financial data. The law, which traces back to Simon Newcomb and Frank Benford, describes the expected distribution of leading digits in certain types of data, including financial data. (Feeling bored? Read more here.)
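
For reference, the expected pattern is simple to compute: under Benford’s law, the probability that a number’s leading digit is d equals log10(1 + 1/d). A minimal Python sketch of the predicted frequencies:

```python
import math

# Benford's law: the probability that a number's leading digit is d
# equals log10(1 + 1/d), for d = 1 through 9.
for d in range(1, 10):
    print(f"{d}: {math.log10(1 + 1 / d):.1%}")
```

Roughly 30 percent of amounts should start with a 1 and fewer than 5 percent with a 9; a ledger whose leading digits stray far from those proportions deserves a second look.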

Applying the law is actually quite easy. A quick search turns up an R package, and GitHub hosts multiple Python implementations that are ready to use. From there, it is remarkably simple: take the spreadsheet of financial data, run the code against it, and start flagging anomalies.

To demonstrate how easy this is, I decided to see what kind of quick financial analysis I could do on a recent flight from Chicago to Philadelphia. In just a few minutes, these were my steps:

  1. I leveraged a script posted on Activestate a few years ago — note it is only 52 lines of code. (A sketch of what such a script does follows this list.)

  2. I downloaded a city's payments data from its open data portal.

  3. I ran the following command: python benford.py Payments.csv.

  4. Two seconds went by.
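
The Activestate script itself isn’t reproduced here, but the sketch below shows the kind of thing such a script does. It assumes a CSV whose last column holds the payment amount; that column position, and the file layout generally, are my assumptions rather than the original script’s.

```python
import csv
import math
import sys
from collections import Counter

# Expected Benford frequencies for leading digits 1 through 9.
EXPECTED = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def leading_digit(value):
    """Return the first nonzero digit of a numeric string, or None."""
    for ch in value.lstrip("-$ "):
        if ch in "123456789":
            return int(ch)
        if ch not in "0.,":
            return None  # not a numeric field (e.g., a header row)
    return None

def main(path):
    counts = Counter()
    with open(path, newline="") as f:
        for row in csv.reader(f):
            # Assumes the amount sits in the last column; adjust for your file.
            digit = leading_digit(row[-1]) if row else None
            if digit:
                counts[digit] += 1
    total = sum(counts.values())
    print("digit  observed  expected")
    for d in range(1, 10):
        print(f"{d:>5}  {counts[d] / total:8.1%}  {EXPECTED[d]:8.1%}")

if __name__ == "__main__":
    main(sys.argv[1])
```

Run it as python benford.py Payments.csv and compare the two columns: any digit whose observed share sits far from its expected share marks the set of transactions to pull for review.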

 

[Figure: City 1 — bar graph of leading-digit frequencies, with the curve going down from left to right]

With five minutes of effort, I combed through years of data, found an atypical pattern, and identified an interesting area of focus. The barrier to entry was minimal and the output was fast. It also gave me quick indicators of a set of transactions that would require further scrutiny: a simple way of potentially finding errors, duplications, or other types of financial inefficiencies.

As the flight ran long, I downloaded a second city’s data and ran the same analysis. Interestingly, its data follows a much more predictable pattern. The quick algorithmic check seems to work — and that city can turn to other problems.

[Figure: City 2 — bar graph of leading-digit frequencies, with the curve going down from left to right and aligned with the predicted dots]
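
Eyeballing the two charts is enough for a plane flight, but a goodness-of-fit test can put a number on the deviation. This goes a step beyond the five-minute exercise above; a chi-square test against the Benford proportions is one standard way to do it, sketched here with made-up digit counts:

```python
import math

from scipy.stats import chisquare

# Hypothetical first-digit counts (digits 1 through 9) tallied from
# a payments file; these numbers are invented for illustration.
observed = [3011, 1788, 1204, 988, 812, 690, 601, 530, 476]
total = sum(observed)

# Expected counts under Benford's law.
expected = [total * math.log10(1 + 1 / d) for d in range(1, 10)]

# Chi-square goodness-of-fit test: a small p-value means the observed
# digits deviate from Benford's law more than chance alone would explain.
stat, p_value = chisquare(observed, f_exp=expected)
print(f"chi-square = {stat:.1f}, p = {p_value:.4f}")
```

A small p-value does not prove fraud or error; it simply says the digit pattern is unlikely to arise by chance, which is the same triage signal the charts give visually.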

This approach obviously will not root out every potential problem in the data, but it is a start. Better still, implementing analysis like this will get more government employees comfortable with data analysis, paving the way for more complex work down the road. I should also note that while I used a government data example above, math is universal: this technique works just as well in the private sector as it does in the public sector, no matter where we operate.

Big data isn’t sexy. But it doesn’t have to be. Data analytics should not be about getting a write-up in the press; it should be about making the hard, nitty-gritty work of government more efficient. Simple analysis can drive tremendous impact in government. People want to see their tax dollars spent more wisely, and that starts with analyzing the data we have now, using techniques that are proven and well-established.

About the Author

Brett Goldstein

Brett Goldstein is an Innovations in American Government Fellow at the Ash Center at Harvard Kennedy School. He is also the Senior Fellow in Urban Science at Chicago Harris. In this role, he advises governments and major universities around the world on how to use data to inform smarter government decision-making and leads research projects using big data and analytics to better understand urban ecosystems. He also advises Harris on the master of science program in Computational Analysis and Public Policy, offered jointly with the Department of Computer Science. Goldstein works with the Computation Institute's Urban Center for Computation and Data (UrbanCCD) and serves as a liaison to other major universities that are beginning to do research and teaching in urban science, greatly broadening the reach and impact of the activities at Chicago Harris.