Unstructured Data: What Is It and Where Do Local Governments Produce It?

BY BETSY GARDNER • December 6, 2021

Those quick notes to yourself on Slack. The photo you snapped of the full whiteboard at the end of your last meeting. The Zoom call you recorded to reference later. All of this is data, information that is important and useful and necessary — and completely unstructured.

Unstructured data is “typically categorized as qualitative data, [and] cannot be processed and analyzed via conventional data tools and methods,” according to IBM. Data is growing at an exponential rate, yet the majority (most estimates say more than 80 percent) is unstructured, meaning that it isn’t in a conventional data model or format. That means texts, social media posts, survey responses, videos, PowerPoint presentations, web server logs, municipal chatbot conversations, audio clips, photos, and voice memos are all unstructured data, and together they represent a massive, untapped, and disorganized cache of information.

As in the examples above, plenty of unstructured data is generated over the usual course of business; in government, this includes the unstructured information produced by civil servants as well as publicly posted comments, complaint logs, 311 calls, and videos and photos sent from residents. This unstructured information is valuable for training, service improvements, and operational enhancements, but the unstructured form is challenging to process.

[Figure: Graph showing the rapid growth of data by year]

For local governments, data-driven policies that only rely on easily analyzed, quantitative and numerical data won’t produce the most informed and inclusive decisions possible. Although it requires additional processing, unstructured data can be incredibly useful in the spheres of public health, regulation, public safety, and transportation.

How Can Unstructured Data Become Usable?

Most unstructured data can be processed by AI. Natural language processing, image analysis, and machine learning are all tools that can turn unstructured information into useful, readable data.

Natural language processing (NLP) is a well-established way to analyze written and spoken language. More advanced analysis can also identify the sentiments behind language, as in vaccine sentiment analysis projects. Governments are using NLP in a variety of ways, from chatbots to document analysis. The Treasury of New South Wales, Australia, used NLP to review troves of regulatory data to find outdated or burdensome language and legislation. Once these confusing or unnecessary sections have been identified, employees can pull the flagged areas and find ways to improve or update them.
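The document-review workflow described above (scan a large body of text, flag sections, hand them to employees) can be illustrated with a deliberately simple sketch. This is plain keyword matching rather than a full NLP pipeline, and it is not the method used in New South Wales; the term list, section numbers, and function name are all hypothetical.

```python
import re

# Hypothetical terms a reviewer might treat as outdated (illustrative only).
OUTDATED_TERMS = ["telegraph", "facsimile", "in triplicate"]

def flag_sections(sections, terms=OUTDATED_TERMS):
    """Return (section_id, matched_terms) for every section containing a listed term."""
    flagged = []
    for section_id, text in sections.items():
        # Whole-word, case-insensitive match for each term.
        hits = [
            term for term in terms
            if re.search(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE)
        ]
        if hits:
            flagged.append((section_id, hits))
    return flagged

# Made-up regulatory text for demonstration.
sections = {
    "4.1": "Applications must be submitted in triplicate by facsimile.",
    "4.2": "Applications may be submitted online.",
}
print(flag_sections(sections))
```

A real system would likely go beyond fixed keywords, but the division of labor is the same: the software narrows the search, and a person decides what to change.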

Image analysis is another way that AI can pull information from unstructured data like photographs and videos. In Spain, scientists working on traffic accident prevention have identified “visual layouts,” or scenes that correlate with accidents, by training AI to scan scenes of streets before, during, and after accidents to find visual patterns. Once these locations have been identified, urban planners and city officials can work to prevent the conditions that lead to accidents. In Dallas, Texas, one of the most dangerous cities for driving in the United States, city officials are also using image analysis to improve traffic safety. Gathering and translating visual data about traffic conditions, speeding, accidents, and road conditions is a key method for increasing public safety.

At a basic level, machine learning is a way for algorithms to find patterns in large amounts of data, structured and unstructured, using statistical analysis. Machine learning serves many functions across multiple departments, although one of the most important contributions is monitoring for things like fraud and misuse of funds. For example, federal agencies such as the Internal Revenue Service, Securities and Exchange Commission, and Department of the Treasury deploy machine learning algorithms to detect suspicious movement of funds or insider trading. Once machine learning identifies possible issues by finding discrepancies in data patterns, an employee can then evaluate the flagged issue, saving time and money.
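As a rough illustration of "finding discrepancies in data patterns," the sketch below flags payments that sit far from the average using a basic z-score. This is a simple statistical stand-in, not the approach any of the agencies named above are known to use; the amounts, the threshold of 2.0, and the function name are assumptions made for the example.

```python
from statistics import mean, pstdev

def flag_outliers(amounts, threshold=2.0):
    """Return (index, amount) pairs whose z-score magnitude exceeds the threshold."""
    avg = mean(amounts)
    spread = pstdev(amounts)
    if spread == 0:
        # All values identical: nothing stands out.
        return []
    return [
        (i, amount)
        for i, amount in enumerate(amounts)
        if abs(amount - avg) / spread > threshold
    ]

# Made-up payment amounts; one is far larger than the rest.
amounts = [120, 95, 110, 105, 130, 98, 115, 102, 5000, 108]
print(flag_outliers(amounts))
```

The flagged items are only candidates. As the paragraph above notes, an employee still evaluates each one.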

Machine learning can also evaluate unstructured data like employee notes, oral transmissions from police officers, and public meeting comments in order to identify common issues or complaints. It expedites the process, helping employees recognize patterns while directing them to the areas most in need of attention. Cities are also investigating machine learning that might predict adverse police behavior based on a combination of structured and unstructured data.
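To make "identifying common issues" concrete, here is a minimal sketch that counts the most frequent meaningful words across a set of resident comments. Word counting is a much cruder technique than the machine learning described above, and the comments, stopword list, and function name are invented for illustration.

```python
import re
from collections import Counter

# Small hypothetical stopword list; a real one would be much longer.
STOPWORDS = {"on", "the", "is", "near", "again", "another", "out"}

def top_terms(comments, n=2):
    """Return the n most frequent non-stopword terms across all comments."""
    counts = Counter()
    for comment in comments:
        words = re.findall(r"[a-z]+", comment.lower())
        counts.update(w for w in words if w not in STOPWORDS)
    return counts.most_common(n)

# Made-up comments for demonstration.
comments = [
    "Pothole on Main Street again",
    "The pothole near the school is dangerous",
    "Streetlight out on Elm",
    "Another pothole on Main",
]
print(top_terms(comments))
```

Even this simple tally surfaces a recurring complaint, which hints at why more capable pattern-finding tools are useful once comments number in the thousands.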

What Are the Challenges for Cities?

Implementing machine learning, NLP, and image analysis does require an investment in both the technical tools to harness unstructured data and the employee knowledge and skillset to process it. Researchers in the United Kingdom expect that “realising the full value in local public data will require the emergence of a new type of role within local government,” which could be a roadblock for municipal governments. However, many cities have developed data training to upskill employees and purposefully foster a data culture among public employees.

Since cities are already producing – and receiving – unstructured data, it’s worthwhile to invest in a system for processing this type of data, and the employees that can handle it. For example, the city-parish of Baton Rouge experienced extreme flooding in 2016. During the flood, the city’s GIS team was desperately trying to map where the water was worst, in order to find which residents were dealing with severe housing and flood damage and how disaster services should be distributed. While the city had its own data sources, the GIS team asked residents to send photos of their conditions and water damage to supplement the official data. Residents quickly responded and posted photos to the local government through social media. Thankfully, the GIS team was able to process these photos and better react to the flooding.

City leaders can determine the most common forms of unstructured data that they produce and choose to invest in a process to analyze that type of data. In the above example, Baton Rouge was relying on image analysis, as are the city officials in Dallas who are analyzing traffic patterns. Yet investing in NLP might be more effective for cities that need to scan and review massive amounts of text, such as when reviewing outdated building codes and zoning laws.

Another crucial challenge for cities is making sure that AI, NLP, and machine learning systems are as unbiased as possible. That goal requires a process to review and correct issues, such as systems that incorporate racial and gender biases that might carry negative downstream effects. There are several resources for preventing AI bias, from the federal government down to creating fairer AI within local governments, as well as ways to engage communities in bias prevention work.

Unstructured data provides the means for exposing everything from bias to inefficiency, but it also carries risks in terms of privacy and bias. Cities that can make progress by learning from unstructured data while protecting the privacy rights of their residents can provide better solutions.


About the Author

Betsy Gardner

Betsy Gardner is the editor of Data-Smart City Solutions and the producer of the Data-Smart City Pod. Prior to joining the Ash Center, Betsy worked in a variety of roles in higher education, focusing on deconstructing racial and gender inequality through research, writing, and facilitation. She also researched government spending and transparency at the Lincoln Institute of Land Policy. Betsy holds a master’s degree in Urban and Regional Policy from Northeastern University, a bachelor’s degree in Art History from Boston University, and a graduate certificate in Digital Storytelling from the Harvard Extension School.