Can We Predict Where Housing-related Public Health Problems Exist Using City Data?

By Katharine Robb • MARCH 22, 2021


Housing is a powerful social determinant of health, impacting social relationships, economic opportunity, environmental exposures, and security. The conditions of homes and neighborhoods influence everything from asthma to mental health. During the COVID-19 pandemic, safe housing is more important than ever. Not only are people spending more time at home where they may be exposed to dangerous conditions, but government budgets to prevent and respond to public health threats are reduced.

One of the primary strategies for breaking the link between poor housing and poor health is the enforcement of housing codes. Modern housing codes set minimum health and safety standards for rental housing. Certain code violations are 1) indicative of health risks (e.g., the association between mold or insect infestations and asthma) and/or 2) of high priority to cities (e.g., to identify overcrowded properties and plan for future housing needs). Housing code violations are an untapped wellspring of data on otherwise hidden public health risks.

The integration of housing code violation data and other administrative city data (such as the age, value, and size of homes, and police and fire calls), combined with machine learning, can reveal how widespread housing-related health risks are and where they are located. These capabilities can help cities, healthcare systems, and community organizations target limited resources, develop interventions, and track improvements to tackle public health risks effectively, efficiently, and equitably.


I began working in the City of Chelsea, MA as an Innovation Fellow with the Harvard Kennedy School Innovation Field Lab led by Jorrit de Jong. I spent a summer working closely with housing inspectors, accompanying them on inspections and observing the data they collected on housing code violations that threatened health. I wondered if we could use data the city was already collecting – not only as part of inspections, but across other departments – to predict where and how widespread Chelsea’s housing-related public health problems were. Research assistant Ashley Marcoux and I worked with staff across city departments to identify and digitalize administrative city datasets – linking the data by property ID through a collaboration with Tolemi, a city data analytics firm. With research assistant Nicolas Diaz Amigo, we built machine learning models using these data to predict the probability a given property in Chelsea, MA would have: 1) any housing code violation, 2) a set of high-risk public health violations, and 3) a specific high-risk public health violation (overcrowding).
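The linking step described above can be sketched as follows. This is only an illustration of joining departmental records on a shared property ID; the datasets, field names, and values are hypothetical and do not reflect Chelsea's actual schema or Tolemi's platform.

```python
from collections import defaultdict

# Hypothetical records from three city departments; all field names are illustrative.
assessor = [
    {"property_id": "P001", "year_built": 1905, "units": 3},
    {"property_id": "P002", "year_built": 1988, "units": 2},
]
fire_calls = [
    {"property_id": "P001", "call_type": "alarm"},
    {"property_id": "P001", "call_type": "medical"},
]
inspections = [
    {"property_id": "P001", "violations": ["no_smoke_detector"]},
]

def link_by_property_id(assessor, fire_calls, inspections):
    """Build one row per property, keyed on the shared property ID."""
    # Collapse event-level fire calls into a per-property count.
    calls_per_property = defaultdict(int)
    for call in fire_calls:
        calls_per_property[call["property_id"]] += 1
    inspected = {row["property_id"]: row["violations"] for row in inspections}

    linked = {}
    for row in assessor:
        pid = row["property_id"]
        linked[pid] = {
            "year_built": row["year_built"],
            "units": row["units"],
            "fire_calls": calls_per_property.get(pid, 0),
            # None marks a property that has not been inspected yet.
            "violations": inspected.get(pid),
        }
    return linked

linked = link_by_property_id(assessor, fire_calls, inspections)
```

Keeping uninspected properties in the linked table (with no violation record) matters here, because those are the properties a model would later need to score.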

Setting and Methods:

Chelsea is a small, densely populated, demographically diverse city located just outside Boston. Almost half of residents are foreign-born (46 percent). Per-capita income is $23,240/year, making Chelsea one of the poorest cities in Massachusetts. Chelsea has been especially hard-hit by the COVID-19 pandemic, with infection rates six times the state average in the spring of 2020. Most homes are wooden two-to-four-family units, built over a century ago. Almost 70 percent of residents are renters. Motivated to improve housing conditions, Chelsea has run a proactive housing inspection program since 2015.

We selected the outcome of “any housing code violation” to estimate how common properties with one or more of 38 possible housing code violations are and where they are located. The “high-risk violation” outcome is a composite of any of four violations: 1) lack of or non-functioning smoke detectors or carbon monoxide meters, 2) keyed locks on internal room doors, an indicator of overcrowded conditions in Chelsea, 3) infestations of insects, rodents, or skunks, and 4) accumulation of garbage in living areas. We defined a separate “high-risk violations” outcome because not all violations represent significant threats to public health; we identified these four in partnership with Chelsea’s housing inspectors based on their local priorities. Finally, we selected a single violation type (overcrowding) as an outcome to demonstrate how integrated city data and machine learning can identify specific public health risks of concern in Chelsea. The full manuscript describing our model building, optimization, test characteristics, and results can be found here.
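To make the three outcome definitions concrete, here is a minimal sketch of how an inspection record might be turned into binary labels. The violation code strings are hypothetical; only the grouping into four high-risk categories, with keyed internal locks standing in for overcrowding, comes from the description above.

```python
# Hypothetical violation codes; the four high-risk categories follow the text.
HIGH_RISK = {
    "smoke_or_co_detector",     # missing or non-functioning detectors
    "keyed_internal_locks",     # indicator of overcrowding
    "infestation",
    "garbage_in_living_areas",
}
OVERCROWDING = {"keyed_internal_locks"}

def label_outcomes(violations):
    """Map the violations found at one inspected property to three 0/1 outcomes."""
    found = set(violations)
    return {
        "any_violation": int(bool(found)),
        "high_risk": int(bool(found & HIGH_RISK)),
        "overcrowding": int(bool(found & OVERCROWDING)),
    }

# A property with only a lower-priority violation counts toward the first outcome only.
example = label_outcomes(["peeling_paint"])
```

Because overcrowding is one of the four high-risk categories, a positive overcrowding label always implies a positive high-risk label, which in turn implies a positive "any violation" label.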


We first examined the prevalence and location of housing code violations in properties that had been proactively inspected between 2015 and 2019. More than half (54 percent) of inspected properties had a housing code violation, the majority of which were classified as high-risk (85 percent). Nearly 30 percent of properties contained a violation related to overcrowding. Housing codes stipulate the bare minimum standard for habitable housing; the fact that more than half of Chelsea’s renters live in homes with at least one code violation is significant.
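Observed prevalence of this kind is simply the share of inspected properties with a positive label. The sketch below uses four made-up properties, not the Chelsea data, and assumes each property already carries 0/1 labels for the three outcomes.

```python
def prevalence(labels, outcome):
    """Share of inspected properties whose label for `outcome` is 1."""
    if not labels:
        raise ValueError("prevalence is undefined with no inspected properties")
    return sum(row[outcome] for row in labels) / len(labels)

# Four made-up inspected properties (not the Chelsea data).
inspected_labels = [
    {"any_violation": 1, "high_risk": 1, "overcrowding": 1},
    {"any_violation": 1, "high_risk": 1, "overcrowding": 0},
    {"any_violation": 1, "high_risk": 0, "overcrowding": 0},
    {"any_violation": 0, "high_risk": 0, "overcrowding": 0},
]
rates = {name: prevalence(inspected_labels, name) for name in inspected_labels[0]}
```

These rates describe inspected properties only; saying anything about uninspected properties requires a predictive model.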


[Figure: Proportion of Properties with Code Violations (Observed). Axis: Observed Proportion of Inspected Properties. Categories include Any Violation, High-Risk Violation, No Smoke Detectors, and Garbage in Living Areas.]

*The High-Risk Violation category also includes the overcrowding violation (represented by locks on internal room doors)

We then estimated the probability of each outcome for the entire city (both inspected and uninspected properties). Maps of the estimated probabilities of each outcome reveal their spatial distribution and prevalence. Properties with estimated probabilities above 0.5 are predicted to be positive for that outcome and appear as orange/red on the maps. A large portion of the city is predicted to have any code violation and high-risk violations. A high probability of overcrowding is predicted in only a small section of the city.
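The article does not name the model family here, so the following is only a minimal sketch that swaps in a plain logistic regression trained by gradient descent. It fits on inspected properties, scores uninspected ones, and applies the 0.5 cutoff described above. The two features (building age in centuries, number of fire calls), the property IDs, and all values are made up for illustration.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and intercept by batch gradient descent on the log loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_proba(model, x):
    """Estimated probability that a property is positive for the outcome."""
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Made-up inspected properties: [building age in centuries, fire calls] and labels.
X_inspected = [[1.2, 3], [0.9, 2], [1.0, 1], [0.3, 0], [0.25, 0], [0.1, 0]]
y_inspected = [1, 1, 1, 0, 0, 0]
model = fit_logistic(X_inspected, y_inspected)

# Made-up uninspected properties to score.
uninspected = {"P101": [1.1, 2], "P102": [0.2, 0]}
THRESHOLD = 0.5  # probabilities above this are predicted positive
probabilities = {pid: predict_proba(model, x) for pid, x in uninspected.items()}
predicted = {pid: p > THRESHOLD for pid, p in probabilities.items()}
```

In practice one would fit a separate model per outcome and check its performance on held-out inspected properties before trusting citywide estimates.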

Spatial Distribution and Prevalence of Predicted Housing Code Violations in Chelsea, MA

Heat maps side-by-side showing violations and overcrowding.

*Each circle represents a property and its color represents the predicted probability for each outcome. Circles are enlarged to protect privacy. Areas without color contain no rental properties.


Integrated city data and machine learning can pinpoint areas at elevated risk for housing-related health problems, which can inform and enhance existing programs. For example, asthma home visiting programs can use data on where asthma triggers are more likely to occur so that services reach all areas in need; fire safety education programs can use data on the prevalence of homes lacking smoke detectors. The data can also be used to prioritize inspection of properties based on their predicted risk level. Cities may choose to prioritize identification of higher-risk violations or those that are especially problematic for the health of their residents.
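Prioritizing inspections by predicted risk can be as simple as ranking properties that have not yet been inspected. The sketch below assumes a dictionary of predicted probabilities per property and a fixed inspection capacity; the function name, property IDs, and values are hypothetical.

```python
def prioritize(predictions, inspected_ids, capacity):
    """Return up to `capacity` uninspected property IDs, highest predicted risk first."""
    candidates = [
        (pid, prob) for pid, prob in predictions.items() if pid not in inspected_ids
    ]
    candidates.sort(key=lambda item: item[1], reverse=True)
    return [pid for pid, _ in candidates[:capacity]]

# Made-up predicted probabilities of a high-risk violation.
predictions = {"P201": 0.82, "P202": 0.35, "P203": 0.67, "P204": 0.91}
already_inspected = {"P204"}
queue = prioritize(predictions, already_inspected, capacity=2)
```

A city could run the same ranking on a different outcome, such as overcrowding, to match its own priorities.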

As cities respond to the COVID-19 pandemic, the use of housing data as an instrument to improve public health is urgently needed. Integrated city data and machine learning are tools cities can use to reach more people with essential services, even on tighter budgets. Because housing is a leading social determinant of COVID-19 infection risk and transmission, data on where overcrowding or other high-risk housing conditions are concentrated can help cities and community partners better support residents and plan for resource allocation.

Housing conditions have a far-reaching impact on health and well-being. Leveraging existing city data to identify and intervene on housing-related public health threats is an investment in the health of individuals, families, and communities.

About the Author

Katharine Robb

Katharine Robb studies urban environmental health in the US and low-resource settings abroad. She holds a Doctor of Public Health (DrPH) degree from the Harvard T.H. Chan School of Public Health and a Master's in Global Environmental Health from Emory University. She is currently a Postdoctoral Research Fellow at the Harvard Kennedy School Ash Center for Democratic Governance and Innovation.
