This is the first post in the Data Science Deep Dive series, where we discuss our data science practices at Delfina. In this post, we’ll talk about what we call “second order missingness” in EHR data.
Missing data is a fact of life for most data scientists. Data is considered to be missing if there's a piece of information we get to see for some subjects in the population, but not all subjects. For example, if a survey is administered to patients, and some patients decline to answer a question about health history, we'd consider information about health history to be missing for those patients.
However, patients who did answer the question may have given us inaccurate information, maybe because they forgot part of their medical history or did not want to share the information. In situations like this, we say the data is subject to measurement error (or misclassification). Something about our data collection process—in this case, just asking the patient—fails to capture the full truth about the information.
In both cases, there's a disconnect between what we see and what we want to see. In the case of missing data, the disconnect is total. All we see is a question mark. In the case of measurement error, the nature of the disconnect varies by situation, but we can typically make some reasonable assumptions to tie what we see back to something close to the truth.
There are some surprisingly simple situations in which neither paradigm quite applies. Suppose we have the modest aim of just trying to figure out whether something has or has not happened. In our work, for example, we often would like to know whether a patient has ever been diagnosed with hypertension. We typically have access to electronic health records (EHRs) from a single clinic or health system, which include information from patient intake forms, diagnostic codes, and provider notes. If we don't see evidence of a hypertension diagnosis in the EHRs, can we conclude that the patient has never been diagnosed with hypertension?
“Absence of evidence is not evidence of absence” - Ancient proverb
It would seem that we can’t reasonably make that conclusion. The quote above has been invoked in every field from anthropology to cosmology to medicine. Proving a negative is difficult. How do we know whether the hypertension diagnosis is not in the record because it never happened, or because we're missing a critical piece of EHR history?
Methods for handling missing data often assume we have access to something called a "missingness indicator," which tells us whether data is missing. When we know for a fact that data is missing—as in the case of unanswered survey questions, or unadministered medical labs—this assumption makes sense. In these situations, we do have “evidence of absence”. We know exactly what we don't know. Missing data is always difficult to handle correctly, but there is at least an extensive literature on dealing with the case of known unknowns.
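To make the idea of a missingness indicator concrete, here is a minimal sketch for the survey example. The patient IDs and answers are made up for illustration, and `missingness_indicator` is a hypothetical helper, not part of any library.

```python
from typing import Optional

# Hypothetical survey answers to "Have you ever been diagnosed with hypertension?"
# None marks a question the patient declined to answer.
survey_responses: dict[str, Optional[bool]] = {
    "patient_a": True,
    "patient_b": False,
    "patient_c": None,
}


def missingness_indicator(responses: dict[str, Optional[bool]]) -> dict[str, bool]:
    """Return True for each patient whose answer is missing."""
    return {patient_id: answer is None for patient_id, answer in responses.items()}


indicator = missingness_indicator(survey_responses)
```

Here the indicator is fully observed: for every patient, we can say with certainty whether the answer is missing. That is the "known unknowns" setting most missing-data methods assume.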
Unfortunately, in EHR data, we often have only an “absence of evidence”. It’s even worse than that: in the case of an absent diagnosis code for hypertension (see Figure), evidence of absence and absence of evidence present in exactly the same way, so we don't know which one we have. If we have absence of evidence, the data is missing; if we have evidence of absence, it’s not. Because we can't tell which situation we are in, the missingness indicator itself is missing. For that reason, we can think of absence of evidence as being subject to what we will call “second order missingness”.
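A small sketch can illustrate why the indicator cannot be computed from EHR data alone. The `Patient` class, the two example patients, and the placeholder code `OTHER_CODE` are assumptions made up for this illustration; `I10` is the ICD-10 code for essential (primary) hypertension.

```python
from dataclasses import dataclass, field

HYPERTENSION_CODE = "I10"  # ICD-10: essential (primary) hypertension


@dataclass
class Patient:
    # Ground truth, which we never get to see in practice.
    truly_diagnosed: bool
    # Diagnosis codes present in the EHR extract we actually hold.
    observed_codes: set[str] = field(default_factory=set)


def has_hypertension_code(patient: Patient) -> bool:
    """Check only what the EHR extract shows."""
    return HYPERTENSION_CODE in patient.observed_codes


# Hypothetical patient whose full history is in our records: never diagnosed.
never_diagnosed = Patient(truly_diagnosed=False, observed_codes={"OTHER_CODE"})
# Hypothetical patient diagnosed at another health system we have no records from.
diagnosed_elsewhere = Patient(truly_diagnosed=True, observed_codes={"OTHER_CODE"})

lookups = [has_hypertension_code(p) for p in (never_diagnosed, diagnosed_elsewhere)]
```

Both lookups come back `False`, and the observed records are identical, so nothing in the data distinguishes "never diagnosed" from "diagnosed, but not recorded here". Unlike the declined survey question, there is no `None` to flag.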
When this kind of problem comes up in other fields, such as single-cell gene expression analysis, analysts often reach for imputation-based methods, where potentially missing information is “filled in” based on other observed variables. In our analyses of pregnant patients, we can’t use these solutions. If the missingness is tied in any way to underlying characteristics of the patient that we don't see (and it often is), then imputation-based methods will lead to biased results (Sterne et al., 2009). Pregnancy care is also much less forgiving of mistakes. If gene expression data were incorrectly imputed for a single cell, the effect on the overall analysis would likely be negligible. If we get something wrong for a single patient, it can affect that patient's care.
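The bias is easy to reproduce in a toy simulation. The sketch below is deliberately simplified and every number in it is an assumption: a true diagnosis rate of 30%, with diagnosed patients losing their record half the time and undiagnosed patients only 10% of the time. It also assumes we know which records are missing, which is more than second order missingness gives us.

```python
import random


def simulate(n: int = 100_000, seed: int = 0) -> tuple[float, float]:
    """Return (true diagnosis rate, rate estimated after mean imputation)."""
    rng = random.Random(seed)
    truth: list[bool] = []
    observed: list[bool | None] = []
    for _ in range(n):
        diagnosed = rng.random() < 0.30
        # Missingness depends on the very thing we are trying to measure.
        p_missing = 0.50 if diagnosed else 0.10
        truth.append(diagnosed)
        observed.append(None if rng.random() < p_missing else diagnosed)

    # Mean imputation: fill each gap with the rate among observed records.
    seen = [value for value in observed if value is not None]
    observed_rate = sum(seen) / len(seen)
    imputed = [observed_rate if value is None else float(value) for value in observed]
    return sum(truth) / n, sum(imputed) / n


true_rate, imputed_rate = simulate()
```

Under these assumptions the imputed estimate settles near 0.15 / 0.78, or about 0.19, well below the true rate of 0.30. Filling in gaps from the observed records simply reproduces the distortion already present in those records.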
While there is much we can learn from other situations in which second order missingness comes up, we have unfortunately not found much published work on principled methods for dealing with this problem. It's tempting to think we can get ourselves out of this situation with ad hoc rules. For example, we might decide that if we have X years of “complete-looking” data on a patient, and they have no record of a hypertension diagnosis, then they really don't have the diagnosis. If their records don’t look “sufficiently” complete, we treat the diagnosis as first-order missing data. A rule like this seems sensible, but it raises questions that could cause serious problems down the line.
To name just a few questions we'd have to ask of this approach: How do we decide on the number of “complete-looking” years of data we need? We can talk to clinicians, and we can get a bit of insight from the data itself, but at the end of the day there will be something arbitrary about this choice. How do we decide if health records look "complete"? The same exact set of health records may represent one patient's full treatment history, and only a third of a different patient's history. And why would we think there's a difference between X years and X years plus one day? We'd probably want to allow for at least some kind of gradient.
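Writing the ad hoc rule down as code makes the cliff at X visible. This is a sketch of the rule being criticized, not a recommended method; the threshold of 3 years and the function name `classify_hypertension` are hypothetical, and the span between first and last record stands in for "complete-looking" data.

```python
from datetime import date, timedelta
from typing import Optional

MIN_YEARS_OF_HISTORY = 3  # the arbitrary "X"; hypothetical value


def classify_hypertension(
    first_record: date, last_record: date, has_code: bool
) -> Optional[bool]:
    """Ad hoc rule: True = diagnosed, False = never diagnosed, None = missing."""
    if has_code:
        return True
    history_days = (last_record - first_record).days
    if history_days >= MIN_YEARS_OF_HISTORY * 365:
        # Enough history and no code: declare "evidence of absence".
        return False
    # Too little history: fall back to ordinary (first-order) missing data.
    return None


last_visit = date(2024, 1, 1)
just_enough = classify_hypertension(
    last_visit - timedelta(days=3 * 365), last_visit, has_code=False
)
one_day_short = classify_hypertension(
    last_visit - timedelta(days=3 * 365 - 1), last_visit, has_code=False
)
```

Two patients whose histories differ by a single day land in different categories: `just_enough` is `False` while `one_day_short` is `None`. The rule also says nothing about how much of each patient's care those records actually cover.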
For critical, health-related work, we'll want a more robust approach.
In the next post in our Data Science Deep Dive series, we'll discuss how we handle this problem at Delfina. If you’re passionate about using data to drive better pregnancy outcomes, check back with delfina.com/careers and keep up with us on LinkedIn for any openings. Thanks for reading along.