Longitudinal studies are often used to identify patient trends or patterns and to measure the efficacy of drugs or treatments. They require patient data from clinical as well as payer systems. Often this data is procured from specialized sources in a de-identified form. Because de-identified data lacks key patient identifiers, matching patients across these datasets is difficult, yet without linking the patients this type of study is impossible. Depending on the level of de-identification applied to the data, the matching process may need significant computation.
Additionally, the databases for clinical data and payer data do not have standardized representations for certain fields or values, thus making the research more challenging.
Hence patient linkage has become a key consideration in data science today.
Data Model – Why OMOP CDM?
Clinical and payer data sets differ completely in schema and are almost always in different formats, which slows effective research. Data standardization plays a critical role by transforming the two data sets into a common format, which makes research and analytics easier and allows sophisticated tools and methodologies to be shared. The Common Data Model created by the Observational Medical Outcomes Partnership (OMOP CDM) ensures the interoperability of distinct data sets. It is important to use the same information model to avoid ambiguity in concepts, vocabulary and other areas.
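As a rough illustration of what "transforming into a common format" can mean, the sketch below maps one EMR row and one claims row onto the same OMOP-style field names. The source column names (`sex`, `dob`, `admit_date`, `service_from` and so on) are hypothetical, and `zip3` is a convenience field introduced for this example rather than an OMOP CDM column. The gender concept IDs 8507 and 8532 are the standard OMOP concepts for male and female.

```python
# Standard OMOP gender concept IDs (8507 = male, 8532 = female).
# 0 is used here for values that cannot be mapped.
GENDER_CONCEPTS = {"M": 8507, "F": 8532}


def emr_to_common(row):
    """Map a hypothetical EMR row onto OMOP-style field names."""
    return {
        "gender_concept_id": GENDER_CONCEPTS.get(row["sex"], 0),
        "year_of_birth": int(row["dob"][:4]),  # assumes 'YYYY-MM-DD'
        "visit_start_date": row["admit_date"],
        "visit_end_date": row["discharge_date"],
        "zip3": row["zip"][:3],  # illustrative field, not an OMOP column
    }


def claims_to_common(row):
    """Map a hypothetical claims row onto the same field names."""
    return {
        "gender_concept_id": GENDER_CONCEPTS.get(row["gender"][:1].upper(), 0),
        "year_of_birth": int(row["birth_year"]),
        "visit_start_date": row["service_from"],
        "visit_end_date": row["service_to"],
        "zip3": row["zip3"],
    }
```

Once both sources share field names and value encodings, the same linkage code can be run against either one without source-specific branches.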
Approach to Linkage
Two algorithms are generally used for linking datasets: probabilistic and deterministic. Probabilistic linkage, also known as ‘fuzzy matching’, relies on a wide list of potential identifiers. Each identifier is assigned a weight based on its estimated ability to accurately identify a match. Pairs of records with probability scores above a certain threshold are considered matches, pairs scoring below a lower threshold are considered non-matches, and scores in between signify ‘possible’ matches.
Deterministic linkage, also called ‘rules-based’ linkage, identifies links based on the number of individual identifiers that match across datasets. It is an iterative process that runs a set of conditions against the data and then repeats with different conditions, each step more restrictive than the last. It is often described as an ‘all or nothing’ method of linking.
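The weighting idea behind probabilistic linkage can be sketched in a few lines of Python. This is only an illustration: it uses additive agreement weights as a stand-in for a true probability score, and the field names, weights and thresholds are assumptions, not values from this study.

```python
# Illustrative weights: how much agreement on each field counts toward a match.
# These values and both thresholds are assumptions, not taken from the study.
FIELD_WEIGHTS = {
    "year_of_birth": 2.0,
    "gender": 1.0,
    "zip3": 3.0,
    "visit_start_date": 4.0,
}
MATCH_THRESHOLD = 8.0     # at or above: treat as a match
POSSIBLE_THRESHOLD = 5.0  # between the two thresholds: 'possible' match


def score_pair(record_a, record_b, weights=FIELD_WEIGHTS):
    """Sum the weights of the fields on which two records agree."""
    return sum(
        weight
        for field, weight in weights.items()
        if record_a.get(field) is not None
        and record_a.get(field) == record_b.get(field)
    )


def classify_pair(record_a, record_b):
    """Label a record pair as 'match', 'possible' or 'non-match'."""
    score = score_pair(record_a, record_b)
    if score >= MATCH_THRESHOLD:
        return "match"
    if score >= POSSIBLE_THRESHOLD:
        return "possible"
    return "non-match"
```

In practice the weights would be estimated from the data, which is part of why this approach tends to be specific to one data set.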
To decide on an approach for the problem of linking patients across discrete data sets, the following questions can help:
- Will the field of data used be present in another dataset of the same type?
- Can the algorithm be used again?
- Which fields are available for use?
Probabilistic linkage is often more accurate; however, it relies on several data fields that are generally unavailable in de-identified data sets, such as SSN and first name.
Additionally, probabilistic linkage requires a more intensive, time-consuming and data-specific process. Although the results tend to be more accurate, they require far more time and resources and are specific to the data set, so the approach is not reusable.
Deterministic linkage, on the other hand, follows an iterative process that passes the data through a series of conditions that become gradually more constraining to increase accuracy.
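The iterative narrowing can be sketched as a chain of passes, each applying one more condition to the pairs that survived the previous pass. The function and condition names below are hypothetical and only illustrate the pattern.

```python
def run_passes(candidate_pairs, conditions):
    """Apply each condition in turn, keeping only the pairs that satisfy it.

    Returns the list of surviving pairs after every pass, so the narrowing
    of the pool can be inspected step by step.
    """
    history = []
    surviving = list(candidate_pairs)
    for condition in conditions:
        surviving = [pair for pair in surviving if condition(*pair)]
        history.append(surviving)
    return history


# Hypothetical conditions, ordered from least to most constraining.
def same_birth_year(a, b):
    return a["year_of_birth"] == b["year_of_birth"]


def same_zip3(a, b):
    return a["zip3"] == b["zip3"]
```

Keeping the result of every pass makes it easy to report how many candidates each condition removed.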
There are two steps in the linkage process. The first step, called ‘Filter one’, uses the blocking method, which essentially tries to target the largest number of potential matches using the fewest conditions. There are five general parameters used in ‘Filter one’:
- Gender
- Visit Start Date
- Visit End Date
- Year of Birth
- First 3 Digits of Zip Code
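A minimal sketch of blocking on these parameters is shown below. It pairs every EMR record with every Claims record that agrees on all blocking fields: gender, visit start date, visit end date, year of birth and three-digit zip code. The field names and the dictionary-based record layout are assumptions made for illustration, not the implementation used in the study.

```python
from collections import defaultdict

# Blocking key for 'Filter one'. The field names are illustrative assumptions.
BLOCKING_FIELDS = (
    "gender",
    "visit_start_date",
    "visit_end_date",
    "year_of_birth",
    "zip3",
)


def blocking_key(record, fields=BLOCKING_FIELDS):
    """Build the tuple of values a record must share with another to be paired."""
    return tuple(record[field] for field in fields)


def filter_one(emr_records, claims_records, fields=BLOCKING_FIELDS):
    """Return (emr, claim) candidate pairs that agree on every blocking field."""
    claims_by_key = defaultdict(list)
    for claim in claims_records:
        claims_by_key[blocking_key(claim, fields)].append(claim)

    pairs = []
    for emr in emr_records:
        for claim in claims_by_key.get(blocking_key(emr, fields), []):
            pairs.append((emr, claim))
    return pairs
```

Indexing one side by its key avoids comparing every EMR record against every Claims record. Note that one EMR record can pair with several Claims records, which is why a second, more specific filter is needed.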
In our raw data set, a total of 2,411,027 rows of records were found overlapping between the EMR and the Claims records for the same patient ID, irrespective of other parameters, with approximately 9,500 unique overlapping patient IDs. After implementing ‘Filter one’, our patient ID pool fell to 3,657 distinct overlapping patients, with 27,254 rows of records.
However, the accuracy of these results is decidedly low. A sample result would look like this:
* A male patient born in 1974 whose visit start date is 15/04/2012 and whose visit end date is 19/04/2012 in the zip code 159###
While the example above is relatively specific, it is not 100% accurate: a particular search may return three or more candidate patients, when in real life one person cannot be three different people.
‘Filter one’ produced our final pool of potential patient matches, those satisfying the minimum conditions required to create a match with the data provided. This pool serves as our denominator when calculating the accuracy of the algorithm. We then implement ‘Filter two’, which is slightly more specific. ‘Filter two’ cross-references a potentially matching patient across the EMR and the Claims data and checks the number of diagnosis codes that match.
While we are unable to improve the overall number of matches (the number will never exceed the number of patients in the pool left after ‘Filter one’), we can improve the accuracy of results within the pool. At this stage our accuracy is only about 50%-60%, which is approximately 2,000 correctly matched patients; however, once we implement ‘Filter two’, the results change drastically.
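The diagnosis-code check in ‘Filter two’ can be sketched as a set intersection per candidate pair. The function names, the ID-to-codes dictionaries and the sample codes in the usage are hypothetical; the default of one shared code is an assumption of this sketch.

```python
def shared_diagnosis_count(emr_codes, claims_codes):
    """Count distinct diagnosis codes present in both sources."""
    return len(set(emr_codes) & set(claims_codes))


def filter_two(candidate_pairs, emr_diagnoses, claims_diagnoses, min_shared=1):
    """Keep candidate pairs whose patients share at least `min_shared` codes.

    `candidate_pairs` holds (emr_patient_id, claims_patient_id) tuples;
    the two dictionaries map a patient ID to its diagnosis codes.
    """
    kept = []
    for emr_id, claims_id in candidate_pairs:
        shared = shared_diagnosis_count(
            emr_diagnoses.get(emr_id, ()), claims_diagnoses.get(claims_id, ())
        )
        if shared >= min_shared:
            kept.append((emr_id, claims_id, shared))
    return kept
```

For example, with `emr_diagnoses = {"E1": ["E11.9", "I10"]}` and `claims_diagnoses = {"C1": ["I10"], "C2": ["J45.909"]}`, the pair `("E1", "C1")` is kept with one shared code while `("E1", "C2")` is dropped. Returning the shared count also allows candidates for the same patient to be ranked afterwards.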
The table below shows what the results would look like for filter two.
After cross-referencing the patients matched by ‘Filter one’ against those with at least one matching diagnosis code across Claims and EMR, we arrive at our final result of 3,296 patient matches, of which around 80% were true matches. Our number of correctly matched patients increased from 2,000 to 2,500!
The breakdown of matches is shown in the table below:
This number may seem like a small percentage of our original pool of 27,000 patients and about 9,000 matches. However, the number we actually consider as our original pool is the number of patients with matches after passing through ‘Filter one’, as that was the minimum requirement for matching patients. While we may have missed about 6,000 other true matches, the data provided simply would not have allowed us to match those patients, even though they were the same people.
The funnel chart below details the entire process of linkage from our original pool to the final results.
Variable Impact Analysis
Considering the above study, an important question that arises is: which variables played the most important roles?
We therefore conducted an impact analysis to determine the effect each variable had on the accuracy and number of matches. The process was to run ‘Filter one’ (Gender, Start Date, End Date, Year of Birth and Zip code) four times, each time leaving one variable out. Year of birth was never left out, because without it matches would be near impossible.
The percentage per variable is the number of true record matches divided by the total number of “matches” produced by the modified ‘Filter one’.
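The leave-one-out procedure and the accuracy calculation can be sketched as below. The field names are assumptions for illustration; the helper only generates the reduced field sets and computes the percentage, and would need to be combined with whatever blocking routine and ground truth are available.

```python
def match_accuracy(true_matches, total_matches):
    """Share of reported matches that are true matches, as a percentage."""
    if total_matches == 0:
        return 0.0
    return 100.0 * true_matches / total_matches


def leave_one_out_field_sets(fields, always_keep=()):
    """Yield (dropped_field, remaining_fields) for each field that may be dropped."""
    for dropped in fields:
        if dropped in always_keep:
            continue
        yield dropped, tuple(f for f in fields if f != dropped)
```

With five blocking fields and year of birth always kept, this yields four reduced field sets, one per excluded variable.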
The results are shown in the table below:
This table shows that without gender as one of our variables, the accuracy of the results would have been 38.5%. Start date and end date each seem least important when removed separately, but the key point is that having at least one of the two is very important to the algorithm. Finally, the lowest accuracy, 0.3%, was found when the First Three Digits of Zip Code variable was excluded. This shows that, of the four variables considered above, Zip code had the greatest impact on the accuracy of the linkage algorithm.
Additionally, different data sets provide varied information. Fields such as NPI, Physician Specialty or Diagnosis Related Groups, had they been present in our data set, would have significantly improved the results.
Summary and Conclusion
The deterministic algorithm used could have been further tuned to improve results based on the data available. In this case, however, running the algorithm yields about 3,300 patient matches, of which 80% would be actual matches. Essentially, four out of every five matches would be the same patient, while the fifth would just be a very similar case!