Healthcare Data Issues – Part 1 (Internal Data Modelling)

A series of articles penned by Dr. Navin Ramachandran, Dr. Wai Keong Wong and Dr. Ian McNicoll. We examine the current approaches to data modelling which have led to the issues we face with interoperability.

The other parts of the article can be found here: Part 2, Part 3.

The current healthcare data landscape is incredibly complex, with great variations between institutions, and even between systems within a single institution. This means that healthcare data can be incredibly difficult to collect and share. Here we explore the underlying problems.


Common Problematic Approaches To Medical Data Models And Storage

The major obstacles to efficient sharing of health information are that:

  • Data is stored in multiple different systems using many different technology stacks. While this fosters innovation and competition, it does of course make data sharing difficult.
  • Many companies benefit commercially from the lack of interoperability of their systems. Data is generally “locked-in” to a particular vendor solution.
  • Even where companies are minded to adopt an “open data” policy, the current technical and regulatory environment makes interoperability highly demanding.
  • Even if interconnectivity is resolved, the variable information models underlying the data storage severely limit the richness of transmitted data, since complex and clinically-risky transforms are required.


Multiplicity of information systems within a hospital

Within a single hospital, there are usually multiple systems which store data including:

  • PAS – patient administration system.
  • RIS – radiology information systems.
  • Pathology systems.
  • TomCat – Cardiology system.
  • Cancer reporting and management systems.

At some hospitals, this is extreme. For example, at University College London Hospitals NHS Trust, it is estimated that there are currently over 300 different such systems, some user-created (eg with MS Access). This is rightly seen by the IT department as an unsustainable long-term solution. But it reflects the inability of the main patient record system to capture the nuanced data that these different teams require.

Even in hospitals that have installed a large megasuite EHR such as Epic or Cerner, we often find multiple other systems being run concurrently (eg cancer management systems separate to Cerner EHR at The Royal Free Hospital). This is because the documentation accuracy of smaller “best of breed” solutions is often better than the equivalent module of the monolithic solution.

The continued development of the smaller systems tends to be more agile and reactive to the users’ needs, especially at a local level, than that of the monolithic system. The advantage of the monolithic system, however, is (at least theoretically) a more consistent data model across the whole institution, allowing easier sharing of data. The “EHR” in this context has come to mean an all-encompassing solution, providing the varied application needs of a wide variety of users, along with a coherent data platform, all provided by a single vendor.

Separating the data layer from the applications layer would allow us to maintain a consistent data model, while permitting more reactive / agile development of applications locally. A significant proviso is that the information models within the data layer must also be amenable to agile development, to meet this faster demand-cycle. Indeed new data models should be developed, adapted and incorporated into the data platform without any need to re-program the underlying database. New or updated data models must be implemented as a configuration change, not as a piece of new software development.
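As a rough illustration of the "configuration, not code" idea above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `MODELS` registry, the `STORE` list and the `record` function are hypothetical names, not part of any real platform, and a real data layer would need far richer models than a flat set of typed fields.

```python
from typing import Any

# Hypothetical model definitions held as configuration (they could equally be
# loaded from JSON or YAML files); none of these names come from a real product.
MODELS: dict[str, dict[str, type]] = {
    "blood_pressure": {"systolic": int, "diastolic": int},
}

# Stand-in for a generic, model-agnostic data store.
STORE: list[dict[str, Any]] = []


def record(model_name: str, values: dict[str, Any]) -> dict[str, Any]:
    """Validate values against the configured model, then store them."""
    model = MODELS[model_name]
    if set(values) != set(model):
        raise ValueError(f"{model_name}: expected fields {sorted(model)}")
    for field, expected in model.items():
        if not isinstance(values[field], expected):
            raise TypeError(f"{model_name}.{field} must be {expected.__name__}")
    entry = {"model": model_name, **values}
    STORE.append(entry)
    return entry


# Adding a new model is a configuration change: no new table, no new code path.
MODELS["temperature"] = {"celsius": float}

record("blood_pressure", {"systolic": 120, "diastolic": 80})
record("temperature", {"celsius": 37.2})
```

The storage and validation code never changes when a model is added or revised; only the entries in the model registry do.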

Inconsistent data models within a single institution

There are usually multiple different systems running concurrently within a hospital. These multiple systems each have their own data schema and information models. They may contain multiple tables which themselves may not be standardised in their nomenclature, even within a single system.

For example, here is an extremely abbreviated dataset for a few of the many different systems at one theoretical hospital. Vertical columns contain fields for data collected in a system, the “ID” (pink) being the primary key:

Hosp 1 schema

If we were looking to extract the blood pressure (yellow) and temperature (red) readings, we can immediately see the difficulty, as the different systems often have different standards for field nomenclature and formatting:

  • Many different names for fields containing the same information type. eg “Temp” vs “Temperature”
  • Different data units / formatting. eg “BP = 120/80” vs “SBP=120 and DBP=80”; “XXX XXX XXX” vs “XXX-XXX-XXX”.

The data is often stored in multiple tables, with minimal relations / meanings established (usually only the primary key establishes any relationships or meaning). Therefore there is no way of addressing data in a field without knowing the exact path (table and field name).
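To make the extraction problem concrete, the following Python sketch reads blood pressure and temperature from two hypothetical systems. The row contents and the function names are assumptions made for illustration; only the field names ("BP", "SBP", "DBP", "Temp", "Temperature") are taken from the examples above.

```python
import re

# Hypothetical rows from two systems in the same hospital; the field names
# mirror the examples in the text ("BP" vs "SBP"/"DBP", "Temp" vs "Temperature").
ward_row = {"ID": 1, "BP": "120/80", "Temp": 37.2}
clinic_row = {"ID": 1, "SBP": 120, "DBP": 80, "Temperature": 37.2}


def blood_pressure(row: dict) -> tuple[int, int]:
    """Return (systolic, diastolic) whichever way the source system stores it."""
    if "BP" in row:
        match = re.fullmatch(r"\s*(\d+)\s*/\s*(\d+)\s*", str(row["BP"]))
        if match is None:
            raise ValueError(f"unrecognised BP format: {row['BP']!r}")
        return int(match.group(1)), int(match.group(2))
    return int(row["SBP"]), int(row["DBP"])


def temperature(row: dict) -> float:
    """Look the reading up under each known field name in turn."""
    for name in ("Temp", "Temperature"):
        if name in row:
            return float(row[name])
    raise KeyError("no temperature field found")
```

Every additional system means another branch or another field name in these functions, and nothing in the data itself says which unit a temperature is recorded in. That knowledge lives only in the code and in the heads of the people who wrote it.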

Inconsistent data models across institutions

When we are considering transfer of data between institutions, the situation is even more complex. For example, these are theoretical datasets for 2 different hospitals:

Hosp 1 schema

Hosp 2 schema

This illustrates the following issues that we may encounter:

  • Tables may:
    • Be present in some hospitals / absent in others.
    • Have different names in different institutions.
    • Contain different fields in different institutions.
    • Contain fields with different formatting (eg XXX XXX XXX ; XXX-XXX-XXX).
    • Have different content or terminology bindings (eg ICD-10 vs SNOMED-CT).
  • Some specialities may not be present at some institutions, and therefore not have any corresponding data.
  • Even the same vendor often has different information models / data schema at different institutions – the core data such as demographics will likely be the same, but more personalised data for each department are often configured during deployment and personalised to the department.

Therefore again there is no way of addressing data in a field without knowing the exact path (table and field name). Furthermore, when this information is extracted and transferred to the next institution, it becomes difficult to map it to the correct fields on the second system.
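The dependence on exact paths can be sketched as a per-institution lookup table. This is a minimal Python illustration under stated assumptions: the hospital names, table names, field names and the `lookup` function are all hypothetical, and the toy databases are plain dictionaries rather than real systems.

```python
from typing import Optional

# Hypothetical (table, field) paths for the same concepts at two hospitals.
# A value of None records that the hospital holds no such data.
PATHS = {
    "hospital_1": {
        "temperature": ("Observations", "Temp"),
        "diagnosis": ("Diagnoses", "ICD10"),
    },
    "hospital_2": {
        "temperature": ("Vitals", "Temperature"),
        "diagnosis": None,
    },
}


def lookup(hospital: str, database: dict, concept: str, patient_id: int) -> Optional[object]:
    """Fetch a concept for a patient by following the hospital-specific path."""
    path = PATHS[hospital].get(concept)
    if path is None:
        return None  # table or speciality absent at this institution
    table, field = path
    row = database.get(table, {}).get(patient_id, {})
    return row.get(field)


# Toy databases keyed by table name, then by patient ID.
db_1 = {"Observations": {7: {"Temp": 37.2}}, "Diagnoses": {7: {"ICD10": "I10"}}}
db_2 = {"Vitals": {7: {"Temperature": 37.2}}}
```

A table like `PATHS` has to be written and maintained by hand for every pair of systems, and it only aligns locations. Differences in formatting or terminology binding (eg ICD-10 vs SNOMED-CT) would still need a separate translation step.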

Structured vs unstructured data

We have seen above how difficult it is to extract structured meaningful data from one system and then map it correctly to another system, due to the poor information modelling that is pervasive throughout healthcare.

In practice this structured transfer of information is nigh on impossible, and therefore the lowest common denominator is often used – a wholesale data dump of unstructured information. This is usually in the form of a PDF file, or even copy-pasted information into a spreadsheet / word-processed document.

Whilst this unstructured method can convey all the information required, it has very significant problems:

  • The most important information is often hidden amongst numerous pages of irrelevant data. This invariably leads to the information being incompletely processed or even ignored. Imagine a doctor reading through a hundred pages of a PDF during a 15 minute consultation – this time would be better spent speaking to and examining the patient.
  • This data is very difficult to computationally analyse.
  • The copy-paste method is prone to significant error.

We must acknowledge that unstructured data itself can have its uses. Most recorded clinical notes are saved in an unstructured form, and this can help to convey other features of an encounter that are difficult to codify. For example, we may record the fact that the patient looks more dishevelled than usual at an appointment, which may be a marker for a significant change in mood in a patient with depression.

However these notes are often filled in using unique shorthand, which is different at each site, specific to subspecialties, and cannot easily be interpreted, even by sophisticated natural language processing algorithms. See the example below in yellow. This is typical of the recorded outcome from a cancer meeting regularly attended by one of the authors – it would make sense to the small group of people who attend the meeting, but very few people outside it.

Unstructured data

How did we get to the current situation?

Traditionally, most doctors and medical staff have worked with a paper medical record. Indeed the majority of clinical records worldwide are still recorded on paper. And though most hospitals within developed countries have access to at least a basic electronic record, paper entry still plays a large role.

Our early attempts at producing electronic models have been basic at best. In most cases, there has been no business process modelling, and the data schema are literally electronic copies of paper forms that have been used clinically.

Most importantly there are no true information models underlying the data schema, to give them intrinsic meaning. This is a critical aspect that has been ignored in most systems, and we will discuss this fully later.

These basic schema have usually been expressed using the relatively limited functionality of relational databases where “the table is the data model”. When clinical pathways change over time, we are required to change the underlying data schema, but this can often break the relations between the tables and therefore any changes can take a lot of time to implement (ie we have lost agility). Some leading systems use NoSQL solutions such as MUMPS (Epic, InterSystems, VA VistA) which are more flexible, but the data models remain locked in to each system and the specific NoSQL database.

To cover the inadequacies / inefficiencies of these systems, many APIs and messaging standards have been developed, but we feel that they can only ever partly overcome the underlying problems. If the core infrastructure is inadequate, there is only so much patching you can do before you reach the limits of the platform.

To compound these problems, every institution / vendor has taken slightly different approaches to modelling their systems. The reasons for this are complex:

  • Some companies / institutions want to lock in their data to maintain competitive advantage. For example in the US, there is growing frustration that the big EHR vendors have built proprietary networks using public (HITECH) funding, and are charging exorbitant fees to build interfaces to access them.
  • The demands of local project timescales often make broad collaboration difficult.
  • Clinicians often exert considerable local power and insist on doing things “their way”. Clinical evidence is often lacking for such variation. They are also often not aware of such local variance.
  • Different branches of the profession may have quite different information granularity needs. For example the “Family history of breast cancer” will be recorded at different levels of depth in a patient-held record, a family doctor record and the records of a geneticist or research institution. For the information to flow between the systems, the models supporting these needs have to be aligned.
  • Recent changes in practice and technology mean that large volumes of data (“-omics” data such as a highly complex genetic profile – genomics) may need to be consumed and displayed to the patient. Therefore centres with the ability to process “omics” data now need the further ability to support a “big data” platform.

We are still relatively early in the transition from paper to a fully electronic record and, we believe, at a crucial point. The next steps will decide who controls the platforms and how easy it will be to innovate.

We also draw parallels to other systems that have already undergone this process. Within the medical field, PACS systems in medical imaging previously used proprietary models for their metadata. This metadata (including demographics and study details) is crucial to the study as it gives it meaning. It is only a few kilobytes in size per study and the total amount of metadata in a large PACS system will only be a few hundred gigabytes at the most.

As PACS systems have matured, we are now changing vendors and therefore migrating our data onto new systems. However due to the proprietary nature of the metadata, this has led to problematic corruption during transfer, which has resulted in significant clinical impact. Increasingly PACS platforms are required to be Vendor Neutral Archives (VNAs), based on standardised data and metadata formats to avoid vendor lock-in.

When EHR platforms have matured to the same point, and people are considering moving to a second vendor, this migration will be a lot more problematic as we will be dealing with petabytes or more of meaningful data. Therefore we believe that current data storage should be within a vendor neutral architecture. This critical aspect is often underestimated, as the deleterious effects will only be evident in the medium term.


Over-reliance on terminologies alone

Terminologies such as LOINC, RxNorm, dm+d and SNOMED CT play a vital role in allowing shared meaning to be passed between systems, and help to power features such as clinical decision support.

More recently, terminologies have started to incorporate some aspects of “ontology” as a means of embedding even richer meaning into terminology, allowing complex computational analysis. This approach works well where the ontology is describing entities in the biomedical domain e.g. diseases, organisms, bodily structures, medications. These are scientific concepts grounded in “truth”, at least as far as is understood by current science.

Unfortunately the enthusiasm of many ontology researchers for this powerful approach has led to some promoting its use as a panacea for all clinical information modelling. But it is ill-suited for the description of the working arrangements, culturally and nationally determined clinical processes, and sheer human variation inherent in clinical practice, which largely evades logical analysis. In this regard the conceptualisation of medical practice has more in common with political or religious classification: ever-changing, nuanced and contentious. There are significant parallels with Clay Shirky’s critical analysis of the Semantic Web.

In addition, the development of logical ontologies requires a deep understanding of description logics, and of tooling and trains of thought, that are well beyond the expertise of average clinicians, system developers, or even many experienced clinical informaticians. Even if logics provided the answer, we simply do not have enough clinical logicians to build and maintain the required ontologies.

Current clinical informatics thinking has retreated from attempts to apply ontology wholesale – preferring instead to find ways of making optimal use of terminologies and ontologies within the context of conventional information models. Over time, ontological methods will undoubtedly grow in significance and ease of use, but an over-concentration on this approach during the past decade has been to the detriment of achieving practical healthcare interoperability.
