Author: Navin Ramachandran

Breaking A Large Mindmap Into Smaller Mindmaps For Analysis (XMind)

This post expands on previous posts (see summary here) regarding the use of XMind for modelling data with openEHR. The aim is to document our processes in an open and reproducible manner.
[Header image courtesy of Lesekreis (own work), CC0, via Wikimedia Commons]

 

In the first step of our project, we collated all the different datapoints required for different registries and operational systems (we shall call each registry or system a “section” of the mindmap from now on). Many datapoints were duplicated across different sections, so we cross-referenced them to identify the overlap. The next step is to use that information to identify the unique datapoints we need to capture. As the main mindmap is usually very large and cumbersome at this point, we break it up into separate sub-mindmaps for analysis. This can be challenging as:

  • XMind does not support synchronisation between different mindmaps. Therefore if you make a change to any datapoint, you must ensure that the change is made to every mindmap in which the datapoint exists.
  • Multiple team members may be working on the mindmaps. Therefore if changes need to be made to the mindmaps, there should ideally be one person overseeing the changes to prevent multiple forks of the mindmaps.
  • We should ideally ensure that every datapoint has been copied across to the sub-mindmaps at the end of this process, which can be tricky when there are hundreds of datapoints. To facilitate this, we place a marker (aka “Modelling has started”) on each datapoint of the main mindmap when it has been copied over to a sub-mindmap. Towards the end of the modelling process we can search the main mindmap for any datapoints which have not been marked in this way, and therefore are yet to be modelled. (A sketch of how this check might be automated follows this list.)
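Because XMind’s classic file format is simply a zip archive containing an XML document, this final audit can be scripted rather than done by eye. Below is a minimal sketch, assuming the classic content.xml format; the marker id is hypothetical and would need to match the id of your custom “Modelling has started” marker.

```python
# A minimal sketch: list datapoints in the main mindmap that do not yet
# carry the "Modelling has started" marker. Assumes the classic XMind
# format (a .xmind file is a zip archive containing content.xml); the
# marker id below is hypothetical - replace it with your custom marker's id.
import zipfile
import xml.etree.ElementTree as ET

NS = {"x": "urn:xmind:xmap:xmlns:content:2.0"}
MODELLING_STARTED = "openEHR-modelling-started"  # hypothetical marker id

def unmarked_leaves(path):
    """Yield titles of leaf topics (datapoints) lacking the marker."""
    with zipfile.ZipFile(path) as z:
        root = ET.fromstring(z.read("content.xml"))
    for topic in root.iter("{urn:xmind:xmap:xmlns:content:2.0}topic"):
        if topic.find("x:children", NS) is not None:
            continue  # not a leaf node, so not a datapoint
        markers = topic.findall("x:marker-refs/x:marker-ref", NS)
        if MODELLING_STARTED not in {m.get("marker-id") for m in markers}:
            title = topic.find("x:title", NS)
            yield title.text if title is not None else "(untitled)"

if __name__ == "__main__":
    for name in unmarked_leaves("main-mindmap.xmind"):
        print(name)
```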

 

After much discussion we agreed the following rules for breaking down the main mindmap into the sub-mindmaps.

  1. Decide on the initial categories for the sub-mindmaps e.g. Diagnosis, Pathology, Imaging, Demographics.

  2. Copy across the relevant datapoints to the sub-mindmap. It is useful to keep the datapoints from different sections on the main mindmap (eg COSD, Genomics) separated under similarly-named sections on the sub-mindmap.
  3. In the main mindmap (not the sub-mindmap), place the “Modelling has started” marker on each datapoint that has been copied over to a sub-mindmap.
  4. When we believe we have copied across all the relevant items to a sub-mindmap, we can begin the analysis step.
    Begin from the top of the sub-mindmap and examine each datapoint. Using the labels and notes we can identify the duplicate datapoints that should have also been transferred from the other sections.
    If one of the duplicate datapoints has not been transferred to the sub-mindmap, copy it across at this point and make sure it is marked as “Modelling has started” on the main mindmap. We then need to remove all the duplicates from the sub-mindmap, so we are only dealing with unique datapoints.
    It is good practice to nominate one section as your base dataset (eg COSD in cancer) – these datapoints will be preferentially preserved, and duplicates from other datasets removed. This will make it slightly easier to keep track of all the datapoints in the future. (A sketch of this merge step follows this list.)

  5. We should be aware that there may have been errors in data synchronisation during the previous steps of mindmapping, so we must be very careful when we delete duplicate datapoints. We should examine all the labels and notes of all the duplicate datapoints, and make sure all unique information is copied across to the datapoint we keep. This information (eg mappings to datapoints in different datasets, ordinality, issues to be resolved) is very useful for archetyping and future data mapping, so it is vital that it is not inadvertently lost.
    The notes should include all the registry codes to which the datapoint maps, including the name of the parent datapoint. eg in the COSD section, the datapoint name itself will contain the COSD code, but we should ensure that the attached notes also list the COSD code, for completeness.

  6. Once we have processed a datapoint as above (consolidated information from, and then deleted, the duplicate datapoints), we mark the datapoint with a blue flag to show that we have completed the analysis phase.
    But if there are any questions to resolve (eg regarding possible mappings, or perhaps even possible errors detected) we mark it with a red flag.

  7. If we are unsure whether one of the duplicate fields is truly a duplicate, do not delete it; mark it with a red flag too. Eg if there are 4 definitely identical datapoints, and 2 further possible (but not definite) duplicates:
    – Keep one of the “definites” and the 2 “possibles”. Mark these with red flags.
    – Delete the other 3 datapoints (the redundant “definites”).

  8. After you have formed your main sub-mindmaps, go back and check that all datapoints on the main mindmap have the “Modelling has started” marker. If not, they have not yet been processed.
    If an unprocessed datapoint does not fit any of the categories of the existing sub-mindmaps, a new sub-mindmap (e.g. “Miscellaneous”) may be needed to house these orphan datapoints.
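The merge logic in steps 4 and 5 can be expressed compactly in code. Below is a minimal sketch of the rule (keep the base-dataset copy, fold the tags and notes of the duplicates into it so no mapping information is lost); the datapoint structure is our own illustration, not an XMind API.

```python
# A minimal sketch of the deduplication rule: group datapoints that are
# duplicates, keep the one from the nominated base dataset (eg COSD),
# and merge the labels and notes of the others into it.
from collections import defaultdict

BASE_SECTION = "COSD"  # nominated base dataset

def merge_duplicates(datapoints):
    """datapoints: list of dicts with 'name', 'section', 'tags', 'notes'."""
    groups = defaultdict(list)
    for dp in datapoints:
        # Assumption: duplicates share a normalised name; in practice the
        # grouping key would come from the cross-referencing step.
        groups[dp["name"].lower()].append(dp)

    merged = []
    for dups in groups.values():
        # Preferentially keep the datapoint from the base dataset.
        keep = next((d for d in dups if d["section"] == BASE_SECTION), dups[0])
        for d in dups:
            if d is keep:
                continue
            keep["tags"] |= d["tags"]                # union of dataset tags
            if d["notes"] and d["notes"] not in keep["notes"]:
                keep["notes"] += " | " + d["notes"]  # preserve unique notes
        merged.append(keep)
    return merged

points = [
    {"name": "T category", "section": "COSD", "tags": {"COSDcore"},
     "notes": "CR0520 in COSD"},
    {"name": "T category", "section": "Genomics", "tags": {"Genomics"},
     "notes": "14944.1 in Genomics"},
]
print(merge_duplicates(points))
```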

 

This approach has helped us to begin tackling the large datasets. However, minor problems will undoubtedly arise; we document them below as they appear:

  • Some datapoints from different registries refer to the same data, but at different time points (eg “PSA at diagnosis” vs “PSA at start of treatment”). We would expect our longitudinal data model to handle both datapoints, using the same archetype. We should treat these as duplicate datapoints, but this timing information should be recorded in the notes section, as it will be useful for future mappings. We retain only one of the datapoints, but mark it with a red flag to indicate that the mapping will not be straightforward.
  • Some event metadata for items (eg “Event Date” in Genomics) can be difficult to map to other data points. We would expect this type of metadata to be handled by the reference model of openEHR, so we don’t dwell too long on this.

Healthcare Data Issues – Part 3 (Data Model vs Dataset)

A series of articles penned by Dr. Navin Ramachandran, Dr. Wai Keong Wong and Dr. Ian McNicoll. We examine the current approaches to data modelling which have led to the issues we face with interoperability.

The other parts of the article can be found here: Part 1, Part 2.

We have mentioned the terms “data model” and “dataset” many times. It is extremely important to understand the difference between the two terms as they have often been conflated, resulting in a lack of understanding of the importance of the data model. We believe that the absence of a proper data model in most medical systems has led to the current problems with interoperability.

The concept of the data model (vs a dataset) is a difficult one to grasp at first. Let’s look at the simpler concept of the dataset first. In a particular dataset, an organisation may stipulate the collection of blood pressure readings in the pre-surgery clinic. But these two values (systolic and diastolic) are just single readings, and only some of the variables that can be defined for blood pressure. The blood pressure data model should account for all the potential variables related to blood pressure and their variance with time. For example, the blood pressure archetype that has been defined on openEHR (a mini data model – link: http://openehr.org/ckm/#showArchetype_1013.1.130_MINDMAP ) looks like this on a mind map:

[Image: mind map of the openEHR blood pressure archetype]

The basic blood pressure dataset can easily be mapped to the systolic and diastolic components of the data model (top right of the mind map). Not all of the model’s components have to be used, but they are in place, ready for future expansion of the system.

If this data model is used across institutions, different systems may use different components of the model, but the relationships and meanings of each component would be defined from the outset, making transfer of information between systems easier.
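A minimal sketch of this distinction may help, with the archetype’s components paraphrased (these are not the exact openEHR node names): the model enumerates everything that could be captured, while a dataset is just a projection of a few components.

```python
# A minimal sketch of the dataset-vs-data-model distinction. The model
# lists the components the blood pressure archetype could capture
# (element names paraphrased, not exact archetype node ids); a dataset
# is a projection of a few of them.
BP_MODEL = {
    "systolic": "Quantity (mmHg)",
    "diastolic": "Quantity (mmHg)",
    "mean_arterial_pressure": "Quantity (mmHg)",
    "position": "Coded text (standing / sitting / lying ...)",
    "cuff_size": "Coded text",
    "location_of_measurement": "Coded text",
    "time": "Point in time (events may repeat)",
}

def project(model, fields):
    """A dataset selects (projects) a subset of the model's components."""
    return {f: model[f] for f in fields}

# The pre-surgery clinic dataset uses only two components of the model;
# the rest remain defined and ready for future expansion.
pre_surgery_dataset = project(BP_MODEL, ["systolic", "diastolic"])
print(pre_surgery_dataset)
```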

“The data model should account for all the potential variables and their variance with time.”

To look at it in reverse, the dataset is actually an expression of the underlying data model. A 2-dimensional representation of a 3- or 4-dimensional entity. A good analogy is that the dataset is like a photograph, representing only a snapshot of a whole being at one moment in time. The data model describes the whole being. The dataset is just one view of it.

 

[Image: zoopraxography animation of an athlete’s running long jump – a whole movement captured as a sequence of snapshots]

 

Healthcare Data Issues – Part 2 (Data Transfer)

A series of articles penned by Dr. Navin Ramachandran, Dr. Wai Keong Wong and Dr. Ian McNicoll. We examine the current approaches to data modelling which have led to the issues we face with interoperability. Main image from: Grandjean, Martin (2014). “La connaissance est un réseau”. Les Cahiers du Numérique 10 (3): 37–54. Retrieved 2014-10-15.

The other parts of the article can be found here: Part 1, Part 3.

The majority of current approaches to resolving data transfer issues revolve around either:

  • Replacing all the current systems in a region with a single vendor, using a single data model across the region. This circumvents the need for interoperability of systems.
  • Developing common datasets and messaging standards to maximise interoperability of systems.

Interoperability is a term that is often used in medical informatics, exalted as the answer to all our woes. However this interoperability can exist on a range from partial to complete. Unfortunately, we believe that even the most recent attempts at interoperability will only achieve partial success, and we explore the underlying reasons in this section.

 

Building a bigger silo

Also known as the monolithic model. In its most basic form, you would replace all the disparate systems in one centre with a single vendor solution:
[Image: a single vendor solution replacing all the disparate systems within one centre]
This has happened at many institutions, using solutions such as Cerner, Epic, Allscripts and Intersystems. However this does not resolve the issue of data transfer between hospitals or with community systems, therefore in some regions all the institutions have moved to the same vendor using a similar / identical data model:
[Image: all institutions in a region moved onto the same single vendor solution]
However this approach is ultimately problematic as:

  • It may be politically and technologically difficult to transition all the institutions to the same provider.
  • One system will not be able to satisfy the varied needs of all the different institutions. As we have identified previously, the documentation accuracy of smaller “best of breed” solutions is often better than the equivalent module of the monolithic solution, as they are more responsive to local needs. Therefore we often find multiple other smaller systems being run concurrently to the monolithic solution (eg cancer management systems separate to Cerner EHR at The Royal Free Hospital), which would break this monolithic model.
  • What happens outside this region? e.g. how do San Francisco’s hospitals communicate with Los Angeles? The complexity of the healthcare environment means that there are always more boundaries, and the logical end-point of the single-system approach is a single system for an entire nation.
  • How will this system speak to social care, police and government systems? This and the previous issue are partially overcome by using common datasets and messaging standards, but that brings in the added problems of that solution too (see below).
  • The adverse effects of monopoly. The power of the apps revolution is in innovation and “fail fast”.

“The complexity of the healthcare environment means that there are always more boundaries, and the logical end-point of the single-system approach is a single system for an entire nation.”

 

Using common datasets, messaging standards and APIs (eg using FHIR)

This method works in the following manner. A central body defines the dataset for a particular medical condition or procedure and publishes it. Each individual institution then implements the dataset within its system, usually by mapping it to existing fields in existing tables, or adding new fields.

[Image: a central dataset deployed to each institution’s system]

The central body then defines messaging datasets for transmission of data between organisations – this may be the same as the initial dataset or a subset of it. This messaging dataset defines how and what data will be transmitted and received in different scenarios (eg for a referral form, for a discharge summary).

For example the Royal College of Surgeons has released a dataset for The National Prostate Cancer Audit: http://www.npca.org.uk/wp-content/uploads/2013/07/NPCA_MDS-specification_-V2-0_-17-12-14.pdf

A very important fact to note is that this is only a small portion of all the data that will be collected during a patient’s care; the rest has yet to be defined properly, and will in fact naturally vary between hospitals. Therefore, if everything works well, the data which has been defined in the central dataset will be available for analysis and transfer. But the other peripheral data that has not been defined will never be meaningfully stored or transmitted.

This is the major problem with this model: only a very small dataset can be defined centrally – the lowest common denominator – which means that other potentially valuable data (which may provide great insight into causes of, or treatments for, the disease) will be lost to analysis.

Furthermore, many of these central bodies are not versed in informatics and are still very much thinking in the paper paradigm. Their datasets are often no more sophisticated than a digital representation of a paper form, and are not derived from an overarching data model. Sadly this leads to a situation where the paper form is the dataset, and the dataset is the data model!

“Only a very small dataset can be defined centrally – the lowest common denominator – which means that other potentially valuable data (which may provide great insight into causes of, or treatments for the disease) will be lost to analysis.”

Other significant drawbacks to this model are:

  • Loss of meaning at each stage.
    • When the central dataset is deployed locally, it has to be translated to the local data model. Then when it is transmitted, it is translated into the messaging model and then finally into the model of the receiving institution’s system. At each translation, there will likely be some loss of meaning of the data.

[Image: loss of meaning at each translation between the local, messaging and receiving models]

  • Updating pathways / datasets across hospitals, while maintaining functioning APIs / messaging standards.
    • An updated data model can take anywhere from 1 to 12 months to deploy in a hospital (this is the reality on the ground). This is because it has to be translated to the hospital systems’ data models, and any system interfaces and user interfaces updated, without breaking the underlying tables.
    • In the app world, there can be many app updates in a month. But if we very conservatively say that we make 2 updates to an information model in a year, some hospitals may not have deployed even the first change by the end of the year. Some hospitals would still be on version 1, others on version 2 and the most efficient on version 3. With 3 different models in operation at the same time at different hospitals, which common API / messaging standard would now be the correct one to use?

 

The emergence of the granular API – HL7 FHIR

For the last decade, cross-system interoperability in the US has largely focused on efforts to use the HL7v3 and CDA messaging standards. In spite of significant investment via the “Meaningful Use” program, useful interoperability has remained elusive, in part due to the complexity of the HL7v3 standard. In response, the HL7 community developed FHIR (Fast Healthcare Interoperability Resources), adopting modern software approaches such as granular APIs and a RESTful interface, in line with influential industry opinion such as the JASON report.

FHIR has found favour amongst most of the major US system vendors, represented by the Argonaut Project, and is without doubt a major step forward in reducing the barriers to data exchange between systems.

http://www.hl7.org/documentcenter/public_temp_EF13C1F7-1C23-BA17-0CB343BB6F370C99/pressreleases/HL7_PRESS_20141204.pdf
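To illustrate the granular RESTful style, here is a minimal sketch of a FHIR search over plain HTTP. The server is a public HAPI FHIR test endpoint and the patient id is purely illustrative; the Observation search parameters, LOINC blood pressure codes and Bundle structure are standard FHIR R4.

```python
# A minimal sketch of FHIR's granular RESTful style: each resource type
# is addressable and searchable over plain HTTP. The base URL is a
# public test server and the patient id is illustrative only.
import requests

base = "http://hapi.fhir.org/baseR4"  # illustrative public test server

# Search for blood pressure panel observations (LOINC 85354-9).
resp = requests.get(
    f"{base}/Observation",
    params={"patient": "example", "code": "http://loinc.org|85354-9"},
    headers={"Accept": "application/fhir+json"},
    timeout=30,
)
bundle = resp.json()

# FHIR search results come back as a Bundle of matching resources.
for entry in bundle.get("entry", []):
    obs = entry["resource"]
    for comp in obs.get("component", []):  # systolic / diastolic components
        coding = comp["code"]["coding"][0]
        value = comp.get("valueQuantity", {})
        print(coding.get("display"), value.get("value"), value.get("unit"))
```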

We strongly support the industry-led adoption of FHIR; nevertheless, in our view, the scope of FHIR falls short of the changes required to support a vibrant eHealth industry.

  • FHIR is expressly engineered to support data exchange, not data persistence and querying.
  • Support for each new FHIR resource needs to be programmed by each vendor, unlike openEHR, where new archetype models are consumed automatically (i.e. in openEHR, data conforming to these new models can be immediately stored and queried without any re-programming).
  • FHIR is expressly designed to meet only key, common interoperability areas. It does have a capacity for extension, but this remains immature and beyond the core scope of the project.
  • FHIR resources remain technical artefacts, whilst archetypes are designed to be authored and maintained by clinicians.

This assessment may seem to put FHIR directly in opposition to openEHR, but our view is that the two, each playing to its strengths, make for a compelling combination. openEHR can provide the methodology and toolset to build the clinical content definitions needed to underpin FHIR – this is the approach being used by NHS England.

“openEHR can provide the methodology and toolset to build the clinical content definitions needed to underpin FHIR.”

Healthcare Data Issues – Part 1 (Internal Data Modelling)

A series of articles penned by Dr. Navin Ramachandran, Dr. Wai Keong Wong and Dr. Ian McNicoll. We examine the current approaches to data modelling which have led to the issues we face with interoperability.

The other parts of the article can be found here: Part 2, Part 3.

The current healthcare data landscape is incredibly complex, with great variations between institutions, and even between systems within a single institution. This means that healthcare data can be incredibly difficult to collect and share. Here we explore the underlying problems.

 

Common Problematic Approaches To Medical Data Models And Storage

The major obstacles to efficient sharing of health information are that:

  • Data is stored in multiple different systems using many different technology stacks. While this fosters innovation and competition, it does of course make data sharing difficult.
  • Many companies benefit commercially from the lack of interoperability of their systems. Data is generally “locked-in” to a particular vendor solution.
  • Even where companies are minded to adopt an “open data” policy, the current technical and regulatory environment makes interoperability highly demanding.
  • Even if interconnectivity is resolved, the variable information models underlying the data storage severely limit the richness of transmitted data, since complex and clinically-risky transforms are required.

 

Multiplicity of information systems within a hospital

Within a single hospital, there are usually multiple systems which store data including:

  • PAS – patient administration system.
  • RIS – radiology information systems.
  • Pathology systems.
  • TomCat – Cardiology system.
  • Cancer reporting and management systems.

At some hospitals, this is extreme. For example, at University College London Hospitals NHS Trust, it is estimated that there are currently over 300 different such systems, some user-created (eg with MS Access). This is rightly seen by the IT department as an unsustainable long-term solution. But it reflects the inability of the main patient record system to capture the nuanced data that these different teams require.

Even in hospitals that have installed a large megasuite EHR such as Epic or Cerner, we often find multiple other systems being run concurrently (eg cancer management systems separate to Cerner EHR at The Royal Free Hospital). This is because the documentation accuracy of smaller “best of breed” solutions is often better than the equivalent module of the monolithic solution:

http://www.informationweek.com/healthcare/electronic-health-records/physicians-prefer-best-of-breed-emergency-department-ehrs/d/d-id/1108610

The continued development of the smaller systems tends to be more agile and reactive to users’ needs, especially at a local level, than that of the monolithic system. The advantage of the monolithic system, however, is, at least theoretically, the (more) consistent data model across the whole institution, allowing easier sharing of data. The “EHR” in this context has come to mean an all-encompassing solution, providing the varied application needs of a wide variety of users, along with a coherent data platform, all provided by a single vendor.

“Separating the data layer from the applications layer would allow us to maintain a consistent data model, while permitting more reactive / agile development of applications locally.”

Separating the data layer from the applications layer would allow us to maintain a consistent data model, while permitting more reactive / agile development of applications locally. A significant proviso is that the information models within the data layer must also be amenable to agile development, to meet this faster demand-cycle. Indeed new data models should be developed, adapted and incorporated into the data platform without any need to re-program the underlying database. New or updated data models must be implemented as a configuration change, not as a piece of new software development.

Inconsistent data models within a single institution

There are usually multiple different systems running concurrently within a hospital. These systems each have their own data schema and information models. They may contain multiple tables which themselves may not be standardised in their nomenclature across the single system.

For example, here is an extremely abbreviated dataset for a few of the many different systems at one theoretical hospital. Vertical columns contain fields for data collected in a system, the “ID” (pink) being the primary key:

[Image: abbreviated schema for a few of the systems at Hospital 1]

If we were looking to extract the blood pressure (yellow) and temperature (red) readings, we can immediately see the difficulty, as the different systems often have different standards for field nomenclature and formatting:

  • Many different names for fields containing the same information type, eg “Temp” vs “Temperature”.
  • Different data units / formatting, eg “BP = 120/80” vs “SBP=120 and DBP=80”; “XXX XXX XXX” vs “XXX-XXX-XXX”.

The data is often stored in multiple tables, with minimal relations / meanings established (usually only the primary key establishes any relationships or meaning). Therefore there is no way of addressing data in a field without knowing the exact path (table and field name).
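A minimal sketch of what this mapping burden looks like in practice: without a shared information model, every extract needs hand-maintained, per-system rules for field names and value formats. The aliases and formats below are illustrative.

```python
# A minimal sketch of the per-system mapping rules that have to be
# hand-written and maintained when there is no shared information model.
# The field names and regex are illustrative.
import re

# Per-system knowledge that has to be maintained by hand.
FIELD_ALIASES = {"Temp": "temperature", "Temperature": "temperature",
                 "BP": "blood_pressure", "SBP": "systolic", "DBP": "diastolic"}

def normalise(record):
    out = {}
    for field, value in record.items():
        name = FIELD_ALIASES.get(field, field)
        if name == "blood_pressure":
            # One system stores "120/80" in a single field...
            m = re.match(r"\s*(\d+)\s*/\s*(\d+)\s*$", str(value))
            if m:
                out["systolic"], out["diastolic"] = map(int, m.groups())
                continue
        out[name] = value
    return out

# ...another stores systolic and diastolic separately.
print(normalise({"Temp": 37.2, "BP": "120/80"}))
print(normalise({"Temperature": 37.2, "SBP": 120, "DBP": 80}))
```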

Inconsistent data models across institutions

When we are considering transfer of data between institutions, the situation is even more complex. For example, these are theoretical datasets for 2 different hospitals:
[Images: abbreviated schemas for Hospital 1 and Hospital 2]

This illustrates the following issues that we may encounter:

  • Tables may:
    • Be present in some hospitals and absent in others.
    • Have different names in different institutions.
    • Contain different fields in different institutions.
    • Contain fields with different formatting (eg XXX XXX XXX ; XXX-XXX-XXX).
    • Have different content or terminology bindings (eg ICD-10 vs SNOMED-CT).
  • Some specialities may not be present at some institutions, and therefore not have any corresponding data.
  • Even the same vendor often has different information models / data schema at different institutions – the core data such as demographics will likely be the same, but the more specialised data for each department are often configured during deployment and personalised to the department.

Therefore again there is no way of addressing data in a field without knowing the exact path (table and field name). Furthermore, when this information is extracted and transferred to the next institution, it becomes difficult to map it to the correct fields on the second system.

Structured vs unstructured data

We have seen above how difficult it is to extract structured meaningful data from one system and then map it correctly to another system, due to the poor information modelling that is pervasive throughout healthcare.

In practice this structured transfer of information is nigh on impossible, and therefore the lowest common denominator is often used – a wholesale data dump of unstructured information. This is usually in the form of a PDF file, or even copy-pasted information into a spreadsheet / word-processed document.

Whilst this unstructured method can convey all the information required, it has very significant problems:

  • The most important information is often hidden amongst numerous pages of irrelevant data. This invariably leads to the information being incompletely processed or even ignored. Imagine a doctor parsing a hundred pages of a PDF during a 15 minute consultation – this time would be better spent speaking to and examining the patient.
  • This data is very difficult to computationally analyse.
  • The copy-paste method is prone to significant error.

Note:
We must acknowledge that unstructured data itself can have its uses. Most recorded clinical notes are saved in an unstructured form, and this can help to convey other features of an encounter that are difficult to codify. For example, we may record the fact that the patient looks more dishevelled than usual at an appointment, which may be a marker for a significant change in mood in a patient with depression.

However these notes are often filled in using unique shorthand, which is different at each site, specific to subspecialties, and cannot easily be interpreted, even by sophisticated natural language processing algorithms. See the example below in yellow. This is typical of the recorded outcome from a cancer meeting regularly attended by one of the authors – it would make sense to the small group of people who attend the meeting, but very few people outside it.

[Image: example of shorthand-filled unstructured notes from a cancer meeting]

How did we get to the current situation?

Traditionally, most doctors and medical staff have worked with a paper medical record. Indeed the majority of clinical records worldwide are still recorded on paper. And though most hospitals within developed countries have access to at least a basic electronic record, paper entry still plays a large role.

Our early attempts at producing electronic models have been basic at best. In most cases, there has been no business process modelling, and the data schema are literally electronic copies of paper forms that have been used clinically.

“If the core infrastructure is inadequate, there is only so much patching you can do before you reach the limits of the platform.”

Most importantly there are no true information models underlying the data schema, to give them intrinsic meaning. This is a critical aspect that has been ignored in most systems, and we will discuss this fully later.

These basic schema have usually been expressed using the relatively limited functionality of relational databases where “the table is the data model”. When clinical pathways change over time, we are required to change the underlying data schema, but this can often break the relations in the table and therefore any changes can take a lot of time to implement (ie we have lost agility). Some leading systems use NoSQL solutions such as MUMPS (Epic, Intersystems, VA VISTA) which are more flexible, but the data models remain locked in to each system and the specific NoSQL database.

To cover the inadequacies / inefficiencies of these systems, many APIs and messaging standards have been developed, but we feel that they can only ever partly overcome the underlying problems. If the core infrastructure is inadequate, there is only so much patching you can do before you reach the limits of the platform.

To compound these problems, every institution / vendor has taken slightly different approaches to modelling their systems. The reasons for this are complex:

  • Some companies / institutions want to lock in their data to maintain competitive advantage. For example in the US, there is growing frustration that the big EHR vendors have built proprietary networks using public (HITECH) funding, and are charging exorbitant fees to build interfaces to access them. http://geekdoctor.blogspot.co.uk/2015/06/so-what-is-interoperability-anyway.html
  • The demands of local project timescales often make broad collaboration difficult.
  • Clinicians often exert considerable local power and insist on doing things “their way”. Clinical evidence is often lacking for such variation. They are also often not aware of such local variance.
  • Different branches of the profession may have quite different information granularity needs. For example the “Family history of breast cancer” will be recorded at different levels of depth in a patient-held record, a family doctor record and the records of a geneticist or research institution. For the information to flow between the systems, the models supporting these needs have to be aligned.
  • Recent changes in practice and technology mean that large volumes of data (“-omics” data such as a highly complex genetic profile – genomics) may need to be consumed and displayed to the patient. Therefore centres with the ability to process “omics” data now need the further ability to support a “big data” platform.

We are still relatively early on in the transition from paper to a fully electronic record, and we believe at a crucial point. The next steps will decide who controls the platforms and how easy it will be to innovate.

We also draw parallels to other systems that have already undergone this process. Within the medical field, PACS systems in medical imaging previously used proprietary models for their metadata. This metadata (including demographics and study details) is crucial to the study as it gives it meaning. It is only a few kilobytes in size per study and the total amount of metadata in a large PACS system will only be a few hundred gigabytes at the most.

As PACS systems have matured, we are now changing vendors and therefore migrating our data onto new systems. However due to the proprietary nature of the metadata, this has led to problematic corruption during transfer, which has resulted in significant clinical impact. Increasingly PACS platforms are required to be Vendor Neutral Archives (VNAs), based on standardised data and metadata formats to avoid vendor lock-in.

“We believe that current data storage should be within a vendor neutral architecture.”

When EHR platforms have matured to the same point, and people are considering moving to a second vendor, this migration will be a lot more problematic as we will be dealing with petabytes or more of meaningful data. Therefore we believe that current data storage should be within a vendor neutral architecture. This critical aspect is often underestimated, as the deleterious effects will only be evident in the medium term.

 

Over-reliance on terminologies alone

Terminologies such as LOINC, RxNorm, dm+d and SNOMED CT play a vital role in allowing shared meaning to be passed between systems, and help to power features such as clinical decision support.

More recently, terminologies have started to incorporate some aspects of “ontology” as a means of embedding even richer meaning into terminology, allowing complex computational analysis. This approach works well where the ontology is describing entities in the biomedical domain e.g. diseases, organisms, bodily structures, medications. These are scientific concepts grounded in “truth”, at least as far as is understood by current science.

Unfortunately the enthusiasm of many ontology researchers for this powerful approach has led to some promoting its use as a panacea for all clinical information modelling. But it is ill-suited for the description of the working arrangements, culturally and nationally determined clinical processes, and sheer human variation inherent in clinical practice, which largely evades logical analysis. In this regard the conceptualisation of medical practice has more in common with political or religious classification: ever-changing, nuanced and contentious. There are significant parallels with Clay Shirky’s critical analysis of the Semantic Web: http://www.shirky.com/writings/herecomeseverybody/semantic_syllogism.html

In addition, the development of logical ontologies requires a deep understanding of description logics, and of tooling and trains of thought, that are well beyond the expertise of average clinicians, system developers, or even many experienced clinical informaticians. Even if logics provided the answer, we simply do not have enough clinical logicians to build and maintain the required ontologies.

Current clinical informatics thinking has retreated from attempts to apply ontology wholesale – preferring instead to find ways of making optimal use of terminologies and ontologies within the context of conventional information models. Over time, ontological methods will undoubtedly grow in significance and ease of use, but an over-concentration on this approach during the past decade has been to the detriment of achieving practical healthcare interoperability.

OpenCancer now on NHS Code4Health

This has been a busy couple of months, both clinically and project-wise, leaving barely any time for updating the website. However, we are very pleased to announce that OpenCancer has been certified as an official NHS England Code4Health community:

https://code-4-health.org/opencancer

In the spirit of shared working, we have already begun to forge close working relationships with other Code4Health partners – these will be announced shortly. We also hope to host some services on the Code4Health platform – again, watch this space.

Exciting times indeed!

openEHR Modelling with XMind – Part 3 (Summary)

Part 3 is a quick reference summary of the more detailed descriptions in Part 1 and Part 2.

In this series of posts we document how we began the process of modelling our data using mindmaps. The aim is to document our processes in an open and reproducible manner.

 

Summary

  • Mindmap nodes:
    • The registry code for a data point is documented within the name of the node – eg “CR0520 – T category (final pretreatment)”.
    • The data point name is often the lowest level node in the tree – ie there are no further “daughter nodes” beneath the data point name.
      • However some modellers list the constrained value set of the data point using daughter nodes, grouped together with a boundary box.

 

  • Labels:
    • These are used as tags for searching. There can only be one label per node, but this can contain multiple tags.
    • Tags should be one word long (but words can be concatenated eg “COSDcore”) and separated by commas.
    • Document the different datasets / registries in which the datapoint resides via these labels.
      • Eg if a datapoint is in both “COSD core” and “Genomics”, it would be listed under both parent nodes (with the correct corresponding field code in each section), and would be tagged with both “COSDcore” and “Genomics” each time – ie “COSDcore, Genomics”.
    • Also document the chosen archetype with a label corresponding to the archetype name – eg “CLUSTER.tnm_staging_7th-prostate.v1”.

 

  • Notes:
    • Used for longer text notes about the data point.
    • We recommend that you put any codes from other data points in this section (eg “CR0620 in COSD is equivalent to 14944.1 in Genomics”).
    • You may also choose to document information about the field type, possible candidate archetypes, or background information including definitions of terms. If boundary boxes have not been used, the constrained value set may be defined here (eg Male / Female / Other).

 

  • Comments:
    • The comments section should only be used for discussion between different people working on the project. If any aspect of the data point needs to be changed, it should be documented within the name field, the labels, notes or markers.

 

  • Markers:
    • Please use the custom markers we have provided (“openEHR-infogather.xmp” – a zip file containing the file can be found here).
    • Full match – available archetype usable without modification.
    • No match – new archetype needed.
    • Partial match – available archetype usable, but will require modification.
    • Indeterminate – further analysis required.

 

More detailed descriptions can be found in Part 1 and Part 2 of this article.

openEHR Modelling with XMind – Part 2

Part 1 of this series can be found here. 

In this series of posts we document how we began the process of modelling our data using mindmaps. The aim is to document our processes in an open and reproducible manner.

In Part 1 we examined the initial stages of mindmapping. Here in Part 2, we look at the next steps, leading up to archetyping in openEHR.

 

Analysing the data points and identifying overlaps

The 2 images below display the different annotations we can use during the analysis phase. Our convention is to use labels, notes and comments for different functions.

[Images: a datapoint annotated with labels, notes and comments (plain and labelled versions)]

 

Labels

These are added to any datapoint by pressing F3, via the drop down menus at the top of the page (Modify > Label), or by right clicking on a data point (Insert > Label). Only one label can be added per node, but this may contain multiple different tags – these are listed in a box beneath the name of the data point.

[Image: labels containing the tags “COSDcore” and “Genomics” beneath data point CR0620]

For example, in the image above the data point CR0620 has been labelled with the tags “COSDcore” and “Genomics”. This indicates that the data point is present in both the COSD Core dataset and the Genomics dataset. Using the tags therefore allows us to highlight overlaps in datasets.

To keep things simple, please use single words without spaces for the tags (though you can use concatenated words – eg the terms “COSD” & “core” together become “COSDcore”). The tags are separated by commas, as seen in the image above.

Note:

  1. These tags can then be used to search through or filter down the data points, which is why they are so useful.
  2. The tags are also used for assigning to archetypes – see later.
  3. XMind automatically reorders the tags into alphabetical order.
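A minimal sketch of why the tags are so useful: once the comma-separated label is split into its tags, overlaps between datasets can be found mechanically. The parsing convention mirrors the one described above; this is our own illustration, not an XMind API.

```python
# A minimal sketch of using the comma-separated tags in an XMind label
# to find datapoints that overlap between datasets. The label strings
# follow the convention described above.
def parse_tags(label):
    """Split a single XMind label into its individual tags."""
    return {t.strip() for t in label.split(",") if t.strip()}

datapoints = {
    "CR0620 - Node status": "COSDcore, Genomics",
    "CR0520 - T category (final pretreatment)": "COSDcore",
}

# Which datapoints appear in both the COSD core and Genomics datasets?
overlap = [name for name, label in datapoints.items()
           if {"COSDcore", "Genomics"} <= parse_tags(label)]
print(overlap)  # -> ['CR0620 - Node status']
```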

 

Notes

These are added to any datapoint by pressing F4, via the drop down menus at the top of the page (Modify > Notes), or by right clicking on a data point (> Notes). A small notes icon appears to the right of the data point name when there are notes attached. Click on the icon to reveal the notes, as in the image below.

[Image: notes attached to data point CR0620]

The “notes” function is used for longer text notes about the data point. We recommend that you put any codes from other data points in this section (eg in the above image CR0620 in COSD is equivalent to 14944.1 in Genomics).

You may also choose to document information about the field type including value sets (eg Male / Female / Other) in the notes section.

 

Comments

These are added to any datapoint by clicking on the comments icon in the top toolbar, via the drop down menus at the top of the page (Modify > Comments), or by right clicking on a data point (> Comments). A small comments icon appears to the right of the data point name when there is a comment attached. Click on the icon to reveal the comments, as in the image below.

[Image: a comment attached to a data point]

Multiple comments can be added by different people. This section is therefore used for discussion during collaboration.

[Image: a comment thread between collaborators]

The comments section should only be used for discussion between different people working on the project. If any aspect of the data point itself needs to be changed, it should be documented using the name field, the labels, notes or markers.

 

Identification of commonality with existing archetypes

We aim to identify which data points can be collected using archetypes that are already available, and which cannot.

  1. Some data points will be completely covered by archetypes that have already been published (full match).
  2. Some data points will require modification of available archetypes (partial match).
  3. Some data points will require the development of new archetypes (no match).

To document this process we have created a series of markers which can be used in XMind. The markers can be imported by opening the file “openEHR-infogather.xmp” – a zip file containing the file can be found here. The markers should be used in the following way:

Full match – available archetype usable without modification.

No match – new archetype needed.

Partial match – available archetype usable, but will require modification.

Indeterminate – further analysis required.

The easiest way to use the markers is to open the Markers window – via the drop down menus at the top of the page (Window > Markers). The imported markers are usually at the bottom of the window, under the heading “openEHR tasks”.

[Image: the Markers window, with the imported “openEHR tasks” markers at the bottom]

The markers can be inserted by clicking on the node and then clicking on the required marker. The results will look similar to the image below.

[Image: markers applied to data points]

The name of a relevant archetype can also be added as a label – see the labels under CR0620 on the following image.

[Image: an archetype name added as a label under CR0620]

 

And there you are! An outline of how XMind can be used for modelling in openEHR. A summary of the processes outlined in Parts 1 & 2 can be found in Part 3.

openEHR Modelling with XMind – Part 1

Introduction

In this series of posts we document how we began the process of modelling our data using mindmaps. The aim is to document our processes in an open and reproducible manner.

Clinical modelling can appear daunting at first. To reduce the complexity, we break the process down into different steps:

  1. Define the data points – eg the datasets from different registries, or the different variables being collected in a study.
  2. Identify whether data points from the different sources overlap. For example, COSD, SACT and RTDS all collect demographic data for the same patient, though the details may vary slightly (eg address at diagnosis vs address during treatment).
  3. After identifying the overlaps, we define the unique data points that need to be collected.
  4. Identify which data points can be collected using archetypes which are already available, and which cannot:
    1. Some data points will be completely covered by archetypes that have already been published.
    2. Some data points will require modification of available archetypes.
    3. Some data points will require the development of new archetypes.
  5. Develop the new archetypes and modify existing archetypes as needed.
  6. Model the business process of data collection (eg how data is processed / collected in the context of an MDT).
  7. Produce templates (corresponding to the forms needed to collect data at each point of the pathway) using the archetypes.

There is a wide variation in the way that people approach steps 1 to 4. A mindmapping tool is often used for this purpose – we recommend XMind, available on Windows, Mac and Linux, as well as in portable packages. In this series of articles, we describe a standard process for using XMind to model in openEHR.

 

Defining the data points

The aim of the mind map is to give a good overview of how the data is structured, such that other modellers or domain experts can understand the relationship of the data points.

In urological cancer, we began by listing the different registries and data repositories as well as the needs of the different medical specialties – we plotted these as different nodes on the mindmap.

[Image: Level 1 – registries, data repositories and specialties as top-level nodes]

We then explored each of these areas to define the datasets. Below we expand on the different sections of the COSD dataset, forming subsets of further daughter nodes until we get to the data points themselves.

[Images: Levels 2–4 – expanding the COSD sections into daughter nodes, down to the data points themselves]

Below is a close up of the staging section from the image above, containing the data point names.

[Image: close-up of the staging section, showing the data point names]

Our convention is to document any codes for the data points within the node name – eg in the image above, “T category (final pretreatment)” has a code of CR0520 in the COSD core dataset.
Note: the text in yellow shows the labels attached to each datapoint (see below).

The data point name is usually the lowest level node in the tree – ie no further “daughter nodes” to the right of the data point name.

However, some modellers like to document the value sets for some data points on the next level down using a “boundary” function. Boundaries are boxes surrounding a group of nodes – a boundary can be created by highlighting the nodes and pressing Ctrl-B, by clicking the boundary icon on the top toolbar, via the drop down menus at the top of the page (Insert > Boundary), or by right clicking on a data point (> Boundary).

The example below shows the use of boundaries to define value sets:

  1. The data point “Joint” can have a value of “Left elbow”, “Right elbow”, “Left knee”, “Right knee”, “Left ankle” or “Right ankle”.
  2. The data point “Duration” can have a value of either “0: No swelling or less than 6 months” or “1: Greater than or equal to 6 months”.

[Image: boundaries used to define the value sets for “Joint” and “Duration”]

This latter approach (using boundaries) can be very useful when dealing with small projects / mindmaps. However, boundaries can result in a more cluttered appearance in larger mindmaps, so some modellers break down the large mindmap into smaller mindmaps in the later stages of the project.
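For comparison, here is a minimal sketch of the same value sets captured as plain data rather than boundary boxes – one way to keep the constraints explicit without cluttering a large mindmap. The structure is illustrative.

```python
# A minimal sketch of the value sets shown above, captured as data
# rather than as boundary boxes. Once a value set is explicit, entries
# can be validated against it.
VALUE_SETS = {
    "Joint": ["Left elbow", "Right elbow", "Left knee",
              "Right knee", "Left ankle", "Right ankle"],
    "Duration": ["0: No swelling or less than 6 months",
                 "1: Greater than or equal to 6 months"],
}

def validate(datapoint, value):
    allowed = VALUE_SETS.get(datapoint)
    return allowed is None or value in allowed  # unconstrained points pass

print(validate("Joint", "Left knee"))   # True
print(validate("Joint", "Left wrist"))  # False
```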

 

Part 2 of this series can be found here. In Part 2 we discuss how to categorise the data points, identify links between different datasets and use the mindmap to guide archetyping in openEHR.

 

OpenCancer is born!

OpenCancer was founded to tackle the problems we face with clinical data collection in cancer. There is universal agreement that good data collection underpins modern medical practice, yet we struggle to achieve this. OpenCancer aims to tackle a few of the root causes:

  1. Develop open data models that match clinical, research and reporting needs.
  2. Share best practice in the development of clinical pathways.
  3. Develop reusable technological elements by providing tools and APIs for use cases such as prognostic scoring and auto-identification of trial patients.

Data Modelling

We pride ourselves on a very detailed level of cancer data collection in the NHS. However, this process has been allowed to proliferate in an uncontrolled manner, such that different registries stipulate data items that are confusingly similar but not the same. For example, for a single cancer patient we are asked / obliged to submit closely related data fields to the following registries:

  • Cancer Outcomes and Services Dataset (COSD)
  • Systemic Anti-Cancer Therapy (SACT)
  • National Radiotherapy Dataset (RTDS)
  • National audits such as The National Prostate Cancer Audit (NPCA) and The British Association of Urological Surgeons (BAUS).

There are also newer projects such as The 100,000 Genomes project and The NIHR Health Informatics Collaborative, which overlap with these datasets and mandate the collection of even more data.

Very few hospitals collect this data in clinically-facing systems. Instead we separate this process into the back office, often employing non-clinical staff using non-clinical systems. This results in a significant overhead to the delivery of clinical care, and invariably leads to inconsistencies in data between the different systems.

“Data collection remains problematic in medicine.”

This has to change.

The only sustainable solution is to collect this data as a consequence of the delivery of care. This would not only address the issues of administrative overhead and data provenance, but would incentivise the clinical team to capture good quality data.

“If the data I enter is useful to me and my patient (and not just used for a registry), I will make sure I enter high quality data.”

The data would be collected once and reused through the clinical journey, but would also feed many different reporting datasets.  To achieve this goal, we must agree one data model for cancer, which needs to extend and be compatible with the data model for the delivery of all aspects of care across the NHS. This data model would be used in all clinical systems which collect cancer data, and would also underpin the structure of the data used for registry reporting and research.

“We need a single information model which underpins clinical care, registry data collection and research.”

Clinical Pathways

To translate the information models to real world use, one must consider the different steps of the clinical pathway, and the role of data at each step. Eg at the point of referral from the GP:

  • What data does the GP need to send to allow initial triage?
    • Can this be extracted automatically from the GP system?
    • Is there extra information (such as patient preference) which needs to be entered manually?
      • Who will enter this data?
  • What additional tests may be useful before the initial clinical visit?
    • Can these be requested automatically?
    • Can we pull these results in automatically?

“This pathway modelling is just as important as the information modelling to ensure good data capture, and most importantly good clinical care.”

With OpenCancer, we aim to document and share exemplars of good clinical modelling, to maximise learning from these projects. These pathway models may act as templates for institutions that are about to undergo a similar service redesign.

Reusable Resources (including APIs)

We aim to provide useful web-based resources for both patients and healthcare professionals, including:

  • Prognostic and risk scoring calculators.
  • Clinical trial eligibility checkers.
  • Coded lists for common cancer related classification systems such as the CTCAE (Common Terminology Criteria for Adverse Events).

These will be provided as:

  • Simple web pages and apps for patients and clinicians.
  • APIs for incorporation of this functionality into third party apps, generic electronic patient record systems or specialist systems.

Currently every health IT platform, from single departmental systems to megasuite electronic health records, has to build this functionality for their own system. This is expensive, inefficient and ultimately unsustainable at a national scale.
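As a purely hypothetical illustration of the kind of shared API we have in mind, the sketch below exposes a toy prognostic calculator over HTTP. The endpoint, parameters and scoring rule are invented for illustration; a real service would implement a validated, versioned prognostic model.

```python
# A minimal, entirely hypothetical sketch of a shared risk-scoring API,
# so that every vendor does not have to re-implement the calculator.
from fastapi import FastAPI

app = FastAPI(title="OpenCancer risk scoring (illustrative)")

@app.get("/risk/prostate")
def prostate_risk(psa: float, gleason: int, t_stage: str):
    # Placeholder rule for illustration only - not a clinical model.
    score = psa * 0.1 + gleason + (2 if t_stage in {"T3", "T4"} else 0)
    band = "high" if score > 10 else "low"
    return {"score": round(score, 2), "band": band,
            "model_version": "0.1-illustrative"}

# Run with: uvicorn risk_api:app
# Then: GET /risk/prostate?psa=6&gleason=7&t_stage=T2
```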

The provision of these tools at a national level has the potential to improve healthcare in an equitable fashion, and accelerate the benefits of the digitisation of cancer care.