This post expands on previous posts (see summary here) regarding the use of XMind for modelling data with openEHR. The aim is to document our processes in an open and reproducible manner.
Image Courtesy of Lesekreis (Own work) [CC0], via Wikimedia Commons
In the first step of our project, we collated all the datapoints required for the different registries and operational systems (from now on we call each registry or system a “section” of the mindmap). Many datapoints were duplicated across sections, so we cross-referenced them to identify the overlap. The next step is to use that information to identify the unique datapoints we need to capture. As the main mindmap is usually very large and cumbersome at this point, we break it up into separate sub-mindmaps for analysis. This can be challenging as:
- XMind does not support synchronisation between different mindmaps. Therefore if you make a change to any datapoint, you must ensure that the change is made to every mindmap in which the datapoint exists.
- Multiple team members may be working on the mindmaps. Therefore if changes need to be made to the mindmaps, there should ideally be one person overseeing the changes to prevent multiple forks of the mindmaps.
- We should ideally ensure that every datapoint has been copied across to the sub-mindmaps by the end of this process, which can be tricky when there are hundreds of datapoints. To facilitate this, we place a “Modelling has started” marker on each datapoint of the main mindmap once it has been copied over to a sub-mindmap. Towards the end of the modelling process we can search the main mindmap for any datapoints that have not been marked in this way, and are therefore yet to be modelled.
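The search for unmarked datapoints described in the last point can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Topic` class is a simplified in-memory stand-in for a mindmap topic (it does not read XMind files), the marker name `modelling-started` is hypothetical, and we assume that datapoints are the leaf topics of the map.

```python
from dataclasses import dataclass, field

# Hypothetical marker name; XMind's real marker identifiers may differ.
MODELLING_STARTED = "modelling-started"


@dataclass
class Topic:
    """Simplified stand-in for a mindmap topic (not XMind's file format)."""
    title: str
    markers: set = field(default_factory=set)
    children: list = field(default_factory=list)


def unmarked_datapoints(topic, path=()):
    """Yield the path of every leaf topic without the 'modelling started' marker."""
    here = path + (topic.title,)
    if not topic.children:
        # Assumption: a topic with no children is a datapoint.
        if MODELLING_STARTED not in topic.markers:
            yield " / ".join(here)
        return
    for child in topic.children:
        yield from unmarked_datapoints(child, here)


# Illustrative main mindmap with two sections.
main_map = Topic("Main", children=[
    Topic("COSD", children=[
        Topic("Date of diagnosis", markers={MODELLING_STARTED}),
        Topic("Tumour laterality"),
    ]),
    Topic("Genomics", children=[
        Topic("Event Date"),
    ]),
])

remaining = list(unmarked_datapoints(main_map))
print(remaining)
```

Anything listed in `remaining` has not yet been copied to a sub-mindmap and still needs to be modelled.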
After much discussion we agreed the following rules for breaking down the main mindmap into the sub-mindmaps.
- Decide on the initial categories for the sub-mindmaps e.g. Diagnosis, Pathology, Imaging, Demographics.
- Copy across the relevant datapoints to the sub-mindmap. It is useful to keep the datapoints from different sections of the main mindmap (e.g. COSD, Genomics) separated under similarly named sections on the sub-mindmap.
- On the main mindmap (not the sub-mindmap), place a “Modelling has started” marker on each datapoint once it has been copied over to a sub-mindmap.
- When we believe we have copied across all the relevant items to a sub-mindmap, we can begin the analysis step.
Begin from the top of the sub-mindmap and examine each datapoint. Using the labels and notes we can identify the duplicate datapoints that should have also been transferred from the other sections.
If one of the duplicate datapoints has not been transferred to the sub-mindmap, copy it across at this point and make sure it carries the “Modelling has started” marker on the main mindmap. We then need to remove all the duplicates from the sub-mindmap, so that we are only dealing with unique datapoints.
It is good practice to nominate one section as your base dataset (eg COSD in cancer) – these datapoints will be preferentially preserved, and duplicates from other datasets removed. This will make it slightly easier to keep track of all the datapoints in the future.
- We should be aware that there may have been errors in data synchronisation during the previous steps of mindmapping. Therefore we are very careful when we delete duplicate datapoints. We should examine all the labels and notes of all the duplicate datapoints, and make sure all unique information is copied across to the datapoint we keep. This information (eg mappings to datapoints in different datasets, ordinality, issues to be resolved etc) is very useful for archetyping and future data mapping, so it is vital that it is not inadvertently lost.
The notes should include all the registry codes to which the datapoint maps, including the name of the parent datapoint. For example, in the COSD section the datapoint name itself will contain the COSD code, but we should ensure that the attached notes also list the COSD code, for completeness.
- Once we have processed a datapoint as above (consolidated the information from its duplicates, and then deleted those duplicates), we mark the datapoint with a blue flag to show that we have completed the analysis phase.
But if there are any questions to resolve (e.g. regarding possible mappings, or perhaps even possible errors detected), we mark it with a red flag instead.
- If we are unsure whether one of the duplicate fields is truly a duplicate, we do not delete it, and we also mark it with a red flag. E.g. if there are 4 definitely identical datapoints, and 2 further possible (but not definite) duplicates:
– Keep one of the “definites” and the 2 “possibles”. Mark these with red flags.
– Delete the other 3 datapoints (the redundant “definites”).
- After you have formed your main sub-mindmaps, go back and check if all datapoints on the main mindmap have a marker. If not, they have not yet been processed.
If this unprocessed datapoint does not fit any of the categories of the existing sub-mindmaps, a new sub-mindmap may be needed to house these orphan datapoints (e.g. called “Miscellaneous”).
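The consolidation rules above (prefer the base section, copy across every unique note, flag anything uncertain) can be sketched as follows. This is a minimal illustration, not our actual tooling: the `Datapoint` class, the `consolidate` function, the section name `Audit` and the note strings (including the `EXAMPLE-…` codes) are all invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Datapoint:
    name: str
    section: str                       # registry or system it came from
    notes: list = field(default_factory=list)
    flag: str = ""                     # "blue" = analysed, "red" = needs review


def consolidate(definites, possibles, base_section, open_questions=False):
    """Merge definite duplicates into one datapoint; keep possibles for review."""
    # Prefer the copy from the base section; otherwise keep the first one.
    keeper = next((d for d in definites if d.section == base_section), definites[0])
    # Copy every unique note across before the other copies are discarded.
    for dup in definites:
        if dup is keeper:
            continue
        for note in dup.notes:
            if note not in keeper.notes:
                keeper.notes.append(note)
    # Red flag if anything is unresolved, blue flag otherwise.
    unresolved = open_questions or bool(possibles)
    keeper.flag = "red" if unresolved else "blue"
    for p in possibles:
        p.flag = "red"
    return [keeper] + list(possibles)


# Illustrative data only: codes and notes are placeholders.
definites = [
    Datapoint("PSA", "Genomics", notes=["Genomics: EXAMPLE-2"]),
    Datapoint("PSA level", "COSD", notes=["COSD: EXAMPLE-1"]),
    Datapoint("PSA", "Audit", notes=["Genomics: EXAMPLE-2", "Ordinality: 0..1"]),
]
possibles = [Datapoint("Free PSA", "Audit", notes=["Audit: EXAMPLE-3"])]

kept = consolidate(definites, possibles, base_section="COSD")
print([(d.name, d.section, d.flag) for d in kept])
print(kept[0].notes)
```

Here the COSD copy survives because COSD is the nominated base section, it inherits the notes of the two deleted copies, and both it and the possible duplicate are red-flagged because a question remains open.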
This approach has helped us to begin tackling the large datasets. However, minor problems will undoubtedly arise, and we will document them below as they do:
- Some datapoints from different registries refer to the same data, but at different time points (e.g. “PSA at diagnosis” vs “PSA at start of treatment”). We would expect our longitudinal data model to handle both datapoints using the same archetype. We treat these as duplicate datapoints, but record the timing information in the notes section, as it will be useful for future mappings. We retain only one of the datapoints, and mark it with a red flag to indicate that the mapping will not be straightforward.
- Some event metadata for items (e.g. “Event Date” in Genomics) can be difficult to map to other datapoints. We would expect this type of metadata to be handled by the openEHR reference model, so we don’t dwell on it for too long.