Data, when responsibly unlocked, offers huge promise to tackle some of the most pressing issues in global development. However, the process of unlocking data is a dance between policy and technology. With complex considerations and nuances coming from both sides, I’ve found it’s essential we all have a clear — and collective — understanding of the key language and the ideas underlying it.
With that in mind, I’ve chosen to write about a few general data-related terms through a government data exchange lens, with policymakers particularly in mind. These fall into two different categories — each critical to fostering a shared understanding and advancing the promise of data to benefit all.
- The first covers terms that get used in conversations where we may have multiple definitions at play.
- The second deals with the different types of data that a data exchange system will need to handle, each necessitating different policy positions and approaches.
I’ve steered clear of formal definitions, and instead, I’ve opted for (hopefully) something a little simpler and more explanatory. Let’s call it a basic primer.
So, let’s dive into some of the key data terms.
Data exchange — At the highest level, a data exchange system is a set of technologies, standards, and policies that facilitate the sharing of data between different parties. Importantly, a data exchange is a shared platform, meaning that it supports multiple data sources and data types.
By joining a data exchange (and with proper permission) those who want data can use the same method to request data from each of the available data sources, instead of having to develop independent practices for each. This makes for simpler system designs.
Consider the following example. A citizen’s address can be made accessible via the exchange and queried every time it is needed. This means each government agency doesn’t keep its own copy of the address, thus avoiding duplicates. Duplicates in data are a problem, especially for data that can change: if the person moves to a new address, there is a high likelihood that not all copies will get properly updated when they are stored agency by agency. With the exchange, if a change needs to happen, it only needs to happen once. It also means the online forms the citizen needs to fill out become simpler, as the system can simply pull and verify the citizen’s address instead of making them enter it again. Wins all around.
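The address example above can be sketched in a few lines of code. This is a toy illustration, not a real exchange implementation — all names (`DataExchange`, `register_source`, the `population-register` source) are hypothetical:

```python
# A minimal sketch of the data-exchange idea: agencies register as data
# sources, and requestors use one uniform query method instead of a
# bespoke integration per agency. All names here are hypothetical.

class DataExchange:
    def __init__(self):
        self.sources = {}  # source name -> lookup function

    def register_source(self, name, lookup):
        self.sources[name] = lookup

    def query(self, source, record_id):
        """One uniform way to request data from any registered source."""
        return self.sources[source](record_id)

# The population register holds the single authoritative copy of addresses.
addresses = {"citizen-42": "12 Harbour St"}
exchange = DataExchange()
exchange.register_source("population-register", addresses.get)

# Any agency asks the exchange; nobody keeps a duplicate copy.
print(exchange.query("population-register", "citizen-42"))  # 12 Harbour St

# When the citizen moves, the change happens once, at the source...
addresses["citizen-42"] = "7 Hilltop Rd"
# ...and every later query sees the update.
print(exchange.query("population-register", "citizen-42"))  # 7 Hilltop Rd
```

Because every requestor goes through the same `query` method, adding a new data source doesn’t require every agency to build a new integration.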
(Un)structured data — Data can generally be put into one of two buckets: structured data and unstructured data.
Structured data is made up of records that are consistent in terms of what the data fields / attributes are and the value types. It works well in a spreadsheet, laid out in rows and columns. Unstructured data, on the other hand, lacks conformity to a consistent record format.
For example, in a library you have items like books, audio/visual materials, maps, and more (one library in a town where I used to live had a telescope available for check-out). In the library, there will also be a catalog system with records for each of the available items, which is used to track them. In this case, the catalog system contains structured data (i.e., the list of all the items) while the items themselves compose the library’s unstructured data (each book is unique).
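The library analogy translates directly into code. In this hypothetical sketch, the catalog records are structured (every record has the same fields), while the items they describe are unstructured:

```python
# Hypothetical library catalog: the catalog records are structured data
# (consistent fields per record), while the items they describe -- full
# book texts, maps, even a telescope -- are unstructured.

catalog = [  # structured: every record has the same fields
    {"item_id": 1, "title": "Moby-Dick", "kind": "book"},
    {"item_id": 2, "title": "Town trail map", "kind": "map"},
    {"item_id": 3, "title": "8-inch telescope", "kind": "equipment"},
]

# Unstructured: the content of an item follows no shared record format.
book_text = "Call me Ishmael. Some years ago..."

# Structured data can be queried field by field, spreadsheet-style:
books = [r["title"] for r in catalog if r["kind"] == "book"]
print(books)  # ['Moby-Dick']
```

Notice that the field-by-field query only works on the catalog; there is no equivalent way to query the free text of every book.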
Sometimes structured and unstructured data are equated to quantitative and qualitative data (respectively), but while related, they aren’t the same thing and should not be used as interchangeable terms.
Data model — From the data exchange point of view, the data model describes the structure of the data, the types of data it contains, and the relationships between the different parts of the data. The data model can describe both structured and unstructured data, as well as relationships between the two types. In short, the data model answers the request, “Tell me about the data that I have.”
For example, the data model for a medical patient is going to include different record types, such as patient identity information, office visits, diagnoses, prescribed medication, medical procedures, test results, etc. It will define the details in these records, like the fact that a prescribed medication will include the medication name and dosage information, among other things. The model will also define the relationships between the different record types (e.g., that diagnoses will be attached to prescribed medication or medical procedures). You can think of this as a glorified mind map.
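A slice of that patient model can be sketched as follows. The record types and field names here are made up for illustration; the point is that the model defines both the records and the links between them:

```python
# A toy sketch of the patient data model described above, with
# illustrative field names. It captures record types and the
# relationships between them (e.g., a diagnosis can be linked to
# prescribed medications).
from dataclasses import dataclass, field

@dataclass
class Medication:
    name: str
    dosage: str

@dataclass
class Diagnosis:
    description: str
    medications: list = field(default_factory=list)  # relationship

@dataclass
class Patient:
    patient_id: str
    name: str
    diagnoses: list = field(default_factory=list)    # relationship

patient = Patient("p-1", "A. Citizen")
dx = Diagnosis("Hypertension",
               medications=[Medication("lisinopril", "10 mg daily")])
patient.diagnoses.append(dx)
print(patient.diagnoses[0].medications[0].name)  # lisinopril
```

Reading the class definitions top to bottom is much like reading the mind map: each record type lists its details and points to the records it relates to.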
Schema — This is the formalized, machine-readable definition of the structured-data portions of a data model. It gets into the pedantic details. So, while the data model can say that a person record has an age, which is a number, the schema is going to say that the ‘person’ record has an attribute named ‘age’ (all lowercase) which is going to be a whole number between 0 and 130 (putting a little padding in there, as the longest-lived person on record reached the ripe old age of 122).
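That age rule can be expressed as a simple check. Real systems would use a formal schema language (such as JSON Schema) rather than hand-written code, but this hand-rolled sketch shows what the schema pins down that the data model does not:

```python
# A hand-rolled check expressing the schema rule from the text: the
# 'person' record must have an attribute named 'age' (all lowercase)
# whose value is a whole number between 0 and 130.

def validate_person(record):
    age = record.get("age")
    # bool is a subclass of int in Python, so exclude it explicitly
    if not isinstance(age, int) or isinstance(age, bool):
        return False
    return 0 <= age <= 130

print(validate_person({"age": 35}))   # True
print(validate_person({"age": 150}))  # False (out of range)
print(validate_person({"Age": 35}))   # False (wrong field name)
```

The last case is exactly the interoperability trap: the data is fine, but the field name doesn’t match what the schema expects.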
Schemas are important for interoperability — the ability of two systems to communicate and understand one another — and can be a source of issues. What is the correct name for our age field? Is it ‘age’? ‘alter’? ‘edad’? ‘umri’? 年? आयु? tuổi? It depends. What happens, then, when two systems that have made different naming choices try to share data? We can run into problems.
Metadata — Metadata is information that describes data, or in other words, data about data. It can provide information about what the data represents, when it was collected and by whom, licensing and copyright information, etc. A schema is also a form of metadata.
Metadata needs to be carefully considered in system design, as sometimes the metadata is just (or nearly) as valuable as the data itself, especially with unstructured data. For example, the metadata around a phone call includes:
- The number that placed the call
- The number that received the call
- When the call was made
- The duration of the call
In a criminal investigation, even if the data (the contents of the call itself) can’t be obtained by law enforcement, the metadata conveys a lot of information from which inferences can be made. Consequently, just as data needs protection, so does the metadata related to it.
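The call metadata listed above can be sketched as a record kept entirely separate from the data (the call audio itself). Field names are illustrative:

```python
# The phone-call metadata from the bullets above, as a standalone
# record. Note that none of the call's contents appear here, yet the
# record still says a great deal.
call_metadata = {
    "caller": "+1-555-0100",            # the number that placed the call
    "callee": "+1-555-0199",            # the number that received the call
    "started": "2024-05-01T14:03:00Z",  # when the call was made
    "duration_seconds": 312,            # the duration of the call
}

# Even without the contents, inferences are possible:
sustained_call = call_metadata["duration_seconds"] > 60
print(sustained_call)  # True: a sustained conversation, not a misdial
```

This is why the post argues metadata needs protection too: the record alone reveals who spoke to whom, when, and for how long.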
Protocols — When we as humans greet each other in certain circumstances, there are ritualized actions that have been developed over time. Consider the handshake, the bow, the embrace, the kissing of cheeks. We sometimes refer to these as protocols, as in “the proper protocol to greet a dignitary.”
For data, specifically the exchange of data, we have something similar. When a data requestor wishes to receive data from a data provider, there is a set process that must be followed to first contact the data provider, to prove who they are, to ask for the data, and to receive the data (as well as deal with any failures that might happen along the way with any of those steps). Like with our human greetings, this process is also known as a protocol.
Protocols vary, but what is important is agreement on the protocol that will be used between two or more systems. Some protocols gain such widespread adoption that they become standards. Things will not work if both parties aren’t using the same protocol. Interestingly, some protocols used for exchanging data mimic human interaction protocols in surprising ways. For example, SMTP — which is a protocol used to send email from your computer to the email server that will deliver it — starts with your email client connecting to the server and literally saying hello (abbreviated as “HELO” or “EHLO”). The protocol then requires the server to acknowledge that greeting.
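The SMTP greeting can be made concrete with a small simulation. A real client would open a network socket to a mail server; here a tiny stand-in function plays the server’s part so the hello-and-acknowledge exchange is visible:

```python
# A simulated fragment of the SMTP greeting described above. This is a
# stand-in, not a real mail server; in the actual protocol the 250 code
# is the server's positive acknowledgment of the client's HELO/EHLO.

def smtp_server(line):
    """Respond to the opening line of a (simulated) SMTP session."""
    if line.startswith(("HELO", "EHLO")):
        return "250 Hello, pleased to meet you"
    return "500 Syntax error"

# The client literally says hello, and the protocol requires an answer.
greeting = "EHLO client.example.org"
reply = smtp_server(greeting)
print(reply)  # 250 Hello, pleased to meet you
```

If the client skipped the greeting or garbled it, the server would refuse to proceed — which is exactly the point: both parties must follow the same protocol for the exchange to work.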
This list presents only some of the many key data terms in use today. There’s so much more to delve into around data, its use, and ways to build systems that allow data to be leveraged to create better outcomes for people. By starting to unpack terms like these, we can begin to ensure a collective understanding of both the key language and the ideas underlying them.
Such topics also help reinforce the point that because data is multifaceted, we must promote a multifaceted approach to data policies. For example, the way metadata is handled from a policy point of view won’t be the same as the way the data model is handled. The same is true for the level of specificity — should policy get down to the level of specifying protocols, or should that be left up to the implementers? These are decisions that will need to be made.
With so many considerations, one thing is clear: better understanding leads to better outcomes.
In the future, I’ll be talking about other topics I find interesting, but would love to hear from you about areas in which you have particular interest. Please reach out with any questions or comments to info@dial.global.