What Are Data? A Categorization of the Data Sensitivity Spectrum☆,☆☆
Introduction
The definition of data might at first glance seem prosaic, but formulating a definitive and useful definition is surprisingly difficult. This question is important because of the protection given to data in law and ethics. Healthcare data are universally considered sensitive (and confidential), so it might seem that the categorisation of less sensitive data is relatively unimportant for medical data research. This paper will explore the arguments that this is not necessarily the case.
The terms data and information are sometimes used as synonyms and sometimes distinguished. Data protection legislation often does not distinguish the two concepts, except for using data to denote digitally stored information (although data protection laws may also protect non-digital data). Communication (the transfer of data) has over 100 different definitions [1], [2]. There are many discipline-specific definitions of information. One generic definition states that:
information is produced by all processes and it is the value of characteristics in the processes' output that are information [3].
Information has been more narrowly (and usefully) defined as “data that has been processed into a meaningful form” [4]. Other definitions include:
- Information is data that has been processed into a form that is meaningful to the recipient [5].
- Data is the raw material that is processed and refined to generate information [6].
- Information equals data plus meaning [7].
The definition of data acquires great importance in the area of data protection and privacy, as the question of what counts as personal data determines which data are protected by law and treated as confidential. The Open Data movement makes the issue even more important [8]. Whilst the status of healthcare data as sensitive personal data is enshrined in law, there are many other types of data used in data linkage studies. Their ‘capacity’ to contribute to the reidentification of subjects increases their potential sensitivity. Although there has been a great deal of discussion over which data are in the sensitive category, there is little examination of the different levels of sensitivity within the personal data category [9]. The two main ethical and legal issues in data protection, autonomy and privacy, are interrelated concepts, as encapsulated in the concept of informational self-determination. These issues are common to all healthcare research, although the harms are lesser in data research. Although there are a number of narrower issues and rights, they can all be rooted in one of these two concepts. Autonomy is relatively easy to define: it is the ability of a person to make decisions and act upon them. Privacy is more complex and covers several distinct concepts. The protection of physical space or person is not relevant to data protection, except by analogy. There have been several attempts to provide comprehensive taxonomies [10], [11], and several seminal works on privacy in the USA, starting with Warren and Brandeis in 1890, and developing via Prosser, Westin, and Altman [12], [13], [14], [15].
The German concept of ‘informational self-determination’ (which is part of the right to development of the personality in Article 2 of the German constitution) holds that one has the right to decide what personal information should be communicated to others and under what circumstances (Westin's definition of privacy) [14]. It arguably covers all the issues relating to privacy and data.
Consent and anonymisation are each neither necessary nor sufficient in law or ethics for the use of personal data for research. Requiring consent or anonymisation will not guarantee protection of data subjects in all circumstances [16]. Neither of these rights is absolute, and there are provisions to override them in particular circumstances (however, an express refusal of research use must be respected unless there are exceptional circumstances). The opportunity to opt out of even anonymised data processing indicates that respect for privacy alone may be insufficient (although the right being protected is unclear; it may simply reflect a desire to maintain the social licence). Where consent is impossible or impracticable, governance mechanisms will permit processing of personal data if it is necessary and proportionate.
Do the public understand what data are, how their data are used, who controls them, and who will have access to them? According to several studies, the answer is no [17], [18], [19], [20], [21], [22]. In particular, it has been found that privacy controls can give false reassurance to users [23]. Further, it has been shown that intentions do not translate into action [24]. The term “personal data” covers a massive range of data, from the totally trivial to the extremely intimate.
British law has lagged behind Continental jurisdictions and the USA in the protection of privacy (although there is a common law duty of confidentiality), but a right to privacy has now emerged in the UK through case law (Douglas v Hello!) or on the basis of the Article 8 right to a private and family life [26]. There are definite and well-recognized interests in protecting personal data and keeping them private. It is possible that some data about a person relate to nothing sufficiently important or personal, so that there is no strong justification for protecting them or keeping them private. However, there is a large amount of data whose value and sensitivity depend both on context and on the motivations and trustworthiness of the person accessing the data [27]. This is the justification for a wide definition of data for the purposes of legislation and regulation, but a more detailed classification of different types of data could potentially increase the utility and reduce the risks of Big Data if it allowed a more nuanced definition of personal information [28]. This paper will examine whether defining the categories of data more finely in the context of research could enable more flexible and responsive approaches to privacy and autonomy. This is important for the maintenance of the social licence whilst maximising the utility of data for research projects [25].
Etymology of data
The term ‘datum’ (plural data) comes from Latin, meaning “a thing given”. This says something about the nature of data – that it has its value in transmission. This concept of value in transmission can also be related to the legal status of a database in property as a “thing in action” (as opposed to a “thing in possession” – see below).
Definition of Big Data
The term ‘Big Data’ has been appropriated to mean many different things [29], [30]. Collection of vast quantities of data has become economically feasible due to the massive decrease in the cost of digital storage [31] and of data collection (due to the proliferation of smartphones and other devices) [32]. Big Data could be characterized by the value that emerges from vast amounts of data that are of little, if any, value in small quantities. This can result in the tragedy of the anticommons, i.e. the inability
Data versus personal data versus information versus metadata
Returning to the opening question, what are data? What are the distinctions between data, personal data, and information? The terms ‘data’ and ‘information’ are often used interchangeably in legislation and regulations. In terms of public understanding, there is a useful differentiation to be made between statistics, data, and information. The term “statistics” better conveys the nature of aggregated anonymised data that cannot be traced to any individual.
Legally, personal data are defined in
Ownership of data
An issue related to privacy is that of data “ownership”. However, control over data pertaining to oneself is about more than privacy: it is informational self-determination. One model that has been developed to acknowledge data subjects' rights is the Nordic MyData model [36]. Legal possession of a thing connotes the ability to exclude others from its possession or use. Legally, it is clear that data per se cannot be owned (Oxford v Moss, Your Response Ltd v Datateam Business Media Ltd).
Data in the public domain (including surveillance by CCTV)
Data that relate to what is publicly visible are not necessarily non-sensitive data. Hair, skin and eye colour can be observed by anyone (except in communities where veiling of the face and/or eyes is commonplace). However, a person may not wish to have certain characteristics disseminated to a wide audience. Race may be immediately apparent on the basis of skin colour, but it would nonetheless be widely seen as a protected characteristic. Certain characteristics are sensitive due
Context
The value of data often lies in linkage with other data, which creates new data. A woman of fertile age purchasing a certain type of lotion and certain supplements could be surmised to be pregnant on the basis of previous data analysis, so these simple purchases reveal a far more personal fact [44]. Aggregation of data produces new data, which the data subject may be uncomfortable with the data user knowing. This production of new knowledge is the aim of data linkage research and in this
Re-identification and temporal variations
Some data are proxies for the individual and readily recognized as potential re-identifiers – the registration mark of a car, for example. The tracking of the location of a vehicle has been deemed not an invasion of privacy that required a warrant in United States v Knotts on the grounds that this amounted to a following of the vehicle on the public highways. The combination of date of birth and postcode clearly allows easy identification of the individual. The combination of date of birth, sex
Data versus information
The term ‘data’ has several definitions, but the common theme is that data are more concrete and information is more abstract. Individually, data are rarely useful, although a date alone may be (an appointment, a holiday or an anniversary). However, ‘data’ is often used to mean specifically digitally stored quantified information, especially as the substrate for computerised processing. The term usually refers to unprocessed data, although the existence of the term “raw data” reveals that this is not always the case.
Information is formed
Motivation and trustworthiness of user
Data have value. Millennials understand this, and their digital life reflects this acceptance of an exchange of value. Users of social media understand that their data secure free use of a valuable platform. However, they may not realise exactly how much of their data is being gathered, and who is using it [55], [56]. In particular, automated profiling may result in decisions being made about them with no input, awareness of the process, or ability to detect erroneous information.
Empirical
Categories of data
Some data have no connection to any natural person. The only protection for such data would relate to the interests of the controller, e.g. proprietary interests in the intellectual property of a novel invention. The sensitivity of data categories may be reflected in legislation, but neither financial details nor criminal convictions are classified as sensitive under the Data Protection Directive.
The UK Anonymisation Network (UKAN) classifies data into four types (see Table 1).
This classification
Sensitivity of data
Table 2 details the potential spectrum of sensitivity for particular subcategories of data, with explanations in the appendix. The numbers in the cells refer to the relative frequency with which data would fall into that part of the spectrum for that data category e.g. occupation would rarely fall into the most sensitive data category.
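As a rough illustration of how such a spectrum could be represented in software, the sketch below stores, for each data category, a relative frequency for each sensitivity level from 0 to 10 and derives a mean sensitivity from it. The category names, the weights, and the helper functions `normalise` and `mean_sensitivity` are hypothetical and invented for this example; they are not the values or the method of Table 2.

```python
from typing import Dict, List

# Sensitivity levels run from 0 (least sensitive) to 10 (most sensitive).
LEVELS: List[int] = list(range(11))

# HYPOTHETICAL relative frequencies, one weight per sensitivity level.
# These numbers are invented for illustration and are NOT taken from Table 2.
spectrum: Dict[str, List[float]] = {
    "occupation":    [4, 4, 3, 3, 2, 2, 1, 1, 0, 0, 0],
    "health_record": [0, 0, 0, 0, 1, 1, 2, 3, 4, 4, 5],
}


def normalise(weights: List[float]) -> List[float]:
    """Convert relative frequencies into proportions that sum to 1."""
    if len(weights) != len(LEVELS):
        raise ValueError("expected one weight per sensitivity level (0-10)")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return [w / total for w in weights]


def mean_sensitivity(weights: List[float]) -> float:
    """Frequency-weighted average sensitivity level for one data category."""
    proportions = normalise(weights)
    return sum(level * p for level, p in zip(LEVELS, proportions))


if __name__ == "__main__":
    for category, weights in spectrum.items():
        print(f"{category}: mean sensitivity {mean_sensitivity(weights):.1f}")
```

A single mean hides the spread that the table is meant to convey, so any real use would probably keep the full distribution per category and treat the summary figure only as a convenience.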
Sensitivity increases from 0–10 with colours from green to red used as a visual aid. Relative frequency with which data would occur in any part of the sensitivity
Conclusion
The expectations of privacy differ radically from person to person. It is impossible for any definition of personal data to encompass the expectations of the entire population. The law is interpreted to reflect the reasonable expectations of the public. Recent research in the UK has enabled greater insight into public attitudes towards the use of their healthcare data in different contexts. There is a need for more empirical work with different populations.
The law is not prescriptive about the
References (80)
- et al. On moving targets and magic bullets: can the UK lead the way with responsible data linkage for health research? Int. J. Med. Inform. (2015)
- et al. The Icelandic genome debate. Trends Biotechnol. (2001)
- et al. Legal rights in data. Comput. Law Secur. Rev. (2011)
- The concept of communication. J. Commun. (1970)
- Communication Theories in Action: An Introduction (2000)
- A discipline independent definition of information. J. Am. Soc. Inf. Sci. (1997)
- (2003)
- et al. Management Information Systems: Conceptual Foundations, Structure, and Development (1985)
- et al. Systems Analysis and Design (1989)
- et al. Soft Systems Methodology in Action (1990)
- Benefits, adoption barriers and myths of open data and open government. Inf. Syst. Manag.
- Sharing health-related data: a privacy test? Nat. Partn. J. Genom. Med.
- A typology of privacy. Univ. Pa. J. Int. Law
- A taxonomy of privacy. Univ. Pa. Law Rev.
- The right to privacy. Harvard Law Rev.
- Privacy. Calif. Law Rev.
- Privacy and Freedom
- The Environment and Social Behaviour: Privacy, Personal Space, Territory, and Crowding
- A privacy paradox: social networking in the United States. First Monday
- Facebook and online privacy: attitudes, behaviors, and unintended consequences. J. Comput.-Mediat. Commun.
- To join or not to join: the illusion of privacy in social networks with mixed public and private user profiles
- Publicly private and privately public: social networking on YouTube. J. Comput.-Mediat. Commun.
- Alan Westin's privacy homo economicus. Wake For. Law Rev.
- Privacy and democracy in cyberspace. Vanderbilt Law Rev.
- Misplaced confidences: privacy and the control paradox. Soc. Psychol. Pers. Sci.
- The privacy paradox: personal information disclosure intentions versus behaviors. J. Consum. Aff.
- The social licence for research: why care.data ran into trouble. J. Med. Ethics
- Transforming breach of confidence? Towards a common law right of privacy under the Human Rights Act. Mod. Law Rev.
- Big data for all: privacy and user control in the age of analytics. Northwest. J. Technol. Intellect. Prop.
- 7 definitions of Big Data you should know about
- How should we do the history of Big Data. Big Data Soc.
- A history of storage cost (update)
- Big data: the management revolution. Harv. Bus. Rev.
- Privacy, Confidentiality, and Health Research
- Can patents deter innovation? The anticommons in biomedical research. Science
- Ministry of Transport and Communications (Finland), MyData – a Nordic Model for human-centered personal data management and processing
- Unpatient – why patients should own their medical data. Nat. Biotechnol.
- From street photography to face recognition: distinguishing between the right to be seen and the right to be recognized. Nova Law Rev.
- Fashion statement: designer creates line of drone-proof garments to protect privacy
☆ This article belongs to Big Data for Healthcare.
☆☆ Funding: This work was supported by a Horizon 2020 grant from the European Union ICT 2014/1.