Big Data Research, Volume 12, July 2018, Pages 49-59

What Are Data? A Categorization of the Data Sensitivity Spectrum

https://doi.org/10.1016/j.bdr.2017.11.001

Abstract

The definition of data might at first glance seem prosaic, but formulating a definitive and useful definition is surprisingly difficult. This question is important because of the protection given to data in law and ethics. Healthcare data are universally considered sensitive (and confidential), so it might seem that the categorization of less sensitive data is relatively unimportant for medical data research. This paper will explore the arguments that this is not necessarily the case, and why recognizing this matters.

The categorization of data and information requires re-evaluation in the age of Big Data in order to ensure that the appropriate protections are given to different types of data. The aggregation of large amounts of data requires an assessment of the harms and benefits that pertain to large datasets linked together, rather than simply assessing each datum or dataset in isolation. Big Data produce new data via inferences, and this must be recognized in ethical assessments. We propose a schema for a granular assessment of data categories. The use of schemata such as this will assist decision-making by providing research ethics committees and information governance bodies with guidance about the relative sensitivities of data. This will ensure that appropriate and proportionate safeguards are provided for data research subjects and reduce inconsistency in decision making.

Introduction

The definition of data might at first glance seem prosaic, but formulating a definitive and useful definition is surprisingly difficult. This question is important because of the protection given to data in law and ethics. Healthcare data are universally considered sensitive (and confidential), so it might seem that the categorization of less sensitive data is relatively unimportant for medical data research. This paper will explore the arguments that this is not necessarily the case.

The terms data and information are sometimes used as synonyms and sometimes distinguished. Data protection legislation often does not distinguish the two concepts, except in using 'data' to denote digitally stored information (although data protection laws may also protect non-digital data). Defining data is surprisingly difficult: communication (the transfer of data) alone has over 100 different definitions [1], [2]. There are many discipline-specific definitions of information. One generic definition states that:

information is produced by all processes and it is the value of characteristics in the processes' output that are information [3].

Information has been more narrowly (and usefully) defined as “data that has been processed into a meaningful form” [4]. Other definitions include

Information is data that has been processed into a form that is meaningful to the recipient [5].

Data is the raw material that is processed and refined to generate information [6].

Information equals data plus meaning [7].
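To make the 'data plus meaning' formulation concrete, the following minimal sketch (a hypothetical Python example, not drawn from the paper) shows how a bare value only becomes information once it is paired with context such as subject, quantity, unit and time:

    from dataclasses import dataclass
    from datetime import date

    # A bare value: a datum with no meaning attached.
    raw_datum = 37.9

    @dataclass
    class Reading:
        """A datum together with the context that gives it meaning."""
        subject_id: str    # who the value relates to
        quantity: str      # what was measured
        unit: str          # how the value is expressed
        value: float       # the raw datum itself
        recorded_on: date  # when it was captured

    # The same number, processed into a meaningful form: information.
    info = Reading(subject_id="patient-042", quantity="body temperature",
                   unit="degrees Celsius", value=raw_datum,
                   recorded_on=date(2017, 6, 1))

    print(f"{info.quantity} of {info.subject_id}: {info.value} {info.unit}")

The bare number is trivial on its own; attached to a subject and a quantity it becomes potentially sensitive personal information.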

The definition of data acquires great importance in the area of data protection and privacy, as the question of what counts as personal data determines which data are protected by law and are confidential. The Open Data movement makes the issue even more important [8]. Whilst the status of healthcare data as sensitive personal data is enshrined in law, there are many other types of data used in data linkage studies. Their 'capacity' to contribute to the reidentification of subjects increases their potential sensitivity. Although there has been a great deal of discussion over which data belong in the sensitive category, there is little examination of the different levels of sensitivity within the personal data category [9].

The two main ethical and legal issues in data protection, autonomy and privacy, are interrelated concepts, as encapsulated in the notion of informational self-determination. These issues are common to all healthcare research, although the harms are lesser in data research. Although there are a number of narrower issues and rights, they can all be rooted in one of these two concepts. Autonomy is relatively easy to define: it is the ability of a person to make decisions and act upon them. Privacy is more complex and covers several distinct concepts. The protection of physical space or person is not relevant to data protection, except by analogy. There have been several attempts to provide comprehensive taxonomies [10], [11], and several seminal works on privacy in the USA, starting with Warren and Brandeis in 1890 and developing via Prosser, Westin, and Altman [12], [13], [14], [15]. The German concept of 'informational self-determination' (part of the right to development of the personality in Article 2 of the German constitution) holds that one has the right to decide what personal information should be communicated to others and under what circumstances (Westin's definition of privacy) [14]; it arguably covers all the issues relating to privacy and data.

Neither consent nor anonymisation is necessary or sufficient, in law or ethics, for the use of personal data in research. Requiring consent or anonymisation will not guarantee protection of data subjects in all circumstances [16]. Neither of these rights is absolute, and there are provisions to override them in particular circumstances (however, an express refusal of research use must be respected unless there are exceptional circumstances). The opportunity to opt out of even anonymised data processing indicates that respect for privacy alone may be insufficient (although the right being protected is unclear – it may simply reflect a desire to maintain the social licence). Where consent is impossible or impracticable, governance mechanisms will permit processing of personal data if this is necessary and proportionate.

Do the public understand what data are, how their data are used, who controls them, and who will have access to them? According to several studies, the answer is no [17], [18], [19], [20], [21], [22]. In particular, it has been found that privacy controls can give users false reassurance [23]. Further, it has been shown that intentions do not translate into action [24]. The term "personal data" covers a massive range of data, from the totally trivial to the extremely intimate.

British law has lagged behind Continental jurisdictions and the USA in the protection of privacy (although there is a common law duty of confidentiality), but a right to privacy has now emerged in the UK through case law (Douglas v Hello!) or based on Article 8 rights to a private and family life [26]. There are definite and well-recognized interests in protecting personal data and keeping them private. It is possible that some data about a person relate to nothing sufficiently important or personal, so that there is no strong justification for them to be protected or kept private. However, for a large amount of data, value and sensitivity depend both on context and on the motivations and trustworthiness of the person accessing the data [27]. This is the justification for a wide definition of data for the purposes of legislation and regulation, but a more detailed classification of different types of data could potentially increase the utility and reduce the risks of Big Data if it allowed a more nuanced definition of personal information [28]. This paper will examine whether more finely defining the categories of data in the context of research could enable more flexible and responsive approaches to privacy and autonomy. This is important for the maintenance of the social licence whilst maximising the utility of data for research projects [25].

Section snippets

Etymology of data

The term ‘datum’ (plural ‘data’) comes from Latin, meaning “a thing given”. This says something about the nature of data – that their value lies in transmission. This concept of value in transmission can also be related to the legal status of a database in property as a “thing in action” (as opposed to a “thing in possession” – see below).

Definition of Big Data

The term ‘Big Data’ has been appropriated to mean many different things [29], [30]. Collection of vast quantities of data has become economically feasible due to the massive decrease in the cost of digital storage [31] and of data collection (due to the proliferation of smartphones and other devices) [32]. Big Data could be characterized by the value of vast amounts of data that would be of little if any value in small quantities. This can result in the tragedy of the anticommons, i.e. the inability

Data versus personal data versus information versus metadata

Returning to the opening question, what are data? What are the distinctions between data, personal data, and information? The terms ‘data’ and ‘information’ are often used interchangeably in legislation and regulations. In terms of public understanding, there is a useful differentiation to be made between statistics, data, and information. The term “statistics” better conveys the nature of aggregated anonymised data that cannot be traced to any individual.
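As an illustration of that distinction, the toy sketch below (hypothetical values, not from the paper) aggregates individual-level records into simple counts, the kind of output better described as statistics (with the usual caveat that very small counts can still be disclosive):

    from collections import Counter

    # Individual-level records: personal data (hypothetical toy values).
    records = [
        {"patient": "A", "postcode_district": "CF14", "diagnosis": "asthma"},
        {"patient": "B", "postcode_district": "CF14", "diagnosis": "asthma"},
        {"patient": "C", "postcode_district": "SA1", "diagnosis": "diabetes"},
    ]

    # Aggregated, anonymised output: statistics that no longer point to any
    # one individual (provided the counts are not so small as to be disclosive).
    diagnosis_counts = Counter(r["diagnosis"] for r in records)
    print(dict(diagnosis_counts))  # {'asthma': 2, 'diabetes': 1}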

Legally, personal data are defined in

Ownership of data

An issue related to privacy is the issue of data “ownership”. However, control over data pertaining to oneself is about more than privacy – it is informational self-determination. One model that has been developed to acknowledge data subjects' rights is the Nordic MyData model [36]. Legal possession of a thing connotes the ability to exclude others from its possession or use. Legally, it is clear that data per se cannot be owned (Oxford v Moss, Your Response Ltd v Datateam Business Media Ltd).

Data in the public domain (including surveillance by CCTV)

Data that relate to what is publicly visible are not necessarily non-sensitive. Hair, skin and eye colour can be observed by anyone (except in communities where veiling of the face and/or eyes is commonplace). The counter-example is that a person may not wish to have certain characteristics disseminated to a wide audience. Race may be immediately apparent on the basis of skin colour, but again this would be widely seen as a protected characteristic. Certain characteristics are sensitive due

Context

The value of data often lies in linkage with other data, thereby creating new data. A woman of fertile age purchasing a certain type of lotion and certain supplements could be surmised to be pregnant on the basis of previous data analysis, so these simple purchases reveal a far more personal fact [44]. Aggregation of data produces new data, which the data subject may be uncomfortable with the data user knowing. This production of new knowledge is the aim of data linkage research and in this

Re-identification and temporal variations

Some data are proxies for the individual and readily recognized as potential re-identifiers – the registration mark of a car, for example. In United States v Knotts, tracking the location of a vehicle was deemed not to be an invasion of privacy requiring a warrant, on the grounds that it amounted to following the vehicle on public highways. The combination of date of birth and postcode clearly allows easy identification of the individual. The combination of date of birth, sex

Data versus information

Data have several definitions, but the common theme is that data are more concrete and information is more abstract. Individually, data are rarely useful, although a date alone may be – as an appointment, a holiday or an anniversary. However, ‘data’ is often used to mean specifically digitally stored, quantified information, especially as the substrate for computerised processing. The term usually refers to unprocessed data, although the qualifier “raw data” reveals that this is not always the case.

Information is formed

Motivation and trustworthiness of user

Data have value. Millennials understand this, and their digital lives reflect an acceptance of this exchange of value. Users of social media understand that their data secure free use of a valuable platform. However, they may not realise exactly how much of their data are being gathered, and who is using them [55], [56]. In particular, automated profiling may result in decisions being made about them without their input, without awareness of the process, and without the ability to detect erroneous information.

Empirical

Categories of data

Some data have no connection to any natural person. The only protection for such data would relate to the interests of the controller, e.g. proprietary interests in the intellectual property of a novel invention. The sensitivity of data categories may be reflected in legislation, but neither financial details nor criminal convictions are classified as sensitive under the Data Protection Directive.

The UK Anonymisation Network (UKAN) classifies data into four types (see Table 1).

This classification

Sensitivity of data

Table 2 details the potential spectrum of sensitivity for particular subcategories of data, with explanations in the appendix. The numbers in the cells refer to the relative frequency with which data would fall into that part of the spectrum for that data category; occupation, for example, would rarely fall into the most sensitive category.
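One way a schema of this kind could be encoded for use by an ethics committee or governance tool is sketched below; the category names, bands and frequency labels are illustrative assumptions, not the contents of Table 2:

    from enum import Enum

    class Frequency(Enum):
        NEVER = 0
        RARELY = 1
        SOMETIMES = 2
        OFTEN = 3

    # Sensitivity runs from 0 (least) to 10 (most sensitive). For each data
    # category we record how often instances of that category would fall into
    # each band of the spectrum (values here are placeholders, not Table 2).
    sensitivity_schema = {
        "occupation": {(0, 3): Frequency.OFTEN,
                       (4, 7): Frequency.SOMETIMES,
                       (8, 10): Frequency.RARELY},
        "diagnosis":  {(0, 3): Frequency.RARELY,
                       (4, 7): Frequency.OFTEN,
                       (8, 10): Frequency.SOMETIMES},
    }

    def plausible_bands(category):
        """Return the sensitivity bands a category can realistically occupy."""
        return [band for band, freq in sensitivity_schema[category].items()
                if freq is not Frequency.NEVER]

    print(plausible_bands("occupation"))  # [(0, 3), (4, 7), (8, 10)]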

Sensitivity increases from 0 to 10, with colours from green to red used as a visual aid. Relative frequency with which data would occur in any part of the sensitivity

Conclusion

The expectations of privacy differ radically from person to person. It is impossible for any definition of personal data to encompass the expectations of the entire population. The law is interpreted to reflect the reasonable expectations of the public. Recent research in the UK has enabled greater insight into public attitudes towards the use of their healthcare data in different contexts. There is a need for more empirical work with different populations.

The law is not prescriptive about the

References (80)

  • M. Janssen et al., Benefits, adoption barriers and myths of open data and open government, Inf. Syst. Manag. (2012)
  • S.O. Dyke et al., Sharing health-related data: a privacy test?, Nat. Partn. J. Genom. Med. (2016)
  • B. Koops et al., A typology of privacy, Univ. Pa. J. Int. Law (2016)
  • D.J. Solove, A taxonomy of privacy, Univ. Pa. Law Rev. (2006)
  • S.D. Warren et al., The right to privacy, Harvard Law Rev. (15 Dec. 1890)
  • W.L. Prosser, Privacy, Calif. Law Rev. (1960)
  • A. Westin, Privacy and Freedom (1970)
  • I. Altman, The Environment and Social Behaviour: Privacy, Personal Space, Territory, and Crowding (1975)
  • S.B. Barnes, A privacy paradox: social networking in the United States, First Monday (2006)
  • B. Debatin et al., Facebook and online privacy: attitudes, behaviors, and unintended consequences, J. Comput.-Mediat. Commun. (2009)
  • E. Zheleva et al., To join or not to join: the illusion of privacy in social networks with mixed public and private user profiles
  • P.G. Lange, Publicly private and privately public: social networking on YouTube, J. Comput.-Mediat. Commun. (2007)
  • C.J. Hoofnagle et al., Alan Westin's privacy homo economicus, Wake For. Law Rev. (2014)
  • P.M. Schwartz, Privacy and democracy in cyberspace, Vanderbilt Law Rev. (1999)
  • L. Brandimarte et al., Misplaced confidences: privacy and the control paradox, Soc. Psychol. Pers. Sci. (2013)
  • P.A. Norberg et al., The privacy paradox: personal information disclosure intentions versus behaviors, J. Consum. Aff. (2007)
  • P. Carter et al., The social licence for research: why care.data ran into trouble, J. Med. Ethics (2015)
  • G. Phillipson, Transforming breach of confidence? Towards a common law right of privacy under the Human Rights Act, Mod. Law Rev. (2003)
  • M. Elliott, E. Mackey, K. O'Hara, C. Tudor, The Anonymisation Decision-Making Framework, UK Anonymisation Network, ...
  • O. Tene et al., Big data for all: privacy and user control in the age of analytics, Northwest. J. Technol. Intellect. Prop. (2013)
  • T. Elliott, 7 definitions of Big Data you should know about
  • D. Beer, How should we do the history of Big Data, Big Data Soc. (2016)
  • M. Komorowski, A history of storage cost (update)
  • A. McAfee et al., Big data: the management revolution, Harv. Bus. Rev. (2012)
  • W.W. Lowrance, Privacy, Confidentiality, and Health Research (2012)
  • M.A. Heller et al., Can patents deter innovation? The anticommons in biomedical research, Science (1998)
  • Ministry of Transport and Communications (Finland), MyData – a Nordic Model for human-centered personal data management and processing
  • L.J. Kish et al., Unpatient – why patients should own their medical data, Nat. Biotechnol. (2015)
  • C. Cuador, From street photography to face recognition: distinguishing between the right to be seen and the right to be recognized, Nova Law Rev. (2016)
  • J. Nash, Fashion statement: designer creates line of drone-proof garments to protect privacy

This article belongs to Big Data for Healthcare.

Funding: This work was supported by a Horizon 2020 grant from the European Union ICT 2014/1.