Statistical methods for assessing agreement between two methods of clinical measurement

https://doi.org/10.1016/j.ijnurstu.2009.10.001

Abstract

In clinical measurement, comparison of a new measurement technique with an established one is often needed to see whether they agree sufficiently for the new to replace the old. Such investigations are often analysed inappropriately, notably by using correlation coefficients. The use of correlation is misleading. An alternative approach, based on graphical techniques and simple calculations, is described, together with the relation between this analysis and the assessment of repeatability.

Introduction

Clinicians often wish to have data on, for example, cardiac stroke volume or blood pressure where direct measurement without adverse effects is difficult or impossible. The true values remain unknown. Instead indirect methods are used, and a new method has to be evaluated by comparison with an established technique rather than with the true quantity. If the new method agrees sufficiently well with the old, the old may be replaced. This is very different from calibration, where known quantities are measured by a new method and the result compared with the true value or with measurements made by a highly accurate method. When two methods are compared neither provides an unequivocally correct measurement, so we try to assess the degree of agreement. But how?

The correct statistical approach is not obvious. Many studies give the product–moment correlation coefficient (r) between the results of the two measurement methods as an indicator of agreement. It is no such thing. In a statistical journal we have proposed an alternative analysis (Altman and Bland, 1983), and clinical colleagues have suggested that we describe it for a medical readership.

Most of the analysis will be illustrated by a set of data (table) collected to compare two methods of measuring peak expiratory flow rate (PEFR).

Sample data

The sample comprised colleagues and family of J.M.B. chosen to give a wide range of PEFR but in no way representative of any defined population. Two measurements were made with a Wright peak flow meter and two with a mini Wright meter, in random order. All measurements were taken by J.M.B., using the same two instruments. (These data were collected to demonstrate the statistical method and provide no evidence on the comparability of these two instruments.) We did not repeat suspect readings and took a single reading as our measurement of PEFR.

Plotting data

The first step is to plot the data and draw the line of equality on which all points would lie if the two meters gave exactly the same reading every time (Fig. 1). This helps the eye in gauging the degree of agreement between measurements, though, as we shall show, another type of plot is more informative.

Inappropriate use of correlation coefficient

The second step is usually to calculate the correlation coefficient (r) between the two methods. For the data in Fig. 1, r = 0.94 (p < 0.001). The null hypothesis here is that the measurements by the two methods are not linearly related. The probability is very small and we can safely conclude that PEFR measurements by the mini and large meters are related. However, this high correlation does not mean that the two methods agree:

  • (1) r measures the strength of a relation between two variables, not the agreement between them.

Measuring agreement

It is most unlikely that different methods will agree exactly, by giving the identical result for all individuals. We want to know by how much the new method is likely to differ from the old; if this is not enough to cause problems in clinical interpretation we can replace the old method by the new or use the two interchangeably. If the two PEFR meters were unlikely to give readings which differed by more than, say, 10 l/min, we could replace the large meter by the mini meter because so small a difference would not affect decisions on patient management.
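The agreement summary described here, the mean difference (bias) and the "limits of agreement" at mean ± 2 SD of the differences, can be sketched as follows; the differences are made-up numbers for illustration only.

```python
import statistics

# Hypothetical per-subject differences, new method minus old (l/min).
diffs = [-20.0, 5.0, -10.0, 15.0, -5.0, 0.0, 10.0, -15.0, 5.0, -5.0]

d_bar = statistics.mean(diffs)     # mean difference = estimated bias
s = statistics.stdev(diffs)        # SD of the differences

# Limits of agreement: if the differences are roughly Normally
# distributed, about 95% of differences between the two methods
# are expected to lie between these limits.
lower, upper = d_bar - 2 * s, d_bar + 2 * s
print(d_bar, lower, upper)
```

Whether limits of this width are acceptable is a clinical judgement, not a statistical one: the limits are compared against the largest difference that would not change patient management.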

Precision of estimated limits of agreement

The limits of agreement are only estimates of the values which apply to the whole population. A second sample would give different limits. We might sometimes wish to use standard errors and confidence intervals to see how precise our estimates are, provided the differences follow a distribution which is approximately Normal. The standard error of d̄ is √(s²/n), where n is the sample size, and the standard error of d̄ − 2s and d̄ + 2s is about √(3s²/n). 95% confidence intervals can be calculated by finding the appropriate point of the t distribution with n − 1 degrees of freedom.
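These standard errors, √(s²/n) for the mean difference and roughly √(3s²/n) for each limit of agreement, are simple to compute. A minimal sketch (invented differences; the Normal quantile 1.96 is used here as a large-sample stand-in for the t point the text calls for):

```python
import math
import statistics

# Hypothetical per-subject differences between two methods.
diffs = [-20.0, 5.0, -10.0, 15.0, -5.0, 0.0, 10.0, -15.0, 5.0, -5.0]
n = len(diffs)
d_bar = statistics.mean(diffs)
s = statistics.stdev(diffs)

se_mean = math.sqrt(s ** 2 / n)        # SE of the mean difference
se_limit = math.sqrt(3 * s ** 2 / n)   # approximate SE of d_bar ± 2s

# Approximate 95% CI for the bias; with small n, replace 1.96 by the
# appropriate t point on n - 1 degrees of freedom.
ci_bias = (d_bar - 1.96 * se_mean, d_bar + 1.96 * se_mean)
print(se_mean, se_limit, ci_bias)
```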

Example showing good agreement

Fig. 3 shows a comparison of oxygen saturation measured by an oxygen saturation monitor and by pulsed oximeter saturation, a new non-invasive technique (Tytler and Seeley, in press). Here the mean difference is 0.42 percentage points with 95% confidence interval 0.13–0.70. Thus pulsed oximeter saturation tends to give a lower reading, by between 0.13 and 0.70 percentage points. Despite this, the limits of agreement (−2.0 and 2.8) are small enough for us to be confident that the new method can be used in place of the old for clinical purposes.

Relation between difference and mean

In the preceding analysis it was assumed that the differences did not vary in any systematic way over the range of measurement. This may not be so. Fig. 4 compares the measurement of mean velocity of circumferential fibre shortening (VCF) by the long axis and short axis in M-mode echocardiography (D'Arbela et al., unpublished). The scatter of the differences increases as the VCF increases. We could ignore this, but the limits of agreement would be wider apart than necessary for small VCF and narrower than they should be for large VCF.

Repeatability

Repeatability is relevant to the study of method comparison because the repeatabilities of two methods of measurement limit the amount of agreement which is possible. If one method has poor repeatability—i.e., there is considerable variation in repeated measurements on the same subject—the agreement between the two methods is bound to be poor too. When the old method is the more variable one, even a new method which is perfect will not agree with it. If both methods have poor repeatability, the problem is even worse.
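Repeatability of a single method can be quantified from duplicate readings. The sketch below uses invented duplicates and the standard within-subject variance estimate from pairs, s_w² = Σd²/2n, together with the British-Standards-style repeatability coefficient (this particular computation is supplied here as an illustration; it is not spelled out in the excerpt above).

```python
import math

# Hypothetical duplicate readings by ONE method on five subjects.
pairs = [(490, 500), (395, 410), (515, 505), (430, 445), (610, 600)]
n = len(pairs)

# Within-subject variance from duplicates: s_w^2 = sum(d^2) / (2n),
# where d is the difference between a subject's two readings.
sw2 = sum((x - y) ** 2 for x, y in pairs) / (2 * n)
s_w = math.sqrt(sw2)

# Repeatability coefficient: two readings on the same subject are
# expected to differ by less than about 2 * sqrt(2) * s_w for 95% of
# pairs (taking 1.96 as approximately 2).
repeatability = 2 * math.sqrt(2) * s_w
print(s_w, repeatability)
```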

Measuring agreement using repeated measurements

If we have repeated measurements by each of two methods on the same subjects we can calculate the mean for each method on each subject and use these pairs of means to compare the two methods using the analysis for assessing agreement described above. The estimate of bias will be unaffected, but the estimate of the standard deviation of the differences will be too small, because some of the effect of repeated measurement error has been removed. We can correct for this. Suppose we have two measurements by each method on each subject.
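A sketch of the correction for the case of two replicates per method, with invented data. Averaging a pair of replicates halves that method's measurement-error contribution, so half of each within-subject variance is added back to the variance of the differences between means to recover the variance of differences between single measurements (this is the standard form of the correction, stated here from general knowledge since the excerpt is truncated).

```python
import math
import statistics

# Hypothetical duplicates per subject: (A reading 1, A reading 2,
#                                       B reading 1, B reading 2).
data = [
    (490, 500, 510, 505),
    (395, 410, 400, 420),
    (515, 505, 500, 495),
    (430, 445, 450, 440),
    (610, 600, 590, 605),
]
n = len(data)

# Within-subject variance of each method from its own duplicates.
swA2 = sum((a1 - a2) ** 2 for a1, a2, _, _ in data) / (2 * n)
swB2 = sum((b1 - b2) ** 2 for _, _, b1, b2 in data) / (2 * n)

# Differences between the per-subject means of the two methods.
mean_diffs = [(a1 + a2) / 2 - (b1 + b2) / 2 for a1, a2, b1, b2 in data]
var_means = statistics.variance(mean_diffs)

# Corrected variance of differences between SINGLE measurements:
# add back half of each method's within-subject variance.
var_single = var_means + swA2 / 2 + swB2 / 2
sd_single = math.sqrt(var_single)
print(var_means, var_single, sd_single)
```

Limits of agreement for single measurements would then use `sd_single` rather than the (too small) SD of the differences between means.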

Discussion

In the analysis of measurement method comparison data neither the correlation coefficient (as we show here) nor techniques such as regression analysis are appropriate. We suggest replacing these misleading analyses by a method that is simple both to do and to interpret. Further, the same method may be used to analyse the repeatability of a single measurement method or to compare measurements by two observers.

Why has a totally inappropriate method, the correlation coefficient, become almost universally used for this purpose?

Conflict of interest

None.

References

  • J.S. Gill et al.

    Relationship between initial blood pressure and its fall with treatment

    Lancet

    (1985)
  • D.G. Altman et al.

    Measurement in medicine: the analysis of method comparison studies

    Statistician

    (1983)
  • P. Armitage

    Statistical methods in medical research

    (1971)
  • British Standards Institution, 1979. Precision of test methods I. Guide for the determination of repeatability and reproducibility for a...

This article was originally published in The Lancet 1986 327(8476) 307–310. The article is republished with permission from The Lancet.