Practical Issues in Sample Size Determination for Correlation Coefficient Inference

SM Journal of

Biometrics &

Biostatistics

Gr up

How to cite this article Looney SW. Practical Issues in Sample Size Determination

for Correlation Coefcient Inference. SM J Biometrics Biostat. 2018; 3(1): 1027.

OPEN ACCESS

ISSN: 2573-5470

Introduction

An important issue in planning any study that will require inference for a correlation coecient

is the determination of the appropriate sample size to use. e usual procedure of choosing n based

on the power of the test of the hypothesis that the population correlation is zero oen results in

correlations of little practical importance being declared "signicant," and condence intervals that

are too wide to be of any practical use. In this article, alternative methods for determining sample

size are presented and compared to the "usual" procedure.

Example

Suppose that one is planning a study involving bivariate normal data in which statistical

inference is to be performed for a Pearson Correlation Coecient (PCC) and that the consensus of

previous research in the area is that the population correlation is no smaller than 0.40. Reference

to sample size tables for the "usual" t-test of the correlation coecient [1] indicates that a sample of

n= 46 will yield 80% power for detecting departures from zero as small as ρ = 0.40 when α = 0.05

(Table 1).

For the sake of argument, suppose that the value of the sample PCC (denoted here aer by r)

from a subsequent sample of 46 is exactly equal to 0.40. is yields a 2-tailed p-value of 0.006 and a

95% condence interval of (0.12, 0.62). Although these results indicate statistical signicance, their

practical signicance is unclear because the condence interval is too wide to draw any reasonable

conclusion about the true magnitude of ρ. For example, Hebel and McCarter [2] classify 0.0 ≤ |ρ |

≤ 0.2 as "negligible," 0.2 < | ρ | < 0.5 as "weak," 0.5 ≤ | ρ | ≤ 0.8 as "moderate," and 0.8 < | ρ | ≤ 1.0 as

"strong." us, using their classication scheme, all we can conclude from a condence interval of

(0.12, 0.62) is that ρ is somewhere between "negligible" and "moderate" (inclusive). If one prefers to

interpret correlation coecients in terms of eect size, Cohen [1] suggests that one classify | ρ | = 0.1

as a "small" eect size, | ρ | = 0.3 as "medium," and | ρ | = 0.5 as "large." Using this scheme, all that a

condence interval of (0.12, 0.62) tells us is that the eect size of | ρ | is somewhere between "small"

and "large" (inclusive).

One of the alternative approaches proposed in this article is to select n on the basis of the desired

width of the resulting Condence Interval (C.I.) for ρ rather than the power of the test of H

: ρ= 0.

For the aforementioned example, Table 2 indicates that a sample size of n = 273 is required to yield

a 95% C.I. of width 0.20 using a "planning value" of r = 0.40. Assuming that a value of exactly r =

0.40 is obtained from a subsequent sample of 273, the resulting p-value is <0.001 and the 95% C.I.

is (0.30, 0.50). While this result also indicates statistical signicance, the C.I. is suciently narrow

to indicate that the population correlation between the two variables is "weak" according to the

classication scheme of Hebel and McCarter.

Background

e diculty described in the previous section arises primarily from the fact that H

: ρ= 0 is

not the appropriate null hypothesis to test in most situations that require inference for a single

Research Article

Practical Issues in Sample Size

Determination for Correlation

Coecient Inference

Stephen W Looney*

Department of Population Health Sciences, Augusta University, USA

Article Information

Received date: Mar 12, 2018

Accepted date: Mar 19, 2018

Published date: Mar 22, 2018

*Corresponding author

Stephen W Looney, Department of

Population Health Sciences, Augusta

University, USA, Tel: (706) 721 4846;

Email: [email protected]

Distributed under Creative Commons

CC-BY 4.0

Keywords Condence Interval; Fisher

Z-Transform; Hypothesis Test; Interval

Width; Kendall Coefcient; Pearson

Correlation; Spearman Correlation;

Statistical Power

Abstract

Determination of the appropriate sample size to use when performing inference for a single Pearson

correlation coefcient ρ is usually based on achieving sufcient power for the test of H

: ρ = 0. However, sample

sizes found using this method can yield condence intervals that are so wide that they provide very little useful

information about the magnitude of the population correlation. Alternative approaches for determining the

appropriate sample size are proposed and compared to the "usual" method.

Citation: Looney SW. Practical Issues in Sample Size Determination for

Correlation Coefcient Inference. SM J Biometrics Biostat. 2018; 3(1): 1027.

Page 2/4

Gr up

correlation coecient. It is usually of little interest to determine if

there is sucient evidence to conclude that ρ = 0. (An exception would

be a study in which the primary null hypothesis is that variables X and

Y are independent and X and Y can be assumed to have a bivariate

normal distribution.) Most investigators would not proceed with a

study unless they had sucient reason to believe that the population

correlation is non-zero, even though no formal statistical hypothesis

test had ever been performed. Other authors agree with our assertion:

Strike [4] argues that the test of ρ = 0 is "utterly redundant" and

Shoukri [5] asserts that a test of H

: ρ = 0 is "meaningless."

Another problem with testing H

: ρ = 0 is that the usual t-test

oen rejects the null hypothesis for small values of the sample

correlation, even when the sample size is small to moderate (Table

3). For example, when n = 30, a sample value of r = 0.361 ("weak,"

according to [2]) yields p = 0.05 (but note the extremely wide C.I. of

(0.001, 0.638)). For a sample of n = 100, a correlation of only 0.197

yields p = 0.05. is correlation is "negligible" according to [2] and

would be classied as a small eect size by Cohen [1]. Since the

width of a C.I. for ρ varies inversely with both the sample size and

the magnitude of r, the combination of small r and small to moderate

n oen yields C.I.'s that are too wide to be of any practical use, even

though p < 0.05.

A further concern is that the sample sizes required to yield 80%

or 90% power for testing H

: ρ = 0 are generally too small to yield

C.I.'s of a usable width, even when the sample correlation is large

(Table 1). For example, a sample of only n = 6 is required to achieve

80% power against the alternative value ρ

= 0.90. Assuming that the

value of r from a subsequent sample of n = 6 is exactly equal to 0.9

yields p = 0.015 and a 95% C.I. of (0.33, 0.99) (Table 1), which merely

indicates that ρ is somewhere between "weak" and "strong" (inclusive)

according to [2]. If Cohen's Classication based on eect size is used,

this interval only tells us that the eect size of ρ is at least "medium"

in magnitude [1].

Alternative Approaches

One alternative to testing H

: ρ = 0 is to specify another null value.

Sometimes one is interested primarily in determining if the sample

results are consistent with some relevant non-zero hypothesized

value; for example, the smallest value of the correlation that would

be considered to be clinically meaningful. Such a value may be

determined from examining previously published research in the

area, from published guidelines or recommendations, from the

clinical judgment and expertise of the research team, etc. Applying

the Fisher z-transform to r,

yields a new random variable that has an approximate normal

distribution with mean 0 and variance 1/(n - 3). is result can be

used to derive a test statistic for testing H

: ρ= ρ

(1)

Table 3: Condence intervals corresponding to minimum value of r required to

yield p ≤ 0.05 when testing H

: ρ= 0 (2-tailed test).

Sample Size Minimum r Yielding p ≤ 0.05 95% C.I.(ρ) Width of C.I./2

10 0.632 (0.004, 0.903) 0.45

20 0.444 (0.002, 0.741) 0.37

30 0.361 (0.001, 0.638) 0.319

40 0.312 (0.001, 0.568) 0.284

50 0.279 (0.001, 0.517) 0.259

60 0.255 (0.001, 0.478) 0.239

70 0.235 (0.000, 0.445) 0.223

80 0.22 (0.000, 0.419) 0.21

90 0.208 (0.001, 0.398) 0.199

100 0.197 (0.001, 0.379) 0.19

120 0.18 (0.001, 0.348) 0.174

150 0.161 (0.001, 0.313) 0.157

180 0.147 (0.001, 0.287) 0.144

200 0.139 (0.000, 0.272) 0.136

(

n 3

z r)- z( )

−

( )

1 1 r

log ,

2 1 r

z r

−

Table 1: Sample size required to achieve specied power (1 -β) for detecting ρ =

ρ1 when testing H

: ρ=0 using α = 0.05 (2-tailed test).

ρ1

Sample Size 95% CI if r=ρ1 2-tailed ρ-value if r=ρ1

β=0.10 β=0.20 β=0.10 β=0.20 β=0.10 β=0.20

0.1 1047 783 (0.04,0.16) (0.03,0.17) 0.001 0.005

0.2 259 194 (0.08,0.31) (0.06,0.33) 0.001 0.005

0.3 113 85 (0.12,0.46) (0.09,0.48) 0.001 0.005

0.4 62 46 (0.17,0.59) (0.12,0.62) 0.001 0.006

0.5 37 28 (0.21,0.71) (0.16,0.74) 0.002 0.007

0.6 24 18 (0.26,0.81) (0.18,0.83) 0.002 0.009

0.7 16 12 (0.31,0.89) (0.21,0.91) 0.003 0.011

0.8 11 9 (0.38,0.95) (0.29,0.96) 0.003 0.01

0.9 7 6 (0.46,0.99) (0.33,0.99) 0.006 0.015

The sample sizes in this table were obtained from Cohen [1].

Table 2: Minimum Sample Size Required to Yield a 95% C.I.(ρ) of Specied

Width

Width of 95% CI(r)

0.1 0.2 0.3 0.4

0.1 1507 378† 168† 95†

0.2 1417 355 159 90†

0.3 1274 320 143 81

0.4 1086 273 123 70

0.5 867 219 99 57

0.6 633 161 74 43

0.7 404 105 49 30

0.8 205 56 28 18

0.9 63 21 14 11

*e sample sizes in this table were obtained using the methods

of Bonett and Wright [3].

†Condence intervals based on these values of n and r are

guaranteed to contain ρ = 0, resulting in a failure to reject H

: ρ= 0.

Citation: Looney SW. Practical Issues in Sample Size Determination for

Correlation Coefcient Inference. SM J Biometrics Biostat. 2018; 3(1): 1027.

Page 3/4

Gr up

where n is the sample size, z(r) is the Fisher z-transform applied

to the sample value of the PCC, and z (ρ

) is the Fisher z-transform

applied to the hypothesized value of the PCC, which can be any ρ

such that | ρ

| < 1. (Most commonly, ρ

= 0, in which case z (ρ

) =z

(0) =0.) e value of z

in (1) is then evaluated against the standard

normal distribution to obtain an approximate p-value.

In some instances, there may be no non-zero null value ρ

that is

of primary interest. In this case, one could use the cutos advocated

by Hebel and McCarter [2] (or other cutos that make sense in the

context of the applied problem) to get a sense of the magnitude of

the correlation in the population under study. For example, if it is

known that ρ is positive, then one could test H

: ρ≤ 0.8 to determine

if the population correlation is "strong" or H

: ρ ≤ 0.2 to determine

if the population correlation is "non-negligible" using the Hebel and

McCarter criteria. Tables similar to Table 1 could then be constructed

for these values of ρ

. Alternatively, using the same notation as in

Table 1, the following formula could be used for a 1-tailed test:

(2)

where z

denotes the upper

-percentage point of the standard

normal and z(ρ) denotes the Fisher z-transform of ρ.

Consider the example discussed previously in which one wishes

to determine the appropriate sample size to use for a future study in

which the PCC is of primary interest and there is reason to believe

that ρ is positive and no smaller than 0.40. One approach would be

to test H

: ρ≤0.2; if this hypothesis is rejected, then one can conclude

that the population correlation is non-negligible according to [2]. A

calculation using Equation (2) indicates that samples of 130 and 179

would be required to achieve 80% and 90% power, respectively, to

detect ρ

= 0.40 using an upper-tailed test.

Suppose that a subsequent sample of n = 130 yielded a sample

value of exactly r = 0.40; the corresponding upper-tailed p-value for

the test of H

: ρ ≤ 0.2 is 0.006 and the one-sided 95% C.I. is (0.27, 1.00).

us, one can conclude that the population correlation is signicantly

greater than 0.2 (p < 0.001) and can be classied as "non-negligible”

according to [2]. In the example in which n = 30 and r = 0.361, the

p-value for the test of H

: ρ < 0.2 is 0.181, insucient evidence to

conclude that the population correlation is "non-negligible," despite

the fact that the test of H

: ρ=0 indicated that the result is “signicant.”

Another alternative is to focus one's attention on condence

interval estimation of the population correlation (derived using the

Fisher z-transform of r) instead of the test of a particular hypothesized

null value. is approach is consistent with the emphasis placed

on condence interval estimation over hypothesis testing by many

authors [6-8] Table 2 can be used to determine the sample size

required to obtain a 95% C.I. for ρ of a desired width. (is approach

was illustrated in a previous section).

Discussion

e purpose of this article is to illustrate some of the practical

problems encountered when attempting to determine the appropriate

sample size to use when a proposed study requires inference for a

single correlation coecient. e argument is made that H

: ρ= 0

is usually not the appropriate null hypothesis to test and that using

sample sizes that yield a desirable level of power (say, 80% or 90%)

for this test can result in C.I.'s that are so wide that they provide

very little useful information about the magnitude of the population

correlation. Two alternative approaches were proposed: (1) testing

null values other than ρ

= 0 and (2) determining the sample size so as

to achieve a certain level of precision of the estimate of ρ, as measured

by the width of the resulting C.I. Depending on the purpose of the

statistical analysis, either or both of these approaches could be a

useful alternative to the "usual" method of determining sample size.

However, it must be noted that the sample sizes required for either

of these approaches oen will be much larger than those required

to achieve acceptable power when testing H

: ρ= 0 . In the example

considered in a previous section, the “usual” approach based on

testing H

: ρ= 0 yielded a sample of size n = 46, the approach based

on testing H

: ρ ≤ 0.2 yielded n = 163 and the “C.I.” approach yielded

n = 273.

e alternative approaches described in this article could also be

applied if one were performing inference for the Spearman Correlation

Coecient (SCC) or the Kendall Coecient of Concordance (KCC).

For example, using the same notation as in Equation (2) for a one-

tailed test of the KCC, the "z-transform" developed by Fieller, Hartley,

and Pearson [9] for the KCC yields the sample size formula.

Where τb

and τb

are the null and alternative values of the KCC,

respectively (τb1 > τb0). For the SCC, the Fieller, Hartley and Pearson

z-transform yields.

(3)

where ρs

and ρs

are the null and alternative hypothesized values

of the SCC, respectively (ρs

>ρs

). Using the improved z-transform

of the SCC proposed by Bonett and Wright [3], the formula in (3)

becomes

(4)

Bonett and Wright recommend that (4) be used for | ρ

| < 0.95

and that (3) be used if | ρ

| ≥0.95.

We are not necessarily advocating the use of the Hebel and

McCarter criteria [2] for interpreting the magnitude of correlation

coecients. While we have found these to be useful in our own

exploratory analyses of biomedical data, they may not be appropriate

in other areas of investigation. However, we encourage the

development and use of such guidelines because we feel that they can

greatly enhance one's ability to interpret and communicate results

to non-statisticians (e.g., see the guidelines proposed by Landis and

Koch [10] for interpreting agreement coecients. And the guidelines

proposed by Fleiss et al. [11] for interpreting intra-class correlation

coecients).

( ) ( )

z z

4 0.437

z z

α β

 

= +

 

τ − τ

 

b1 b0

( ) ( )

s1 s0

z z

3 1.06 ,

z z

ρ ρ

α β

 

= +

 

−

 

( ) ( )

s1 s0

z z

3 1 .

2 z z

ρ ρ

α β

 

= + +

 

−

 

1 0

3 ,

( ) ( )

z z

α β

ρ ρ

 

= +

 

−

 

Citation: Looney SW. Practical Issues in Sample Size Determination for

Correlation Coefcient Inference. SM J Biometrics Biostat. 2018; 3(1): 1027.

Page 4/4

Gr up

Simply reporting that a correlation is "signicant" just because

the p-value for the test of H

: ρ= 0 is less than 0.05 is generally not

sucient.

Acknowledgement

Some of this work was completed while the rst author was Acting

Senior Research Fellow in Epidemiology, Department of Medicines

Management and Visiting Professor, Department of Mathematics,

Keele University, Staordshire, England. is research was partially

supported by Wellcome Research Travel Grant #0816.

References

1. Cohen J. Statistical Power Analysis for the Behavioral Sciences. Hillsdale,

NJ: Lawrence Erlbaum Associates. 1988; 79-80,101.

2. Hebel JR, McCarter RJ. A Study Guide to Epidemiology and Biostatistics.

Burlington, MA: Jones and Bartlett Learning. 2012; 90.

3. Bonett DG, Wright TA. Sample size requirements for estimating Pearson,

Kendall, and Spearman correlations. Psychometrika. 2000; 65: 23-28.

4. Strike PW. Assay method comparison studies, in Measurement in Laboratory

Medicine: A Primer on Control and Interpretation, Oxford, UK: Butterworth-

Heinemann. 1996; 170.

5. Shoukri MM. Measures of Interobserver Agreement and Reliability. Boca

Raton, FL: CRC Press 2011; 92.

6. Gardner MJ, Altman DG. Condence intervals rather than P values:

Estimation rather than hypothesis testing. British Medical Journal 1986; 292:

746-750.

7. Hahn GJ, Meeker WQ. Statistical Intervals. New York: John Wiley & Sons.

1991; 39-40.

8. Schmidt F. Statistical signicance testing and cumulative knowledge in

psychology: Implications for training of researchers. Psychological Methods.

1996; 1: 115-119.

9. Fieller EC, Hartley HO, Pearson ES. Tests for rank correlation coefcients. I.

Biometrika 1957; 44: 470-481.

10. Landis JR, Koch GG. The measurement of observer agreement for categorical

data. Biometrics. 1975; 33: 159-174.

11. Fleiss JL, Levin B, Paik MC. Statistical Methods for Rates and Proportions,

Hoboken, NJ: John Wiley & Sons. 2003; 604.