Applied Quantitative Methods
MBA course, Montenegro
Peter Balogh PhD
firstname.lastname@example.org

6. Measures of dispersion
• In the previous part of the presentation we considered several measures of the typical, or average, value.
• The mean is widely regarded as the most important descriptive statistic.
• When references are made to the average time, the average weight or the average cost, it is generally the mean that has been calculated.
• Knowledge of the mean, the median and the mode will increase our understanding of the data but will not provide a sufficient understanding of the differences in the data.
• In many applications it is the differences that are of particular interest to us.
• In market research, for example, we are interested not only in the typical values but also in whether opinions or behaviours are fairly consistent or vary considerably.
• A niche market is defined by difference.
• Quality control, whether in the manufacturing or the service sector, is concerned with difference from the expected.
• In this part I introduce ways of measuring this variability, or dispersion, and then consider ways of comparing different distributions.
• Measures of dispersion can be absolute (considering only one set of data at a time and giving an answer in the original units, e.g. £, minutes, years), or relative (giving the answer as a percentage or proportion and allowing direct comparison between distributions).

6.1 The standard deviation
• The standard deviation is the most widely used measure of dispersion, since it is directly related to the mean.
• If you choose the mean as the most appropriate measure of central location, then the standard deviation would be the natural choice for a measure of dispersion.
• Unlike the mean, the standard deviation is not so well known and does not have the same intuitive meaning.
• The standard deviation measures differences from the mean: a larger value indicates greater overall variation.
• The standard deviation will also be in the same units as the mean (£, minutes, years) and a change of units (e.g. from £ to dollars, or metres to centimetres) will change the value.
• The application of computer packages will generally make the determination of the standard deviation a relatively straightforward procedure, but it is worth checking what version of the formula is being used (the divisor can be n or n - 1).
• I will continue to follow the practice of showing the calculations by hand, as you may still need to do them.
• Such calculations do have the additional advantage of showing how the standard deviation is related to the mean.
• The standard deviation is particularly important in the development of statistical theory, since much statistical theory is based on distributions described by their mean and standard deviation.

6.1.1 Untabulated data
• We have already seen how to calculate the mean from simple data.
• We will need this calculation of the mean before we calculate the standard deviation.
• We can again use the first 10 observations on the number of cars entering a car park in 10-minute intervals:
  10 22 31 9 24 27 29 9 23 12
• The mean of this data is 19.6 cars.
• The differences about the mean are shown diagrammatically in Figure 6.2.
• To the left of the mean the differences are negative and to the right of the mean the differences are positive.
• It can be seen, for example, that the observation 9 is 10.6 units below the mean, a deviation of -10.6.
• The sum of these differences is zero; check this by adding all the deviations.
• This summing of deviations to zero illustrates the physical interpretation of the mean as the centre of gravity, with the observations as a number of "weights in balance".
• To calculate the standard deviation we follow six steps:
  – Compute the mean: $\bar{x}$
  – Calculate the differences from the mean: $(x - \bar{x})$
  – Square these differences: $(x - \bar{x})^2$
  – Sum the squared differences: $\sum (x - \bar{x})^2$
  – Average the squared differences to find the variance: $\dfrac{\sum (x - \bar{x})^2}{n}$
  – Take the square root of the variance to find the standard deviation: $\sqrt{\dfrac{\sum (x - \bar{x})^2}{n}}$
• For the 10 car park observations the squared differences sum to 684.4, giving a variance of 68.44 and a standard deviation of about 8.27 cars (using the divisor n).

6.1.2 Tabulated discrete data
• Table 6.2, showing the number of working days lost by employees in the last quarter, typifies the tabulation of discrete data.
• We need to allow for the fact that 410 employees lost no days, 430 lost one day and so on, by including frequency in our calculations.
• In this example there are 1440 employees in total and we need to include 1440 squared differences.
• The formula for the standard deviation becomes $s = \sqrt{\dfrac{\sum f (x - \bar{x})^2}{n}}$, where $f$ is the frequency of each value $x$ and $n = \sum f$.

6.1.3 Tabulated (grouped) data
• When data is presented as a grouped frequency distribution we must determine whether it is discrete or continuous (as this will affect the way we view the range of values) and determine the mid-points.
• Once the mid-points have been determined we proceed as before, using mid-point values for x and frequencies, as shown in Table 6.4.
• The approach shown clearly illustrates how the standard deviation summarizes differences, but would be extremely tedious to perform by hand.
• Some algebraic manipulation of the formula given in Section 6.1.2 will provide a simplified formula that is easier to work with, both for calculations by hand and for the construction of spreadsheets.
• The simplified formula is usually presented as follows: $s = \sqrt{\dfrac{\sum f x^2}{n} - \left(\dfrac{\sum f x}{n}\right)^2}$
• The formula does lose its intuitive appeal but is easier to use.
• Formulae of this kind can be presented in a variety of ways. Using a formula presented in different ways should not be a problem. What you do need to be sure about are the stages required in the calculations (e.g. what columns to add) and the assumptions being made (e.g.
is n or (n - 1) being used as the divisor?). The use of this simplified formula is illustrated in Table 6.5.

6.1.4 The variance
• The variance is the square of the standard deviation, and is therefore easily calculated once the standard deviation is known.
• It is sometimes used as a descriptive measure of dispersion or variability rather than the standard deviation, but its importance lies in more advanced statistical theory.
• As we will see, you can add variances (of independent quantities) but you cannot add standard deviations.
• Variance is mentioned here for completeness.

6.2 Other measures of dispersion
• While the standard deviation is the most widely used measure of dispersion, it is not the only one.
• As we saw when looking at measures of location, different measures (mean, median and mode) are appropriate for different situations and the same is true for measures of dispersion.
• Furthermore, some of the measures of dispersion are specifically linked to certain measures of location and it would not make sense to mix and match the statistics.

6.2.1 The range
• The range is the most easily understood measure of dispersion as it is the difference between the highest and lowest values.
• If we were again concerned with the 10 observations:
  10 22 31 9 24 27 29 9 23 12
  the range would be 22 cars (31 - 9).
• It is, however, a rather crude measure of spread, being dependent on the two most extreme observations.
• It is also highly unstable as new data is added.
• If this measure is to be used, it may well be better to quote the highest and lowest figures, rather than the difference.
• The range has, however, found a number of specialist applications, particularly in quality control (range charts).
• When dealing with data presented as a frequency distribution we will not always know exactly the highest and lowest values, only the group they lie in.
• If the groups are open-ended (e.g.
60 and more), then any values used will merely be based on assumptions that we have made about the widths of the groups.
• In such cases there seems little point in quoting either the range or the extreme values.

6.2.2 The quartile deviation
• If we are able to quote a half-way value, the median, then we can also quote quarter-way values, the quartiles.
• These are order statistics like the median and can be determined in the same way.
• With untabulated data or tabulated discrete data it will merely be a case of counting through the ordered data set until we are a quarter of the way through and three-quarters of the way through and noting the values; this will give the first quartile and third quartile, respectively.
• When working with tabulated continuous data, further calculations are necessary.
• Consider for example the data given in Table 6.6 (see Table 5.6 for the determination of the median).
• The lower quartile (referred to as Q1) will correspond to the value one-quarter of the way through the data, the 11th ordered value, and the upper quartile (referred to as Q3) to the value three-quarters of the way through the data, the 33rd ordered value.

The graphical method
• To estimate any of the order statistics graphically, we plot cumulative frequency against the value to which it refers, as shown in Figure 6.4.
• The value of the lower quartile is £12 and the value of the upper quartile is £25 (to the nearest £1, which is the accuracy the scale of this graph allows).

Calculation of the quartiles
• We can adapt the median formula (see Section 5.1.3) as follows: order value $= l + i \left( \dfrac{O - F}{f} \right)$
• where O is the order value of interest, l is the lower boundary of the corresponding group, i is the width of this group, F is the cumulative frequency up to this group, and f is the frequency in this group.
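The order-value formula above can be sketched in a few lines of Python. This is a minimal illustration and not part of the original slides: the function names and the grouped data are hypothetical (the figures are not those of Table 6.6), and the sketch assumes the groups are listed in ascending order with values spread evenly within each group.

```python
def grouped_order_value(order, lower, width, cum_before, freq):
    """Apply l + i * (O - F) / f within a single group."""
    return lower + width * (order - cum_before) / freq


def find_order_value(groups, fraction):
    """Locate the group containing the required position and interpolate.

    groups: list of (lower_boundary, upper_boundary, frequency), ascending.
    fraction: 0.25 for Q1, 0.5 for the median, 0.75 for Q3, 0.95 for P95.
    """
    total = sum(freq for _, _, freq in groups)
    order = fraction * total      # O, the order value of interest
    cum_before = 0                # F, cumulative frequency up to this group
    for lower, upper, freq in groups:
        if freq > 0 and cum_before + freq >= order:
            return grouped_order_value(order, lower, upper - lower, cum_before, freq)
        cum_before += freq
    raise ValueError("fraction must lie between 0 and 1")


# Hypothetical grouped data: (lower boundary, upper boundary, frequency)
groups = [(0, 10, 5), (10, 20, 15), (20, 30, 20), (30, 50, 10)]

q1 = find_order_value(groups, 0.25)
median = find_order_value(groups, 0.50)
q3 = find_order_value(groups, 0.75)
quartile_deviation = (q3 - q1) / 2

print(q1, median, q3, quartile_deviation)  # 15.0 22.5 28.75 6.875
```

The same function gives any percentile by changing the fraction, which is how the formula is reused in Section 6.2.3.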
• The lower quartile will lie in the group '£10 but under £15' and is found by applying the formula with O = 11, l = 10 and i = 5.
• The upper quartile will lie in the group '£20 but under £30' and is found by applying the formula with O = 33, l = 20 and i = 10.
• The quartile range is the difference between the quartiles: $Q_3 - Q_1$
• and the quartile deviation (or semi-interquartile range) is the average difference: $\dfrac{Q_3 - Q_1}{2}$
• As with the range, the quartile deviation may be misleading.
• If the majority of the data is towards the lower end of the range, for example, then the third quartile will be considerably further above the median than the first quartile is below it, and when we average the two differences we will disguise this asymmetry.
• This is likely to be the case with a country's personal income distribution.
• In such circumstances, it would be preferable to quote the actual values of the two quartiles, rather than the quartile deviation.

6.2.3 Percentiles
• The formula given in Section 6.2.2 for an order value, O, can be used to find the value at any position in a grouped frequency distribution of continuous data.
• For data sets that are not skewed to one side or the other, the statistics we have calculated so far will usually be sufficient, but heavily skewed data sets will need further statistics to fully describe them.
• Examples would include some income distributions, wealth distributions and times taken to complete a complex task.
• In such cases, we may want to use the 95th percentile, i.e. the value below which 95% of the data lies.
• Any other percentile between 1 and 99 could also be calculated.
• An example of such a calculation is shown in Table 6.7.
• For this wealth distribution, the first quartile and the median are both zero.
• The third quartile is £4347.83.
• None of these statistics adequately describes the distribution.
• To calculate the 95th percentile, we find 95% of the total frequency, here 0.95 × 26 700 = 25 365, and this is the item whose value we require.
• It will be in the group labelled 'under £100 000', which has a frequency of 800 and a width of 50 000 (i.e. 100 000 - 50 000).
• Using the formula with O = 25 365, l = 50 000, i = 50 000 and f = 800 gives the 95th percentile (see Table 6.7).

6.2.4 Back to raw data
• So far this chapter has taken us from individual numbers (raw data) through ordered data to grouped data, looking at the methods used to find the measures of dispersion.
• The previous chapter did the same for measures of location.
• However, the idea of grouping the data developed when calculation had to be done by hand, or at least using slide rules and calculators.
• It was the only practical method when large amounts of data were being analysed.
• Now we have computers and suitable software, which can deal with huge amounts of data very quickly and easily, without having to make assumptions about an even spread of data within each group, or guess what the highest or lowest value was.
• Add to this that most data starts life as individual items of raw data, and you can see that most of the descriptive statistics we have been discussing can be found very easily, provided someone has recorded the data electronically.
• An example using Excel is shown as Figure 6.5.
• An example of the output from SPSS is shown as Figure 6.6.
• If you are trying to describe secondary data for which you only have tabulated data, then, of course, you have to go back to the methods we have been discussing.

6.3 Relative measures of dispersion
• All of the measures of dispersion described earlier in this chapter have dealt with a single set of data.
• In practice, it is often important to compare two or more sets of data, maybe from different areas, or data collected at different times.
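Before moving on to comparisons, the single-data-set measures described so far can be computed directly from raw data, as Section 6.2.4 suggests. The sketch below uses Python's standard `statistics` module rather than the Excel and SPSS examples in the figures (that choice is an assumption made for illustration) and applies it to the 10 car park observations. It shows both divisors for the standard deviation, since packages differ on which they use.

```python
import statistics

# The 10 car park observations used in Sections 6.1.1 and 6.2.1
cars = [10, 22, 31, 9, 24, 27, 29, 9, 23, 12]

mean = statistics.mean(cars)
variance_n = statistics.pvariance(cars)   # divisor n
sd_n = statistics.pstdev(cars)            # divisor n
sd_n_minus_1 = statistics.stdev(cars)     # divisor n - 1
data_range = max(cars) - min(cars)

# Quartiles from raw data. Packages use slightly different conventions for
# locating quartiles, so Q1 and Q3 may differ a little from hand counting.
q1, median, q3 = statistics.quantiles(cars, n=4)

print(f"mean = {mean:.1f}")                      # mean = 19.6
print(f"variance (n) = {variance_n:.2f}")        # variance (n) = 68.44
print(f"sd (n) = {sd_n:.2f}")                    # sd (n) = 8.27
print(f"sd (n - 1) = {sd_n_minus_1:.2f}")        # sd (n - 1) = 8.72
print(f"range = {data_range}")                   # range = 22
print(f"quartiles = {q1}, {median}, {q3}")
```

With only 10 observations the choice of divisor changes the standard deviation noticeably (8.27 against 8.72), which is why it is worth checking which version a package reports.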
• In Part 4 we look at formal methods of comparing the difference between sample observations, but the measures described in this section will enable some initial comparisons to be made.
• The advantage of using relative measures is that they do not depend on the units of measurement of the data.

6.3.1 Coefficient of variation
• This measure expresses the standard deviation of a set of observations as a percentage of the arithmetic mean: coefficient of variation $= \dfrac{s}{\bar{x}} \times 100\%$, where $s$ is the standard deviation and $\bar{x}$ is the mean.
• Thus the higher the result, the more variability there is in the set of observations.
• If, for example, we collected data on personal incomes for two different years, and the results showed a coefficient of variation of 89.4% for the first year and 94.2% for the second year, then we could say that the amount of dispersion in personal income data had increased between the two years.
• Even if there has been a high level of inflation between the two years, this will not affect the coefficient of variation, although it will have meant that the average and standard deviation for the second year are much higher, in absolute terms, than for the first year.

6.3.2 Coefficient of skewness
• Skewness of a set of data relates to the shape of the histogram which could be drawn from the data.
• The type of skewness present in the data can be described by just looking at the histogram, but it is also possible to calculate a measure of skewness so that different sets of data can be compared.
• Three basic histogram shapes are shown in Figure 6.7, and a formula for calculating skewness is shown below.
• A typical example of the use of the coefficient of skewness is in the analysis of income data.
• If the coefficient is calculated for gross income before tax, then the coefficient gives a large positive result since the majority of income earners receive relatively low incomes, while a small proportion of income earners receive high incomes.
• When the coefficient is calculated for the same group of earners using their after-tax income, then, although a positive result is still obtained, its size has decreased.
• These results are typical of a progressive tax system, such as that in the UK.
• Using such calculations it is possible to show that the distribution of personal incomes in the UK has changed over time.
• A discussion of whether or not this change in the distribution of personal incomes is good or bad will depend on your economic and political views; the statistics highlight that the change has occurred.

6.4 Variability in sample data
• We would expect the results of a survey to identify differences in opinions, income and a range of other factors.
• The extent of these differences can be summarized by an appropriate measure of dispersion (standard deviation, quartile deviation, range).
• Market researchers, in particular, seek to explain differences in attitudes and actions of distinct groups within a population.
• It is known, for example, that the propensity to buy frozen foods varies between different groups of people.
• As a producer of frozen foods you might be particularly interested in those most likely to buy your products.
• Supermarkets of the same size can have very different turnover figures and a manager of a supermarket may wish to identify those factors most likely to explain the differences in turnover.
• A number of clustering algorithms have been developed in recent years that seek to explain differences in sample data.
• As an example, consider the following algorithm or procedure that seeks to explain the differences in the selling prices of houses:
1. Calculate the mean and a measure of dispersion for all the observations in your sample. In this example we could calculate the average price and the range of prices (Figure 6.8).
• It can be seen from the range that there is considerable variability in price relative to the average price.
• Usually the standard deviation would be preferred to the range as a measure of dispersion for this type of data.
2. Decide which factors explain most of the difference (range) in price, for example, location, house-type, number of bedrooms.
• If location is considered particularly important, we can divide the sample on that basis and calculate the chosen descriptive statistics (Figure 6.9).
• In this case we have chosen to segment the sample by location, areas X and Y.
• The smaller range within the two new groups indicates that there is less variability of house prices within areas.
• We could have divided the sample by some other factor and compared the reduction in the range.
3. Divide the new groups and again calculate the descriptive statistics. We could divide the sample a second time on the basis of house-type (Figure 6.10).
4. The procedure can be continued in many ways with many splitting criteria.
• A more sophisticated version of this procedure is known as the automatic interaction detection technique.

Case 2: using measures of difference and performance
• Managers are likely to meet a number of measures of difference and, increasingly, various measures of performance (benchmarking, for instance, has become an important management tool, where targets are determined using the performance of the 'best' organizations on certain measures).
• Managers need to be able to respond to this type of information with insight and confidence.
• It is important for managers to clarify what these measures mean in business terms and what the underlying assumptions are.
• In the same way that you don't need to be an accountant to use accounting information, you don't need to be a statistician to use statistical information.
• Managers should look for a business understanding in the information they are given and develop responses that allow their organization to interpret and apply such information.
• Knowing the assumptions behind the measures will reveal some of the thinking of those who devised them.
• Management is a process that involves a judgement as to what is appropriate and when.
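Returning to the house price procedure of Section 6.4, the repeated splitting can be sketched in Python as follows. This is only an illustration under stated assumptions: the prices, areas and house types are invented (they are not the data behind Figures 6.8 to 6.10), the helper names `describe` and `split_by` are hypothetical, and the sketch simply recomputes the descriptive statistics after each split rather than implementing automatic interaction detection.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical sample: (price in £000s, area, house type)
houses = [
    (95, "X", "flat"), (110, "X", "flat"),
    (150, "X", "detached"), (165, "X", "detached"),
    (210, "Y", "flat"), (225, "Y", "flat"),
    (290, "Y", "detached"), (310, "Y", "detached"),
]


def describe(records):
    """Mean, range and standard deviation (divisor n) of the prices."""
    prices = [record[0] for record in records]
    return {
        "n": len(prices),
        "mean": mean(prices),
        "range": max(prices) - min(prices),
        "sd": pstdev(prices),
    }


def split_by(records, key_index):
    """Group records by the factor stored at position key_index."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key_index]].append(record)
    return dict(groups)


# Step 1: describe the whole sample
overall = describe(houses)

# Step 2: split by location and describe each area
by_area = {area: describe(recs) for area, recs in split_by(houses, 1).items()}

# Step 3: split each area again by house type
by_area_and_type = {
    (area, kind): describe(sub)
    for area, recs in split_by(houses, 1).items()
    for kind, sub in split_by(recs, 2).items()
}

print("overall range:", overall["range"])                      # overall range: 215
print("range within area X:", by_area["X"]["range"])           # range within area X: 70
print("range within area Y:", by_area["Y"]["range"])           # range within area Y: 100
print("range for Y detached:", by_area_and_type[("Y", "detached")]["range"])  # 20
```

In this invented sample the range falls from 215 overall to 70 and 100 within the two areas, and to 20 or less after the second split, which is the pattern of reduced within-group variability that the procedure looks for.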