Data handling is the study of statistics, or data. We collect, organise, analyse and interpret data. The data can inform students, researchers, advertising and business.
It can provide us with an understanding of social issues and human trends. Then we can make informed decisions when we plan for the future, or make a new advertisement, or address social issues.
We usually collect data from a fairly small group (called the sample). The sample must be big enough and it must be randomly chosen from the population. This is to make sure that it fairly represents the trends in the larger group of people (called the population).
1
Sample: Some data randomly chosen from population.

13.1 Bar graphs and frequency tables

Data can be represented with a frequency table or with a bar graph. Each bar represents a group of data and the bars can be compared to each other. Bar graphs must be labelled on one axis and show the numbering on the other axis.

e.g.1
In a Geography class, 23 learners completed a test out of 10 marks. Here is a list of their results:
4; 1; 2; 2; 6; 9; 6; 10; 6; 8; 9; 6; 7; 7; 8; 4; 6; 6; 5; 7; 9; 10; 6.
We can use a frequency table to record this data.
Frequency table

Mark out of 10  Tally  Number of learners who achieved this mark
(frequency) 
1  /  1
2  //   2
3    0
4  //  1
5   /  1
6  //// //   7
7  ///  3
8  // 2
9  /// 3
10  // 2

We can also make a bar graph to show this data. Use the marks from 1 to 10 on the horizontal axis. Use the number of learners who got that score on the vertical axis. The number of learners is the frequency.
2

13.2 Measures of central tendency

13.2.1 Ungrouped Data
Measures of central tendency are different measures of finding the ‘middle’ or ‘average’ of a set of data. The three kinds of ‘middle’ of a set of data that we use are the mean, the median and the mode.
It is advisable that we start by arranging the set of data in an ascending order before attempting questions.

  1. Mean
    The mean of the data is the average if you add all the values and divide by the number of values. We use the symbol for the mean.
    3
    This formula will be on the information sheet provided in examinations.

    e.g.2
    In a Mathematics class, 23 learners completed a test out of 25 marks.
    Here is a list of their results:
    14; 10; 23; 21; 11; 19; 13; 11; 20; 21; 9; 11; 17; 17; 18; 14; 19; 11; 24; 21; 9; 16; 6.
    Calculate the mean of this data.
Solution
4=   sum of values in set  
                 number of values in set
= 14 + 10 + 23 + 21 + 11 + 19 + 13 + 11 + 20 + 21 + 9 + 11 + 17 + 17 + 18 + 14 + 19 + 11 + 24 + 21 + 9 + 16 + 6
                                                                                       23
= 15,4347… 3 (2) 
  1. Median
    The median is the middle number in an ordered data set.

    e.g3
    In a Mathematics class, 23 learners completed a test out of 25 marks. Here is a list of their results:
    14; 10; 23; 21; 11; 19; 13; 13; 20; 21; 9; 13; 17; 17; 18; 14; 19; 13; 24; 21; 9; 16; 6.
    Calculate the median of this data.

    Solution

    • First put the data in order, from lowest to highest.
      6; 9; 9; 10; 11; 13; 13; 13; 13; 14; 14; 16; 17; 17; 18; 19; 19; 20; 21; 21; 21; 23; 24.
      There are 23 numbers, so the middle number is the12th number out of 23 numbers. So 16 is the
      median, the number in the middle of the data.
    • When there is an even number of values in the data set, the median lies halfway between the middle
      two values.
    • We can add these two values and divide by 2. For [Example], what if another learner wrote the test
      and her result was 7? We can add this to the ordered data set.
      6; 7; 9; 9; 10; 11; 13; 13; 13; 13; 14; 14; 16; 17; 17; 18; 19; 19; 20; 21; 21; 21; 23; 24.
      Now there are 24 numbers and the middle two numbers are the 12th and 13th numbers. The middle
      two numbers are 14 and 16. Add 14 and 16 to get 30 and divide by 2 to get a median of 15.
      14+16 = 15 
           2
  2. Mode
    The mode is the number or value that appears most frequently in the data set.

    e.g.4
    In a Mathematics class, 23 learners completed a test out of 25 marks.
    Here is a list of their results:
    14; 10; 23; 21; 11; 19; 13; 13; 20; 21; 9; 13; 17; 17; 18; 14; 19; 13; 24; 21; 9; 16; 6.
    Find the mode of this data.
    Solution
    The mode of the results of the test is 133 (13 appears 4 times). (1)

    Summary

    3
    median: middle score of an ordered list of data
    mode: the most frequent score 

Activity 1
The table below represents Mathematics test scores and frequency for each score.

Scores (x)  Frequency (f) 
13 5
17 6
20 4
25 10
  1. Determine the median (2)
  2. Determine the mean (2)
    [4]

Solutions

  1. Σf = 25 i.e. there are 25 scores. To determine the median, find the position of the median by adding the frequencies until you reach the position of the median.
    Median lies in position 13, hence median = 20 (2)
  2. mean = 5(13) + 6(17) + 4(20) + (10)23 = 477 = 19,08  (2)
                                          25                         25 
    [4] 

13.2.2 Grouped data

e.g.5
Fifty shoppers were asked what percentage of their income they spend on groceries.
Six answered that they spend between 10% and 19%, inclusive. The full set of responses is given in the table below.

PERCENTAGE  FREQUENCY (f) 
10 < x < 19 6
20 < x < 29 14
30 < x < 39  16
40 < x < 49  11
50 < x < 59  3
  1. Calculate the mean percentage of family income allocated to groceries.
  2. In which interval does the median lie?
  3. Determine the mode percentage of income spent on groceries.
Solutions
  1. Determine midpoints of each interval. Since we do not have the exact values in grouped data, we use these approximations
  2. Add al the frequencies to get the number of items in adata set
  3. Determine the total of all

  1. PERCENTAGE  Midpoint Interval FREQUENCY (f)  Total (fx)
    10 < x < 19 14,5 6 14,5 × 6 = 87
    20 < x < 29 24,5 14 24,5 × 14 = 343
    30 < x < 39  34,5 16 34,5 × 16 = 552
    40 < x < 49  44,5 11 44,5 × 11 = 489,5
    50 < x < 59  54,5 3 54,5 × 3 = 163,5
    Sum   n=50   Σ(fx) = 1635
    Mean = Σfx = 32,7(2)
                  50 
  2. 30 < x < 39  (median is in position 25,5 of the data. When we add frequencies from above, then position 25,5 lies in the interval 30 < x < 39) (1)
  3. 30 < x < 39  (the interval with the highest frequency) (1) 

There are 50 scores. median lies between positions 25 and 26
6+14=20
20+16=36

13.3 Measures of dispersion (or spread)

The measures of dispersion give us information about how spread out the data is around the median. The measures of central tendency give us information about the central point of the data, but we still need to know if the data is concentrated in one place, or evenly spread out.
We first look at these measures of dispersion: range and interquartile range.

  1. Range
    The range is the difference between the highest value (or maximum) and the lowest value (or minimum) in a data set.
    Range = largest value in the data set – smallest value in the data set
    e.g.6
    Find the range of the Mathematics test results:
    6; 9; 9; 10; 11; 13; 13; 13; 13; 14; 14; 16; 17; 17; 18; 19; 19; 20; 21; 21; 21; 23; 24.
    Solution
    24 – 6 = 18. So the range of the test results is 18. 
  2. The interquartile range
    • The interquartile range depends on the median. So organise the data first and find the median.
    • The data is divided into four parts (quarters, which we call quartiles). First, the median (Q2) divides the data into two halves.
    • The lower quartile (Q1) divides the data below the median (Q2) into two equal sets of data.
    • The upper quartile (Q3) divides the data above the median into two equal sets of data.
    • The difference between the lower and the upper quartile (Q3 – Q1) is called the interquartile range. This tells us how spread out the middle half of the data is around the median.
      5

e.g.7
Find the interquartile range of the Mathematics test results:
6; 9; 9; 10; 11; 13; 13; 13; 13; 14; 14; 16; 17; 17; 18; 19; 19; 20; 21; 21; 21; 23; 24.

Solution

  • We already know that the median is 16.
  • Lower half of the data has 11scores, so Q1 is the 6th data item.
    ∴ Q1 = 13
  • Upper half of the data set has 11 scores, so Q3 is the 6th score of the upper half data set ∴ Q3 = 20
    6

N/B Median is not included in the lower half and upper half of the data when calculating Q1 and Q3 

We can also use the ff formulae to calculate the position of Q1, Q2 and Q3. Position of Q2
= (n+1) =(23 + 1) = 12
       4           2
Q2 is the value in position 12 which is 16
Position of Q1
= (n + 1) = (23 + 1) = 6
        4             4
is the value in position 6 which is 13
Position of Q3
3(n+1) = 3(23+1) = 18
    4              4
Q3 is the value in position 18 which is 20

Activity 2
If the test scores in another class are represented by the data below, find the interquartile range of the test results:
6; 7; 9; 9; 10; 11; 13; 13; 13; 13; 14; 14; 16; 17; 17; 18; 19; 19; 20; 21; 21; 21; 23; 24. [8]

Solution
We already know that the middle two numbers are 14 and 16.

  • The data set has an even number of points, so the median will lie between 14 and 16 (3). Use the
    lower value 14 in the lower half and the upper value 16 in the upper half. (3)
  • Lower half: 12 numbers, so use the 6th and 7th numbers to find the lower quartile. 11 + 13 = 12 (3)
                                                                                                                                            2
  • Upper half: 12 numbers, so use the 6th and 7th numbers to find the upper quartile. 19 + 20 = 19,5 (3)
                                                                                                                                             2
    7

[8]

13.4 Five number summary and box and whisker plot

  1. Five number summary
    The five number summary is a ‘summary’ description of a data set.
    It is made of these five numbers:
    • the minimum value
    • the lower quartile
    • the median
    • the upper quartile
    • the maximum value

      e.g.8
      What is the five number summary for the set of data we have used so far?
      6; 9; 9; 10; 11; 13; 13; 13; 13; 14; 14; 16; 17; 17; 18; 19; 19; 20; 21; 21; 21; 23; 24.
    • the minimum value: 6
    • the lower quartile: 13
    • the median: 16
    • the upper quartile: 20
    • the maximum value: 24
  2. Box and whisker plot
    We can represent the five number summary on a box and whisker plot (or diagram).
    The box represents the middle half of the data (the interquartile range)
    The line in the box shows the median.
    The ‘whiskers’ show the minimum and maximum values.
    8Quartiles divide data into four equal sets of data. The longer whisker and box means that the lower 50% of the scores is more spread out than the upper 50%.
    Skewed to the right (Positively skewed) means that the upper half of the data is more spread out than the lower half.
    Skewed data
    A box and whisker plot can show whether a data set is symmetrical, positively skewed or negatively skewed. This box and whisker plot is not symmetrical because the whiskers are not the same length and the median is not in the centre of the box. The whisker on the left is a bit longer than the whisker on the right, which shows that the data on the left of the box is more spread out. The box is also longer to the right of the median than to the left of the median. We say that the data is negatively skewed. (or skewed to the left).
  3. Identification of outliers

e.g.9
Determine whether the minimum in the [Example] above is an outlier or not.

Solution
Inter-quartile range = Q3 – Q1
= 20 – 13
= 7
Q1 – 1,5 × IQR = 13 – 1,5 × 7
= 2,5
6 > 2,5 ∴ 6 is not an outlier 

HINT:
To determine outliers:

  • Determine the interquartile range
  • Determine Q1 – 1,5 × IQR
  • If the minimum < the value of Q1 – 1,5 × IQR, then it is an outlier.
  • Determine Q3 – 1,5 × IQR
  • If the maximum > Q3 – 1,5 × IQR,
    Then it is an outlier.

Activity 3

  1. These are the scores of ten students in a Science test:
    90; 85; 10; 75; 70; 60; 78; 80; 82; 80; 55; 84
    1. Draw a box and whisker diagram for the given data. (5)
    2. Determine the interquartile range. (2)
    3. State whether the data is skewed or not. (1)
    4. State whether 10 is an outlier or not. (2)
      [10]

Solutions

  1. First write all the scores in ascending order.
    10; 55; 60; 70; 75; 78; 80; 80; 82; 84; 85; 90
    Work out the five number summary:
    Minimum number: 10
    Maximum number: 90
    Median: 12 numbers, so use the 6th and 7th numbers 78+80 = 79
                                                                                               2
    Lower quartile: Use the first 6 numbers. The 3rd and 4th numbers are 60 and 70
    Upper quartile: Use the last 6 numbers. The 3rd and 4th numbers are 82 and 84.
    9
  2. Interquartile range = upper quartile – lower quartile3 = 83 – 65 = 183 (2)
  3. The data is skewed to the left (negatively skewed).  (1)
    The whisker on the left is longer, i.e. the length on the left of the box is longer than the length on the right.
  4. Interquartile range (IQR) = Q3 – Q1
    = 83 – 65
    = 18
    Q1 – 1,5 × IQR = 65 – 1,5 × 18
    = 38 338
    10 < 38 3 conclusion
    ∴ 10 is an outlier (2)
    [10] 

Activity 4

  1. The five number summary of heights of trees three months after they were planted is (23 ; 42 ; 50 ; 53 ; 75). This information is shown in the box and whisker diagram below.
    10
  2. Determine the interquartile range. (2)
  3. What percentage of plants has a height excess of 53 cm? (2)
  4. Between which quartiles do the heights of the trees have the least variation? Explain. (2)
    [6]

Solutions

  1. Interquartile range = 53 – 42 = 11(2)
  2. 25% (2)
  3. Between Q2 (50) and Q3 (53) The distance between these two quartiles is the smallest 3 (2)
    [6] 

13.5 Histograms and frequency polygons

  • Histograms and frequency polygons are graphs used to represent grouped and continuous data. They show the frequency and the distribution (spread) of the data.
  • Continuous data is data that is not just measured in whole numbers. For example, length, mass, volume or time are measured in continuous amounts.
  • The horizontal axis of a histogram and a frequency polygon have a continuous scale.
  • The vertical axis shows the frequency, or number of times the data is listed.
  1. Grouped data:

    Instead of recording every piece of data separately, we can group the data to make it easier to read. Grouped data can be represented on a histogram or a frequency polygon.
    e.g.10
    A grocer wants to record the mass of each packet of chicken pieces he sells. He groups the masses into intervals of 0,2 kg. He makes a frequency table.

    Mass of chicken in kg Number
    0,8 < mass of chicken ≤ 1,0 0
    1,0 < mass of chicken ≤ 1,2  3
    1,2 < mass of chicken ≤ 1,4 8
    1,4 < mass of chicken ≤ 1,6  6
    1,6 < mass of chicken ≤ 1,8 2
    1,8 < mass of chicken ≤ 2,0 1
    2,0 < mass of chicken ≤ 2,2 0

    These 8 packets have any mass between a bit more than 1,2 kg and 1,4 kg.
    So 1,2 < mass of chicken < 1,4

  2. Histograms
    From the frequency table, he draws up a histogram. 
    11

  3. Frequency polygons
    We can also make a frequency polygon using this data. A frequency polygon uses lines to join the mid-points of each interval. The polygon must begin and end on the horizontal axis. So we can add an interval at the beginning and the end of the data which both have a frequency of 0.
    12

The frequency polygon can also be drawn using the midpoints of the bars of the histogram, as shown below.
13
HINT:To plot a frequency polygon:

  • Plot midpoints of each interval
  • Join the midpoints by straight lines
  • Add an interval at the beginning and end of the data, with both frequencies equals to 0.
  • Frequency polygon is a closed figure, hence must start and end at the x-axis.

13.6 Cumulative frequency tables and graphs (ogives)

  1. Cumulative frequency tables
    • Cumulative frequency gives us a running total of the frequency. So we keep adding onto the frequency from the first interval to the last interval.
    • We can show these results in a cumulative frequency table.

      e.g.11


      In an English class, 30 learners completed a test out of 20 marks. Here is a list of their results:
      14; 10; 11; 19; 15; 11; 13; 11; 9; 11; 12; 17; 10; 14; 13; 17; 7; 14; 17; 13; 13; 9; 12; 16; 6; 9; 11; 11; 13; 20.
      Mark out of 20  Tally   Frequency
      (number of learners) 
      Cumulative
      frequency 
      6  /  1 1
      7  /  1 1 + 1 = 2 
      8    0 2 + 0 = 2 
      9  ///  3 2 + 3 = 5
      10  //  2 5 + 2 = 7 
      11  //// /  6 13
      12  //  2 15
      13  //// 5 20
      14  /// 3 23
      15  / 1 24
      16  / 1 25
      17  /// 3 28
      18   0 28
      19  / 1 29
      20  / 1 30
      With this data set, it would be more useful to group the data.
      Keep adding onto frequency from row before.
      For example, 7 + 6 = 13
      The last number is the same as total number of learners
      We can use intervals of 5 and make a cumulative frequency table for grouped data.
      Class interval Frequency Cumulative frequency
      1 < x < 5  0 0
      5 < x < 10  7 7
      10 < x < 15  17 24
      15 < x < 20  6 30

       

  2. Cumulative frequency graph (ogive)
    • We can represent the cumulative results from a cumulative frequency table with a cumulative frequency graph or ogive.
    • This graph always starts on the x-axis and usually forms an S-shaped curve, ending with the cumulative frequency (y-value).
    • The endpoint of each interval is plotted against the cumulative frequency.

      e.g.12

      Represent the data in the cumulative frequency table of grouped data with a cumulative frequency graph.
    • the x-axis needs the points 5; 10; 15; and 20 to mark the end of each interval.
    • the y-axis represents the cumulative frequency from 0 to 30.
    • For plotting the points, use the end of each class interval on the x-axis and the cumulative frequency on the y-axis. So you need to plot these points: (5; 0); (10; 7); (15; 24); (20; 30)
    • Join the plotted points.
      14
      To plot ogive:
    • x-coordinate - use upper limit of each interval.
    • y-coordinate - cumulative frequency
    • If the frequency of the first interval is not 0, then include an interval before the given one and make use 0 as its frequency.


      Activity 5

      An ice cream vendor has kept a record of sales for October and November 2012. The daily sales in rands is shown in the histogram below.
      15
      1.1 Draw up a cumulative frequency table for the sales over October and November. (2)
      1.2 Draw an ogive for the sales over October and November. (3)
      1.3 Use your ogive to determine the median value for the daily sales. Explain how you obtain your answer. (1)
      1.4 Estimate the interval of the upper 25% of the daily sales. (2)
      [8]

      Solutions
      1.1 Cumulative frequency table:
      Daily sales (in rand) Frequency Cumulative frequency
      60 ≤ rand < 70   5 5
      70 ≤ rand < 80  11 16
      80 ≤ rand < 90  22 38 1st three correct 
      90 ≤ rand < 100  13 51
      100 ≤ rand < 110  7 58
      110 ≤ rand < 120  3 61 last three correct 
      1.2 
      16
      1.3 There are 61 data points, so the median is the 31st data point.
      We can read the data point off the graph at 31. It gives a rand value
      of R87. 3 (1)
      1.4 The upper 25% lies above 75% of 61 = 45,75. 3
      Read from the y-axis across to the graph and down to the x-axis.
      The upper 25% of sales lies in the interval: 96 ≤ sales < 120 3(2)
      [8]
      • 1st three points plotted correctly
      • last three points plotted correctly
      • grounding at 0

      We can find the median, the range and the interquartile range from a cumulative frequency graph.
      We cannot find the mean from a cumulative frequency graph

13.7 Variance and standard deviation

Sometimes the mean is a more useful measure of central tendency than the median.
The measures of dispersion (spread) around the mean are called the variance and the standard deviation.

1. Standard deviation
The standard deviation is the square root of (the sum of the squared differences between each score and the mean divided by the number of scores). The formula for standard deviation is:

17=  where x is each individual value, x is the mean and n is the number of values. The symbol sigma Σ means ‘the sum of’.      
This formula will be on the data sheet. Make sure you can use the formula properly. 

1.1 Calculating the standard deviation using the formula:
17

  1. Find the mean of all the numbers in the data set.
  2. Find each value of 22 In other words, work out by how much each of these values differs (or deviates) from the mean.
  3. Square each deviation. Find each value of 18
  4. Add all the answers together. In other words, find 19
  5. Divide this sum by the number of values, n.
  6. You have now found 20This value is called the variance.
  7. Find the square root of the variance to find the standard deviation. 21
  8. By working through these steps, you have found the standard deviation using the formula.

e.g.13
Finding the variance and standard deviation
These are the results of a mathematics test for a Grade 11 class of 20 students.
52 44 62 66 60 57 95 78 71 62
100 69 62 72 73 55 32 83 78 80

  1. Calculate the mean mark for the class. (2)
  2. Complete the table below and use it to calculate the standard deviation of the marks. (3)
  3. What percentage of the students scored within one standard deviation of the mean? (2)

Solutions
1.23
2.

Mark obtained (%)  22 18
52 52 52 – 67,55
= –15,55 
(–15,55)2
= 241,8 
44 –23,55  554,6 
62 –5,55 30,8 
66 –1,55  2,4 
60 –7,55  57,0 
57 –10,55   111,3 
95 27,45  753,5 
78 10,45 109,2
71 3,45 11,9
62 –5,55 30,8
100 32,45 1053,0
69 1,45 2,1
62 –5,55 30,8
72 4,45 19,8
73 5,45 29,7
55 –12,55 157,5
32 –35,55 1263,8
83 15,45 238,7
78 10,45 109,2
80 12,45 155,0 
    4 962,9

σ = √4 962,9 = 15, 7526...
            20 
= 15,75 answer
(correct to 2 decimal places)

3. One standard deviation from the mean lies between
24
16 scores lie in the interval (51,8; 83,3)
16 out of 20 of the marks lie within one standard deviation of the mean, 16/20 × 100 = 80%
Answer: 80% of the students’ marks lie within one standard deviation from the mean. 

The squaring of    deals with the effect of the negative signs.
At the end, we find the square root of the whole answer to ‘reverse’ the effect of the square.

1.2 Steps for calculating the standard deviation with a scientific calculator:
Using a Casio fx–82 ES PLUS calculator: press Mode then STAT then 1 – VAR

  • enter all data one at a time pressing = after each entry.
  • press the orange AC button
  • press shift STAT then VAR
  • in order to calculate the mean press 2: 25
  • once all these steps have been completed, simply press AC shift STAT then VAR
  • now press 3: σ to calculate the standard deviation.

If you understand the calculator steps and use them properly, you will come to the same answer of 15,75 as we found before.
Practise these steps so that you can do exam Examples using a calculator.

  • the interquartile range measures a spread around the median, so it has to do with the positions of data and not their actual values.
  • the standard deviation measures a spread around the mean, using the actual values of the data and not just their positions


Activity 6
The data below shows the energy levels, in kilocalories per 100 g, of 10 different snack foods.
440 520 480 560 615 550 620 680 545 490

  1. Calculate the mean energy level of these snack foods. (2)
  2. Calculate the standard deviation. (2)
  3. The energy levels, in kilocalories per 100 g, of 10 different breakfast cereals had a mean of 545,7 kilocalories and a standard deviation of 28 kilocalories. Which of the two types of food show greater variation in energy levels?
    What do you conclude? (2)
    [6]

Solutions

  1. Mean = 5500 = 550 kilocalories (2)
                   10
  2. σ = 69,03 kilocalories 33 (2)
  3. Snack foods have a greater variation. 3 The standard deviation for
    snack foods is 69,03 kilocalories whilst the standard deviation for
    breakfast cereals is 28 kilocalories. i.e energy levels of breakfast cereals
    is spread closer to the mean than in those of the snack food. 3 (2)
    [6] 

2. The normal distributive curve
The data can be plotted on a graph that shows the standard deviations. If the data is distributed symmetrically around the mean, the values form a normal distribution curve:
26

13.8 Bivariate data and the scatter plot (scatter graph)

  • A scatter plot is a graph using the x- and y-axes to represent bivariate data.
  • Bivariate data means that each point on the graph represents two variables that are independent of each other.
  • In a scatter plot, we plot a point for each pair of coordinates and look at the overall pattern or trend in the data.
  • The points in the data are compared to see if there is a correlation or some kind of pattern (or trend) in the data.
  • When a point does not fit the trend of the other points, it is called an outlier.
  • Outliers are easy to identify on a scatter diagram or a box and whisker diagram.
  • We can sometimes represent the trend in the data with a line or curve of best fit. The line or curve can be represented by an equation that could be linear, quadratic, exponential, hyperbolic etc.

e.g.14
A science teacher compares the marks for the mid-year examination with the marks for final examinations achieved by 11 learners.

mid-year
marks 
80  68 94 72 74 83 56 68 65 75 88
final marks 72 71 96 77 82 72 58 83 78 80 92
  1. Draw a scatter graph of this data. (3)
  2. Describe the curve of best fit. (2)
  3. Use the scatter plot to estimate the final mark of a learner who had a midyear mark of 75%. (1)
    [6]

Solutions


  1. 27
    all points plotted correctly
  2. The ‘curve’ or line of best fit is a straight line. There should be about five dots
    above the line and five dots below the line. 
  3. A line from 75 on the x-axis to the trendline takes us to about 78 on the y-axis. So
    we can predict that a learner with a midyear mark of 75% can expect to get about
    78% in the final exam.
    [6] 


Activity 7
The outdoor temperature (in °C) at noon is measured. It is compared with the number of units of electricity used to heat a house each day.

Temp in °C  11 9 2 4 7 0 10 5 3
units of electricity used  32  20 27 37 32 28 41 23 33 36
  1. Draw a scatter graph to represent this data. (3)
  2. Draw in a line of best fit. (1)
  3. Use your line of best fit to estimate the noon temperature when 30 units of electricity are used. (1)
    [5]

Solutions

  1. Graph (3)
  2. Line of best fit (1)
    28
  3. If the noon temperature is 6,25°C, about 30 units 3 of electricity will probably be
    used in the house. (1)
    [5] 

13.9 The linear regression line (or the least squares regression line)

The line of best fit for a set of bivariate numerical data is the linear regression line. So far, we have seen this trend line on a scatter graph.
Now we use a scientific calculator to determine the equation for this line.
We know the straight line equation: y = mx + c
Statistics (as used on the CASIO x-82ES PLUS calculator) uses y = A + Bx, where B is the gradient and A is the cut on the y-axis of the straight line of best fit.
So the gradient is B instead of m and the y-intercept is A instead of c.
The Regression Coefficient ‘r’
This is a statistical number that measures the strength of the correlation (relationship) between two sets of data.

  • This number is calculated from two sets of data using a calculator.
  • r always lies between –1 and +1.
  • The closer r is to –1, the stronger the negative correlation.
  •  The closer r is to +1, the stronger the positive correlation.
  • If r = 0, there is no correlation between the two sets of data.

The number line shows the r values and the strength of the correlation between bivariate data.
29We only study the r value of bivariate data when the line of best fit is a straight line.

A negative correlation means that as x increases, y decreases. The more closely the points are clustered together around the line, the stronger the correlation.
30
A positive correlation means that as x increases, y also increases A correlation of zero means that there is no relationship between x and y.
31

e.g.15
A diesel engine turns at a rate of x revolutions per minute. The corresponding horse power of the engine is measured by y in the table below:

x (revolutions per minute) 400  500 600 700 750
y (horse power)  580 1030 1420 1880 2100
  1. Find the equation of the least squares regression line: y = A + B x (correct to two decimal places).
  2. Determine the regression coefficient r. Discuss the correlation between x and y.
  3. Use this regression line to estimate the power output when the engine runs at 800 rev per min.
  4. About how fast is the engine running when it has an output of 1 200 horse power?
  5. A strong negative
Solutions 

1. Use a calculator

  • Mode 2: STAT
  • 2: A + B x
  • Enter the x values first:
  • 400 =; 500 =; 600 =; 700 =; 750 =
  • Use arrows to move right to y column and up to start next to 400.
  • Enter y values:
  • 580 =; 1030 =; 1420 =; 1880 =; 2100=
  • Press (orange) AC button
  • Press SHIFT STAT (at 1)
  • Press 5: Reg
  • Press 1: A = and get –1145,792683

This is the y-intercept of the regression line

  • Press orange AC button
  • Press SHIFT STAT
  • Press 5: Reg
  • Now press 2: B = and get 4,318292683
    This is the gradient of the regression line

Answer:
The least squares regression line:
y = –1 145,8 + 4,32 x (correct to 2 decimal places)

2. Keep all the information in the calculator from 1.

  • Press AC
  • SHIFT STAT
  • 5:REG
  • Then 3: r = 0,9996821357

There is a strong positive correlation between x and y (r is very close to +1)

3. Substitute x = 800 into line of best fit equation:
y = –1 145,8 + 4,32(800)
y = 2 310,2
800 revolutions will generate 2 310,2 output of horse power.

4. Let y = 1 200
1 200 = –1 145,8 + 4,32 x
1 200 + 1 145,8 = 4,32 x
2 345,8 = 4,32 x
2 345,8 = x
   4.32
543,0092593 = x
There are about 543 revolutions per min for an output of 1 200 horse power.

Remember to change the MODE back to 1:COMP when doing normal calculations

The scatter plot and the line of best fit show the trend in the relationship between the revolutions and the horse power.
32


Activity 8

  1. Pick ’n Pay wants to survey how long in seconds (y) it takes a teller to scan (x) items at the till.
    The table shows the results from 9 shoppers.
    Shoppers  A B C D E F G H I
    x (no of items  5 8 12 15 15 17 20 21 25
    y (time in seconds) 3 11 9 6 15 13 25 15 13
    1. Use your calculator to determine the equation of the line of best fit (the regression line or the least squares regression line) correct to two decimal places. (3)
    2. Calculate the value of r, the correlation coefficient for the data. What can you say about the correlation between x and y? (3)
    3. How long would the teller take to scan 21 items at the till? (2)
    4. How many items could a teller scan in 21,28 seconds? (2)
  2. A restaurant wants to know the relationship between the number of customers and the number of chicken pies that are ordered.
    number of customers (x)  10 15 20 25 30 35 40
    number of chicken pies (y) 3 5 10 10 15 20 20 24
    1. Determine the equation of the regression line correct to two decimal places. (3)
    2. Determine the value of r, the correlation coefficient. Describe the type and strength of the correlation between the number of people and the number of chicken pies ordered. (3)
    3. Determine how many chicken pies 100 people would order. (2)
    4. If they only have 12 pies left, how many people can they serve? (2)
      [20]

      Solutions

        1. A = 2,68 
          B = 0,62 
          y = 2,68 + 0,62x (3)
        2. r = 0,62847…. = 0,63 
          This is a weak positive correlation  (3)
        3. y = 2,68 + 0,62(21)  = 15,7
          (about 16 seconds)  (2)
        4. 21,28 = 2,68 + 0,62 x 3
          21,28 – 2,68 = 0,62 x
          18,6 =  x
          0.62
          30 = x
          30 items can be scanned in 21,28 seconds.  (2)
        1. A = –0,39285… 
          B = 0,61190 
          y = –0,4 + 0,6 x 3 (3)
        2. r = 0,9866… 
          This is a very strong positive correlation 
          (r is close to +1) (3)
        3. y = –0,4 + 0,6 x
          y = –0,4 + 0,6(100) 
          y = 59,6
          About 60 chicken pies are ordered by 100 3 people. (2)
        4. 12 = –0,4 + 0,6 x 3
          12 + 0,4 = 0,6 x
          12,4 = x
           0.6
          20,6… = x
          About 21 people will order 12 pies.  (2)
          [20]
  3. A recording company investigates the relationship between the number of times a CD is played by a national radio station and the national sales of the same CD in the following week. The data below was collected for random sample of CDs. The sales figures are rounded to the nearest 50

    Number of times CD is played  47 34 40 34 33 50 28 53 25 45
    Weekly sales of
    the CD 
    3950 2500 3700 2800 2900 3750 2300 4400 2200 3400
    1. Identify the independent variable. (1)
    2. Draw a scatter plot of this data. (3)
    3. Determine the equation of the least squares regression line. (3)
    4. Calculate the correlation coefficient. (2)
    5. Predict, correct to the nearest 50, the weekly sales for a CD that was played 45 times by the station in the previous week. (2)
    6. Comment on the strength of the relationship between the variables. (1)
      [12]

      Solutions
      a) the number of times the CD is played (1)

      b)
      33
      (3)

      c) a = 264,326 
      b = 75,21 
      y = 264,33 + 75,21x 3 (3)

      d) r = 0,95  (2)

      e) y = 264,33 + 75,21x(45) (substitution)
      ≈ 3 648,78
      ≈ 3 648
      ≈ 3 650 (to the nearest 50) (2)

      f) There is a very strong positive relationship between the number of times that a CD was played and
      the sales of that CD in the following week. (1)
      [12] 

What you need to be able to do:

  1. Determine mean, median and mode in grouped or ungrouped data
  2. Draw and analyse the following methods for representing data:
    • box-and-whisker plots
    • histograms
    • frequency polygons
    • cumulative frequency curves (ogives)
    • Calculate the variance and the standard deviation of a set of ungrouped data.
    • Comment on whether a data set is symmetric or skewed, by analysing the representation of the data.
    • Identify outliers in a set of data by looking at the box-and-whisker plot or scatterplot.
    • Determine the equation of the line of best fit of bivariate data using a calculator. (This line could be called the “least squares regression line”.)
    • Determine the regression correlation coefficient “r”.
    • Use the line of best fit to draw conclusions.
Last modified on Wednesday, 08 September 2021 11:49