The purpose of this practical is:
1. To highlight the importance of looking at, and thinking about, your data before trying to do too much with it:
what we might call exploratory data analysis.
2. To introduce you to a specific technique - 'nearest neighbour' linkage - which enables us to create classifications
of seemingly diverse phenomena.
3. To use this technique to explore aspects of climatic variability within the British Isles.
List of Figures
Fig. 1: Map of the British Isles, showing the location of the 16 stations from which the initial climate data are taken.
Fig. 2: Scatterplot showing the initial 16 data pairs.
Fig. 3: Revised scatterplot showing situation after 7 groupings made.
Fig. 4: Part-complete linkage graph.
Fig. 5: Part-complete linkage tree, showing progress of the grouping procedure.
Fig. 6: Scatterplot of spatial distance vs. 'climate distance' for the initial 120 data pairs.
List of Tables
Table 1: Distance matrix, showing distances between the initial data pairs.
Table 2: The initial data set.
Table 3: Group status after 7 groupings have been made.
Table 4: Part-complete linkage table.
You are required to amend/complete as instructed the figures and tables identified in bold type. These are
supplied on paper, and must be submitted for marking alongside your written work. Other figures and
tables are not required for marking purposes.
INTRODUCTION
The activities which we usually think of as "science" - which would include much of what usually
passes as "geography" - try to explain why the real world is like it is. This usually involves trying
to develop certain general ideas which seem to work across a large number of different examples. These workable,
general ideas are usually elevated to the status of 'theory' or 'law'. However, laws and theories rarely spring
to mind without having to put a lot of work in first (although this is not to say that flashes of inspiration stimulated
by the likes of falling apples do not play an important part in the process of scientific discovery!). Very often
we start with only a vague idea of what we are dealing with (e.g. some crude idea of 'climate'), in which case
the primary problem is not the analysis of what meaningful relationships exist within and between certain categories
of phenomena, but what are the different categories into which we put phenomena in the first place?
This process of allocating individual phenomena to broader conceptual categories - i.e. CLASSIFICATION -
is often seen as the second key stage of scientific enquiry (the first stage is that of observation which gives
rise to the data to be classified). By grouping like phenomena into broad conceptual categories, and thereby separating
dissimilar phenomena, certain patterns may be revealed which enable us to derive theories about the workings of
the real world. Sometimes this classification procedure is intuitive: we automatically slot data/phenomena into
different categories without thinking too deeply about it - although this runs the risk of creating groups which
have no real justification (e.g. it might be dangerous to assume that Scotland's climate is distinct from the rest
of the British Isles...). Other data sets, however, may defy intuitive classification: e.g. the jumble of rock
debris dumped at the edges of a glacier can often require careful examination before different debris types (e.g.
debris carried on top of the ice, at the base of the ice, or by subglacial streams) can be separated and identified.
In this kind of situation, relatively formal procedures, such as the one we will use in this practical, are likely
to be useful.
So, why are we doing this?
The key point to take away - and the purpose of this practical - is that it is usually a good idea to look at your
data, and try to make sense of it, before you try to do anything with it. Do not waste time preparing flash statistical
tests if they are not likely to be informative. Take time first to think about your data: essentially a DESCRIPTIVE
process of EXPLORATORY DATA ANALYSIS. What are the data's salient features? Try calculating
simple descriptive statistics, such as mean, mode or range, or, as in this practical, try to group your data into
meaningful categories which can usefully guide further investigation.
This practical sets out to 'make sense' of certain data representative of the climate of the British Isles. Although
we may speak of the British Isles as having a single coherent climate, it is evident from the data displayed in
Table 2 and Figure 2 that there is quite some variability between different parts of the British Isles. The prime
task here is to identify key patterns - or groupings - within these data as a preliminary to an attempt to explain
this variability. As geographers, we have a special interest in groupings which show a coherent spatial pattern
(like phenomena arranged next to each other in space), particularly when this spatial pattern helps us to identify
causal factors. Without looking at Fig. 2, you probably have a good idea of how climate varies across the British
Isles, and what the principal influences on it are. You may wish to decide how you would divide Fig. 2 into climate
groupings before you follow through the procedure described below, and check later to see how closely your results
match your preconceptions.
TECHNIQUE
We will use a graphical procedure of HIERARCHICAL AGGLOMERATIVE classification. This is perhaps
the simplest of a range of similar grouping procedures which are described more fully in Johnston (1976). Anyone
wishing to use this kind of technique at a higher level (e.g. dissertation work) is advised to consult Johnston
for a full discussion of the range of alternative options available.
HIERARCHICAL means that the grouping proceeds one step at a time: in this case, step 1 reduces the initial
16 individuals to one group of two, and 14 individuals (15 groups); step 2 reduces these 15 groups to 14 groups,
etc. The grouping ends when all individuals are collected into a single group. In this case, 15 steps - or LINKS
- are necessary to reach this stage.
AGGLOMERATIVE means that the classification process works by grouping things together. (Alternative procedures
which work in reverse - by progressively splitting-up a single group - are sometimes used.) Successive joining
of individuals/groups works by joining the two 'NEAREST NEIGHBOURS' at any given stage in the process. The
two nearest neighbours of a data set are the two points which are most similar in terms of the variables being
considered (in this case, temperature and rainfall). Here, similarity/dissimilarity is defined by the STRAIGHT-LINE
DISTANCE between data points defined on the scattergram. E.g. see Fig. 2. The straight-line distance between
Craibstone (Cr) and Leuchars (Le) is 17.5 mm; between Craibstone and Paisley (P), 49 mm. Therefore we can say that
(the climates of) Craibstone and Leuchars are approximately three times as similar as are (the climates of) Craibstone
and Paisley. It is simply the distance between two points which is taken as the measure of difference: a given
distance/difference can be ascribed to:
1. a difference in temperature alone (two data points have the same rainfall, but different temperatures);
2. a difference in rainfall alone (two data points have the same temperature, but different total rainfalls);
3. a difference in both temperature and rainfall together.
The scatterplot (Figs 2 and 3) is drawn so that the range of temperature values within the data set (3.7 °C) takes
up the same straight-line distance as does the range of rainfall values within the data set (640 mm), so giving
equal weight to each measure of climate.
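The equal-weighting idea can be sketched numerically. This is a minimal illustration, not part of the graphical procedure: it assumes that each difference is divided by the range quoted above, so distances come out in 'range units' rather than in millimetres on paper. The function name and the two example points are hypothetical.

```python
import math

# Ranges quoted in the text for the 16-station data set.
TEMP_RANGE_C = 3.7
RAIN_RANGE_MM = 640.0


def climate_distance(a, b):
    """Straight-line distance between two (temperature, rainfall) points,
    with each difference divided by its range (an assumed scaling)."""
    dt = (a[0] - b[0]) / TEMP_RANGE_C
    dr = (a[1] - b[1]) / RAIN_RANGE_MM
    return math.hypot(dt, dr)


# Hypothetical points: half the temperature range apart, then half the
# rainfall range apart. Both distances should come out at about 0.5.
temp_only = climate_distance((9.0, 600.0), (10.85, 600.0))
rain_only = climate_distance((9.0, 600.0), (9.0, 920.0))
print(temp_only, rain_only)
```

A difference of half the temperature range counts for as much as a difference of half the rainfall range, which is what drawing both ranges to the same length on the scatterplot achieves.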
How the classification process works is best illustrated by example:
1. Step 0. To begin, plot the initial data points as a scatterplot. This has been done for you: Fig. 2.
2. Step 1. Calculate the straight-line distance between all pairs of data in the initial data set. 16 points
give rise to 120 data pairs (15 + 14 + 13 +... + 1 = 120). Strictly, this requires 120 separate calculations using
Pythagoras' theorem; however, the graphical technique permits a short-cut here. At each step we are required only
to identify the two nearest neighbours, which usually requires nothing more than a quick visual survey of the scatterplot.
However, if several pairs of data seem to be similarly distant, check alternatives using a ruler.
3. Identify which two data points are the nearest neighbours. In the case of Fig. 2 these are Ca and F, separated
by a distance of 5 mm.
4. Combine these two data points to form a single new group: e.g. Ca and F create a new group denoted AA. The new
group is characterised by a value (co-ordinate) midway between the values of its constituent members. In mathematical
terms this is the mean of the two temperature values [(9.3 °C + 9.5 °C)/2 = 9.4 °C] and the mean of the two rainfall
values [(586 mm + 576 mm)/2 = 581 mm], so the new group AA can be represented by a point with co-ordinates
(9.4 °C, 581 mm). In graphical terms, this is the point at the centre of the line joining points Ca and F.
5. Replace the points Ca and F with the single point AA. When using a single hard-copy paper plot, erasing data
points requires the judicious use of Tipp-Ex, or neat, but clear, crossings-out.
6. Make a note of the link number (1), which pair of data was joined (Ca and F), to create which new group (AA),
with what characteristic values (9.4 °C and 581 mm), and the length of the link involved (5 mm) on a suitable linkage
table (Table 4). This completes step 1.
7. Step 2. The process begins again, using exactly the same procedures as before, but this time with one
less group to worry about - 15 data points, not 16. The nearest neighbours are now An and M, separated by a distance
of 8 mm. These are joined to create a new group, BB, which has characteristic values 10.0 °C, 827 mm (rounding
up the 0.5s). The data points representing An and M are removed from the scatterplot, and replaced by the single
point BB, with co-ordinates 10.0 °C, 827 mm. Details of step 2 are recorded in the linkage table as before.
8. Step 3. This is subtly different. The two nearest neighbours are now Lo (Lowestoft) and the group AA,
comprising Caldecott and Finningley, created at step 1. Lo and AA are joined exactly as before to create the new
group CC, which consists of Lowestoft, Caldecott and Finningley. CC is ascribed the mean characteristics, however,
of the TWO data points, Lo and AA, from which it is formed: i.e. (9.8 °C [Lo] + 9.4 °C [AA])/2 = 9.6 °C
(CC); (584 mm [Lo] + 581 mm [AA])/2 = 583 mm (CC). This is perhaps not the best way to join individuals as one
group - Lowestoft, as one place, is given the same weight as the two places - Caldecott and Finningley - that make
up AA - but it is the simplest, which is why it is used here. Johnston (1976) describes different ways to calculate
the mean characteristics of complex groups (i.e. > 2 members).
9. Similarly, step 4 joins E and BB to create group DD...
10. Continue with this process, losing one data point with each link made, until all data points are collapsed
to give a single group. In this case, the first seven links have been made for you - Fig. 3 and Table 3 - and you
are asked to finish the groupings (i.e. make the last 8 links) from this point forwards.
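The steps above can be sketched in code. This is an illustration only, under several assumptions: distances are measured in range units (each difference divided by the 3.7 °C or 640 mm range) rather than in millimetres on paper; only the three stations whose values are quoted in the text are used; and the assignment of 9.3 °C, 586 mm to Ca and 9.5 °C, 576 mm to F is inferred from the order in which the values are written. The function names are hypothetical. Because only three stations are included, Lowestoft joins AA at link 2 here (labelled BB), whereas in the full data set that join is link 3 (CC).

```python
import math
from itertools import combinations

TEMP_RANGE_C = 3.7
RAIN_RANGE_MM = 640.0


def scaled_distance(a, b):
    """Distance between two (temp, rain) points in range units (assumed scaling)."""
    return math.hypot((a[0] - b[0]) / TEMP_RANGE_C, (a[1] - b[1]) / RAIN_RANGE_MM)


def midpoint_linkage(points):
    """Repeatedly join the two nearest points and replace them by their midpoint.

    points: dict mapping label -> (temperature, rainfall).
    Returns linkage-table rows: (link no., member, member, new group, midpoint, length).
    """
    active = dict(points)
    table = []
    link = 0
    while len(active) > 1:
        # Survey every remaining pair and pick the nearest neighbours.
        a, b = min(
            combinations(active, 2),
            key=lambda pair: scaled_distance(active[pair[0]], active[pair[1]]),
        )
        length = scaled_distance(active[a], active[b])
        midpoint = (
            (active[a][0] + active[b][0]) / 2,
            (active[a][1] + active[b][1]) / 2,
        )
        link += 1
        new_label = chr(64 + link) * 2  # AA, BB, CC, ...
        # The two joined points disappear; the new group point replaces them.
        del active[a], active[b]
        active[new_label] = midpoint
        table.append((link, a, b, new_label, midpoint, length))
    return table


stations = {"Ca": (9.3, 586.0), "F": (9.5, 576.0), "Lo": (9.8, 584.0)}
table = midpoint_linkage(stations)
for row in table:
    print(row)

# 16 initial points give 120 pairs to survey at the first step.
n_pairs = len(list(combinations(range(16), 2)))
print(n_pairs)
```

The first row joins Ca and F at about (9.4, 581), and the second joins Lo to that group at about (9.6, 582.5), which the text rounds up to 583 mm.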
N.B. At some point, it becomes necessary to join two groups. For instance, see Fig. 3. IF
you were to join EE and FF at step 8, this would create group HH at the centre-point of the line joining data points
EE and FF. The coordinates of this new point (hypothetically 9.3 °C and 633 mm) define the characteristic value
for this group, containing six places: Caldecott, Finningley, Lowestoft, East Malling, Durham and Leuchars (see
Fig. 5). The thing to note is that, by the rules we are using, all points on the scatterplot carry the same status
when forming groups, whether that point represents a single place, or a ready-made group of several places. Please
note that it is not correct to link EE and FF at step 8!
It is also important to note that once a group has been made, it cannot be unmade. Its constituent points disappear
from the analysis, to be replaced by the new point representative of the new group: e.g. at step 1, Ca and F disappear
into group AA.
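The equal-status rule can be compared with one possible alternative, in which each point is weighted by the number of places it contains. The sketch below uses the Lo and AA values quoted at step 3. The weighted version is an assumption made for illustration only: it is not claimed to be one of the methods described by Johnston (1976), and both function names are hypothetical.

```python
def unweighted_midpoint(p, q):
    """Rule used in this practical: every point carries the same status."""
    return ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)


def weighted_centroid(p, n_p, q, n_q):
    """Illustrative alternative: weight each point by its number of places."""
    total = n_p + n_q
    return (
        (p[0] * n_p + q[0] * n_q) / total,
        (p[1] * n_p + q[1] * n_q) / total,
    )


lo = (9.8, 584.0)  # Lowestoft, one place
aa = (9.4, 581.0)  # group AA, two places (Caldecott and Finningley)

equal_status = unweighted_midpoint(lo, aa)
by_membership = weighted_centroid(lo, 1, aa, 2)
print(equal_status)
print(by_membership)
```

The unweighted rule places the new group halfway between Lo and AA (about 9.6 °C, 582.5 mm). The weighted alternative pulls it towards AA (about 9.53 °C, 582 mm), because AA represents two places.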
Fig. 4
This shows a part-complete line graph which plots the rising trend of link length (i.e. grouping points of increasing
dissimilarity) as successive links are made. This can be useful as a guide as to which stage in the process - or
what number of groupings - gives the most useful classification (see below).
Fig. 5
This is a part-complete LINKAGE TREE or DENDROGRAM. As with Fig. 4 it shows the length of each link
made, but it also has the advantage of giving a visual record of the groupings made. Taking the point representative
of any group (e.g. GG) and tracing the lines backwards as they diverge allows you to identify the sequence of groupings
made, and gives what is (hopefully!) a clear indication of which individuals make up which groups at any stage in
the process - information missing from Fig. 4. The linkage tree is constructed from the linkage table (Table 4);
if you cannot see how this is done, ask! Note: the 16 places have been arranged along the x axis in a specific
order which prevents crossings in the complete linkage tree.
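Tracing a group back to its individual places is the same operation as reading the linkage table recursively. The sketch below uses only the four links named in the text (steps 1 to 4); the row layout and the function name are hypothetical, and the remaining rows of Table 4 are deliberately left out.

```python
# (link number, first member, second member, new group), as described for steps 1-4.
linkage_rows = [
    (1, "Ca", "F", "AA"),
    (2, "An", "M", "BB"),
    (3, "Lo", "AA", "CC"),
    (4, "E", "BB", "DD"),
]


def members(label, rows):
    """Return the individual places that make up a group label."""
    children = {new: (a, b) for _, a, b, new in rows}
    if label not in children:
        return [label]  # an individual place, not a group
    a, b = children[label]
    return members(a, rows) + members(b, rows)


print(members("CC", linkage_rows))  # ['Lo', 'Ca', 'F']
print(members("DD", linkage_rows))  # ['E', 'An', 'M']
```

This mirrors following the branches of the linkage tree downwards from a group until only individual places remain.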
'STOPPING RULE'?
The grouping process works so that the number of groups must be reduced to one. However, except in highly unusual
circumstances, it is highly unlikely that a single grouping will give the most useful classification - particularly
in a subject such as geography, which relies for its existence on contrasts in phenomena arranged in space! Too
much information tends to be lost by forcing diverse individuals into a single group. Conversely, too many groups
leaves an abundance of detail which can make it difficult to identify the key problems required to guide further
study. It is clear that a 'middling' number of groups usually offers the best solution. The question is: "How
do we know which number of groups - which step in the grouping procedure - gives the best classification?"
The answer requires the use of some kind of 'STOPPING RULE' which decides on the most useful classification.
You can decide on your stopping rule in advance. It can be quite simple: e.g. "the grouping process ends when
the initial number of groups has been reduced by half" (or two-thirds, or whatever fraction you choose), whereupon
that number of groups forms the appropriate classification. Using this stopping rule, we would process our initial
16 groups to create a classification with 8 different classes. This stopping rule requires just one more link to
be made after step 7 (see Fig. 3). You may wish to consider whether this last link (i.e. step 8) provides a satisfactory
grouping of climate types.
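The arithmetic behind a fixed-fraction stopping rule is simple, since each link removes exactly one group. A minimal sketch follows; the function name is hypothetical, and rounding to the nearest whole number of groups is an assumption for fractions that do not divide evenly.

```python
def links_needed(n_initial, fraction_removed):
    """Return (links to make, groups remaining) for a fixed-fraction stopping rule.

    Each link reduces the number of groups by one. Rounding the target to the
    nearest whole number of groups is an assumption for awkward fractions.
    """
    target_groups = round(n_initial * (1 - fraction_removed))
    return n_initial - target_groups, target_groups


print(links_needed(16, 0.5))  # (8, 8)
```

With 16 initial points and a 'reduce by half' rule, eight links are needed, which is one more than the seven already made in Fig. 3.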
Less rigid a priori stopping rules can be conceived - rules which do not pre-determine the number of groups
which make up the final classification. E.g. "the grouping process ends when the distance required to make
the next link exceeds a specific value - say 25 % - of the distance between the two data points in the initial
data set which are farthest apart". In our example, the two most dissimilar places in the original data set
are Lowestoft and Baltasound, separated by a distance of 96 mm. Using the 25 % criterion, no groupings would be
accepted if they required a link in excess of 24 mm to make them. Step 7 creates GG with a link distance of 17
mm, which means that by this particular stopping rule, group GG is acceptable. If, however, the next step required
a link of 28 mm to make the next group (28 mm > critical value of 24 mm) group HH would not be acceptable, and
the final choice of groupings to form our classification would be that represented by step 7/Fig. 3. Again, you
may wish to consider whether Fig. 3 represents a useful reduction of the information contained in the initial data
set. Too many groupings? Too few? What would be the effect of using a 50 % maximum distance stopping rule instead?
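The maximum-distance rule can be checked with a few lines of code. This sketch uses only figures quoted above (96 mm maximum separation, a 17 mm link at step 7, and the hypothetical 28 mm link); the function names are hypothetical, and a link exactly equal to the critical value is assumed to be acceptable, since the rule speaks of links which 'exceed' it.

```python
def critical_distance(max_initial_distance_mm, fraction):
    """Longest link allowed under a maximum-distance stopping rule."""
    return fraction * max_initial_distance_mm


def link_is_acceptable(link_length_mm, critical_mm):
    """A link is rejected only if it exceeds the critical distance."""
    return link_length_mm <= critical_mm


critical_25 = critical_distance(96, 0.25)
print(critical_25)                          # 24.0
print(link_is_acceptable(17, critical_25))  # True: GG is accepted
print(link_is_acceptable(28, critical_25))  # False: HH would be rejected

critical_50 = critical_distance(96, 0.50)
print(critical_50)                          # 48.0
```

Changing the fraction simply moves the critical distance, so the same two functions can be reused to explore the 50 % rule.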
Alternatively, the stopping rule can be chosen in retrospect, once the grouping procedure has been carried through
to the final step which creates a single group. In this case the sequence of linkages must be examined to identify
a suitable stop. This usually means finding a marked break in the pattern of linkages which is taken to indicate
the most useful number of groupings, and hence the appropriate classification. This is where the line graph of
link number vs. link distance (Fig. 4) and/or the linkage tree (Fig. 5) come in useful. Both provide a visual record
of the distance involved in making successive groupings. Any sharp jump in link distance (a break of slope in the
case of Fig. 4) suggests a possible point to place the stop. The classification accepted is then the number of
groups which immediately precedes this jump. E.g. if the jump in link distance occurs between steps 9 and 10, the
groups accepted as the final classification are those which exist after step 9 is carried out (which, for the present
case, would be 7 groups left).
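One way to make the 'sharp jump' idea concrete is to take the largest increase between successive link lengths. This is only one possible reading of the retrospective rule, and the link lengths below are invented purely for illustration: they are not the values from Table 4 or Fig. 4. The function name is hypothetical.

```python
def largest_jump_stop(link_lengths, n_initial):
    """Return (step to stop after, groups remaining) using the largest jump.

    jumps[i] is the increase between step i+1 and step i+2, so the
    classification accepted is the one that exists after step i+1.
    """
    jumps = [later - earlier for earlier, later in zip(link_lengths, link_lengths[1:])]
    biggest = max(range(len(jumps)), key=jumps.__getitem__)
    stop_step = biggest + 1
    return stop_step, n_initial - stop_step


# Invented link lengths (mm) for 15 links, with a jump between steps 9 and 10.
example_lengths = [5, 8, 9, 11, 13, 15, 17, 19, 21, 40, 42, 45, 50, 60, 70]
print(largest_jump_stop(example_lengths, 16))  # (9, 7)
```

The choice of 'largest jump' is itself a subjective decision; a different definition of a marked break could place the stop elsewhere.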
It is important to emphasise that whereas the grouping procedure is objective (once you decide on a certain set
of grouping rules, you must link points according to those rules, not by what you think should link) the choice
of stopping rule is entirely SUBJECTIVE. What you choose as your stop is your decision. The only general
rule available to guide your choice of stopping rule is a 'rule of thumb' regarding 'practical adequacy': choose
your stopping rule to give what you think will be/is a workable classification. (A "workable classification"
decides upon a number of groups which gives a reasonable compromise between too much information, and too little.)
REFERENCE
Johnston, R. J. 1976. Classification in Geography. [Concepts and Techniques in Modern Geography (CATMOG) 6]. Norwich:
Geo Abstracts. 910.0182 Con 6 HD Ref
WHAT YOU HAVE TO DO (i.e. work for assessment)
1. Complete the hierarchical grouping procedure by making suitable, successive changes to Fig. 3. As you
progress, fill in the relevant details on Table 4, and complete Figs 4 and 5. Reduce your data to a single grouping
regardless of whether or not you choose an a priori stopping rule.
2.a In your judgement, which step/number of groups gives the most useful classification of the British Isles' climate?
Justify your choice, including a statement of the criterion/criteria you use as a 'stopping rule'. (N.B. 'Groups'
which contain just the one individual can form perfectly acceptable divisions of a classification.)
2.b How would you describe each individual grouping of your classification? This requires an appropriate title
and/or summary phrase which encapsulates the key characteristics of each of your chosen climate groupings: e.g.
"Type A. South-east sector. Continental-type climate. Extreme. Cold and dry."
2.c Map your choice of classification by making suitable additions to Fig. 1. This requires you to depict the spatial
distribution of the different climate groupings which you identify by drawing the relevant boundaries on Fig. 1.
Remember to include a title, key (if appropriate) and suitable shading and/or annotations which make the distinction
between different groupings clear.
3.a How would you characterise the climate of the British Isles in general? (Briefly) What key factors does this
general climate reflect?
3.b How would you account for the contrasts in climate shown in your chosen classification?
4. Using your map of climate classification, your answer to Qu. 3b and the information contained in Tables 1 and
2, and Figs 2 and 6, comment on the relationship between spatial (real world) distance and 'climate distance'.
Hint: imagine a line of best fit (trendline) through the data scatter, and consider the significance of 'outlying'
points such as (109, 64) and (335, 11).
[Note: CORRELATION COEFFICIENT. This puts a numerical value on the strength of the (linear) relationship between
data sets of paired variables. The idea of 'strength of relationship' is best grasped by thinking of the data set
out as a scatterplot, as in Fig. 6. If the data conform closely to a straight line, we can talk of a strong correlation;
if the data are widely scattered, making it difficult to fit a straight line to the data, the correlation is weak.
For variables which are positively related (i.e. y increases as x increases) values of the correlation coefficient
fall between 0.0 and 1.0. A correlation coefficient of 1.0 indicates that the data fall exactly on a straight line;
a correlation coefficient of 0.0 indicates that the data show perfectly random scatter, and no linear relationship
exists between the two variables.]
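[For reference, the calculation can be sketched as follows. It is assumed here that the coefficient intended is the Pearson product-moment correlation coefficient. The function name and the example values are hypothetical and are not taken from Fig. 6.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient for paired values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Invented data lying exactly on a straight line: r should be about 1.0.
r_line = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
print(r_line)
```

Data which depart from a straight line give values closer to 0.0.]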
5. What practical problems can you envisage if you were to use this hierarchical grouping procedure to classify
a data set which contained substantially more than 16 data pairs? How might you modify the grouping rules to reduce
this difficulty? What would be the costs in terms of the potential loss of information if you were to introduce
these modified rules?
6.a How useful do you consider the two measures of climate used as the basis of this classification to be? For
instance: do you have sufficient information here to assess the impact of global warming on the climate of the
British Isles?
6.b Can you suggest how you might include a third measure of climate in the grouping procedure? Hint: you might
like to consider graphical or mathematical tricks to do this. Thinking triangles should help in either case!
Please submit attempted answers to each of the questions above, along with appropriate figures. Each question,
or part-question, should require no more than a couple of paragraphs at most. Work should be deposited in the post-box
within one week of the practical date. Please remember to attach and correctly complete a cover sheet.