GG2003: Environmental Systems
Practical 1
GROUPING PROCEDURES AND CLASSIFICATION
Case Study: The Climate Of The British Isles



PURPOSE

The purpose of this practical is:

1. To highlight the importance of looking at, and thinking about, your data before trying to do too much with it: what we might call exploratory data analysis.

2. To introduce you to a specific technique - 'nearest neighbour' linkage - which enables us to create classifications of seemingly diverse phenomena.

3. To use this technique to explore aspects of climatic variability within the British Isles.


List of Figures

Fig. 1: Map of the British Isles, showing the location of the 16 stations from which the initial climate data are taken.

Fig. 2: Scatterplot showing the initial 16 data pairs.

Fig. 3: Revised scatterplot showing situation after 7 groupings made.

Fig. 4: Part-complete linkage graph.

Fig. 5: Part-complete linkage tree, showing progress of the grouping procedure.

Fig. 6: Scatterplot of spatial distance vs. 'climate distance' for the initial 120 data pairs.


List of Tables

Table 1: Distance matrix, showing distances between the initial data pairs.

Table 2: The initial data set.

Table 3: Group status after 7 groupings have been made.

Table 4: Part-complete linkage table.


You are required to amend/complete as instructed the figures and tables identified in bold type. These are supplied on paper, and must be submitted for marking alongside your written work. Other figures and tables are not required for marking purposes.


INTRODUCTION


The activities which we usually think of as "science" - which would include much of what usually passes as "geography" - try to explain why the real world is like it is. This usually involves trying to develop certain general ideas which seem to work across a large number of different examples. These workable, general ideas are usually elevated to the status of 'theory' or 'law'. However, laws and theories rarely spring to mind without having to put a lot of work in first (although this is not to say that flashes of inspiration stimulated by the likes of falling apples do not play an important part in the process of scientific discovery!). Very often we start with only a vague idea of what we are dealing with (e.g. some crude idea of 'climate'), in which case the primary problem is not the analysis of what meaningful relationships exist within and between certain categories of phenomena, but what are the different categories into which we put phenomena in the first place?

This process of allocating individual phenomena to broader conceptual categories - i.e. CLASSIFICATION - is often seen as the second key stage of scientific enquiry (the first stage is that of observation which gives rise to the data to be classified). By grouping like phenomena into broad conceptual categories, and thereby separating dissimilar phenomena, certain patterns may be revealed which enable us to derive theories about the workings of the real world. Sometimes this classification procedure is intuitive: we automatically slot data/phenomena into different categories without thinking too deeply about it - although this runs the risk of creating groups which have no real justification (e.g. it might be dangerous to assume that Scotland's climate is distinct from the rest of the British Isles...). Other data sets, however, may defy intuitive classification: e.g. the jumble of rock debris dumped at the edges of a glacier can often require careful examination before different debris types (e.g. debris carried on top of the ice, at the base of the ice, or by subglacial streams) can be separated and identified. In this kind of situation, relatively formal procedures, such as the one we will use in this practical, are likely to be useful.


So, why are we doing this?

The key point to take away - and the purpose of this practical - is that it is usually a good idea to look at your data, and try to make sense of it, before you try to do anything with it. Do not waste time preparing flash statistical tests if they are not likely to be informative. Take time first to think about your data: essentially a DESCRIPTIVE process of EXPLORATORY DATA ANALYSIS. What are the data's salient features? Try calculating simple descriptive statistics, such as mean, mode or range, or, as in this practical, try to group your data into meaningful categories which can usefully guide further investigation.

This practical sets out to 'make sense' of certain data representative of the climate of the British Isles. Although we may speak of the British Isles as having a single coherent climate, it is evident from the data displayed in Table 2 and Figure 2 that there is quite some variability between different parts of the British Isles. The prime task here is to identify key patterns - or groupings - within these data as a preliminary to an attempt to explain this variability. As geographers, we have a special interest in groupings which show a coherent spatial pattern (like phenomena arranged next to each other in space), particularly when this spatial pattern helps us to identify causal factors. Without looking at Fig. 2, you probably have a good idea of how climate varies across the British Isles, and what the principal influences on it are. You may wish to decide how you would divide Fig. 2 into climate groupings before you follow through the procedure described below, and check later to see how closely your results match your preconceptions.


TECHNIQUE
We will use a graphical procedure of HIERARCHICAL AGGLOMERATIVE classification. This is perhaps the simplest of a range of similar grouping procedures which are described more fully in Johnston (1976). Anyone wishing to use this kind of technique at a higher level (e.g. dissertation work) is advised to consult Johnston for a full discussion of the range of alternative options available.

HIERARCHICAL means that the grouping proceeds one step at a time: in this case, step 1 reduces the initial 16 individuals to one group of two, and 14 individuals (15 groups); step 2 reduces these 15 groups to 14 groups, etc. The grouping ends when all individuals are collected into a single group. In this case, 15 steps - or LINKS - are necessary to reach this stage.

AGGLOMERATIVE means that the classification process works by grouping things together. (Alternative procedures which work in reverse - by progressively splitting-up a single group - are sometimes used.) Successive joining of individuals/groups works by joining the two 'NEAREST NEIGHBOURS' at any given stage in the process. The two nearest neighbours of a data set are the two points which are most similar in terms of the variables being considered (in this case, temperature and rainfall). Here, similarity/dissimilarity is defined by the STRAIGHT-LINE DISTANCE between data points defined on the scattergram. E.g. see Fig. 2. The straight-line distance between Craibstone (Cr) and Leuchars (Le) is 17.5 mm; between Craibstone and Paisley (P), 49 mm. Therefore we can say that (the climates of) Craibstone and Leuchars are approximately three times as similar as are (the climates of) Craibstone and Paisley. It is simply the distance between two points which is taken as the measure of difference: a given distance/difference can be ascribed to:

1. a difference in temperature alone (two data points have the same rainfall, but different temperatures);
2. a difference in rainfall alone (two data points have the same temperature, but different total rainfalls);
3. a difference in both temperature and rainfall together.

The scatterplot (Figs 2 and 3) is drawn so that the range of temperature values within the data set (3.7 °C) takes up the same straight-line distance as does the range of rainfall values within the data set (640 mm), so giving equal weight to each measure of climate.
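For anyone who prefers to check distances by calculation rather than by ruler, the scaling just described can be sketched in a few lines of Python. This is an illustration only, not part of the practical: the function name climate_distance and the axis_mm parameter (the printed length of each axis) are assumptions made for this example, and because the axis length of Fig. 2 is not stated here, the result is left as a fraction of one axis by default. The two points used are the values quoted for Ca and F in the worked example further on; the pairing of each temperature with each rainfall figure is assumed from the order in which they are listed there.

```python
import math

TEMP_RANGE_C = 3.7     # range of temperature values in the data set (from the text)
RAIN_RANGE_MM = 640.0  # range of rainfall values in the data set (from the text)


def climate_distance(p, q, axis_mm=1.0):
    """Straight-line distance between two (temperature, rainfall) points.

    Each difference is divided by the range of its variable, so that both
    variables span the same length on the plot. axis_mm is a hypothetical
    axis length; with the default of 1.0 the result is a fraction of an axis.
    """
    d_temp = (p[0] - q[0]) / TEMP_RANGE_C
    d_rain = (p[1] - q[1]) / RAIN_RANGE_MM
    return axis_mm * math.hypot(d_temp, d_rain)  # Pythagoras' theorem


# Values quoted for Ca and F in the worked example (pairing assumed from
# the order in which they are listed there).
ca = (9.3, 586.0)
f = (9.5, 576.0)
print(round(climate_distance(ca, f), 3))
```

With this scaling, a difference of 3.7 °C at equal rainfall and a difference of 640 mm at equal temperature both give a distance of exactly one axis length, which is what "equal weight" means here.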


How the classification process works is best illustrated by example:

1. Step 0. To begin, plot the initial data points as a scatterplot. This has been done for you: Fig. 2.

2. Step 1. Calculate the straight-line distance between all pairs of data in the initial data set. 16 points give rise to 120 data pairs (15 + 14 + 13 +... + 1 = 120). Strictly, this requires 120 separate calculations using Pythagoras' theorem; however, the graphical technique permits a short-cut here. At each step we are required only to identify the two nearest neighbours, which usually requires nothing more than a quick visual survey of the scatterplot. However, if several pairs of data seem to be similarly distant, check alternatives using a ruler.

3. Identify which two data points are the nearest neighbours. In the case of Fig. 2 these are Ca and F, separated by a distance of 5 mm.

4. Combine these two data points to form a single new group: e.g. Ca and F create a new group denoted AA. The new group is characterised by a value (co-ordinate) midway between the values of its constituent members. In mathematical terms this is the mean of the two temperature values [(9.3 °C + 9.5 °C)/2 = 9.4 °C] and the mean of the two rainfall values [(586 mm + 576 mm)/2 = 581 mm] so the new group AA can be represented by a point with co-ordinates (9.4 °C, 581 mm). In graphical terms, this is the point at the centre of the line joining points Ca and F.

5. Replace the points Ca and F with the single point AA. When using a single hard-copy paper plot, erasing data points requires the judicious use of Tippex, or neat, but clear, crossings-out.

6. Make a note of the link number (1), which pair of data was joined (Ca and F), to create which new group (AA), with what characteristic values (9.4 °C and 581 mm), and the length of the link involved (5 mm) on a suitable linkage table (Table 4). This completes step 1.

7. Step 2. The process begins again, using exactly the same procedures as before, but this time with one less group to worry about - 15 data points, not 16. The nearest neighbours are now An and M, separated by a distance of 8 mm. These are joined to create a new group, BB, which has characteristic values 10.0 °C, 827 mm (rounding up the 0.5s). The data points representing An and M are removed from the scatterplot, and replaced by the single point BB, with co-ordinates 10.0 °C, 827 mm. Details of step 2 are recorded in the linkage table as before.

8. Step 3. This is subtly different. The two nearest neighbours are now Lo (Lowestoft) and the group AA, comprising Caldecott and Finningley, created at step 1. Lo and AA are joined exactly as before to create the new group CC, which consists of Lowestoft, Caldecott and Finningley. CC is ascribed the mean characteristics, however, of the TWO data points, Lo and AA, from which it is formed: i.e. (9.8 °C [Lo] + 9.4 °C [AA])/2 = 9.6 °C (CC); (584 mm [Lo] + 581 mm [AA])/2 = 583 mm (CC). This is perhaps not the best way to join individuals as one group (Lowestoft, as one place, is given the same weight as the two places, Caldecott and Finningley, that make up AA) but it is the simplest, which is why it is used here. Johnston (1976) describes different ways to calculate the mean characteristics of complex groups (i.e. > 2 members).

9. Similarly, step 4 joins E and BB to create group DD...

10. Continue with this process, losing one data point with each link made, until all data points are collapsed to give a single group. In this case, the first seven links have been made for you - Fig. 3 and Table 3 - and you are asked to finish the groupings (i.e. make the last 8 links) from this point forwards.
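The steps above can be sketched in Python as a check on the logic of the procedure. This is a minimal illustration under stated assumptions, not a replacement for the graphical method: the function names are invented for this example, distances are in fractions of an axis rather than millimetres (the axis length of Fig. 2 is not given here), and only the three stations whose values are quoted in steps 1 and 3 are used. Group BB is skipped in the labels because the individual values for An and M are not quoted.

```python
import math
from itertools import combinations

TEMP_RANGE_C = 3.7     # from the text
RAIN_RANGE_MM = 640.0  # from the text


def scaled_distance(p, q):
    """Distance in fractions of an axis, each variable divided by its range."""
    return math.hypot((p[0] - q[0]) / TEMP_RANGE_C,
                      (p[1] - q[1]) / RAIN_RANGE_MM)


def group_by_midpoints(points, labels):
    """Join the two nearest points repeatedly, one link per label.

    points: dict of name -> (temperature, rainfall)
    labels: names for the new groups, in order of creation
    Stops when one group remains or the labels run out.
    Returns a linkage table: (link, member 1, member 2, new group,
    midpoint, link length).
    """
    current = dict(points)  # points still on the scatterplot
    table = []
    for link, new_name in enumerate(labels, start=1):
        if len(current) < 2:
            break
        # Survey every remaining pair and pick the nearest neighbours.
        a, b = min(combinations(current, 2),
                   key=lambda pair: scaled_distance(current[pair[0]],
                                                    current[pair[1]]))
        length = scaled_distance(current[a], current[b])
        # The new group sits midway between the two points joined,
        # whether each is a single place or an existing group.
        midpoint = ((current[a][0] + current[b][0]) / 2,
                    (current[a][1] + current[b][1]) / 2)
        # Joined points leave the plot; the new group replaces them.
        del current[a], current[b]
        current[new_name] = midpoint
        table.append((link, a, b, new_name, midpoint, length))
    return table


# Only the stations whose values are quoted in the text.
stations = {"Ca": (9.3, 586.0), "F": (9.5, 576.0), "Lo": (9.8, 584.0)}
for row in group_by_midpoints(stations, ["AA", "CC"]):
    print(row)
```

Each row of the returned list corresponds to one line of a linkage table such as Table 4. The survey of every remaining pair is the calculation that the visual inspection of the scatterplot replaces.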

N.B. At some point, it becomes necessary to join two groups. For instance, see Fig. 3. IF you were to join EE and FF at step 8, this would create group HH at the centre-point of the line joining data points EE and FF. The coordinates of this new point (hypothetically 9.3 °C and 633 mm) define the characteristic value for this group, containing six places: Caldecott, Finningley, Lowestoft, East Malling, Durham and Leuchars (see Fig. 5). The thing to note is that, by the rules we are using, all points on the scatterplot carry the same status when forming groups, whether that point represents a single place, or a ready-made group of several places. Please note that it is not correct to link EE and FF at step 8!
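To see what this equal-status rule implies, the short Python sketch below compares the midpoint used in this practical with a mean weighted by the number of places in each point, using group CC from step 3. The weighted mean is included purely for comparison and is an assumption of this example; it is not necessarily one of the methods described by Johnston (1976). The function names are illustrative.

```python
def midpoint(p, q):
    """Rule used in this practical: both points count equally."""
    return ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)


def weighted_mean(p, n_p, q, n_q):
    """Alternative, assumed for comparison: weight each point by its number of places."""
    total = n_p + n_q
    return ((p[0] * n_p + q[0] * n_q) / total,
            (p[1] * n_p + q[1] * n_q) / total)


lo = (9.8, 584.0)  # Lowestoft, one place
aa = (9.4, 581.0)  # group AA, two places (Caldecott and Finningley)

print(midpoint(lo, aa))             # the practical's rule
print(weighted_mean(lo, 1, aa, 2))  # Lowestoft counts once, AA twice
```

The midpoint gives 9.6 °C and 582.5 mm (recorded as 583 mm after rounding up), as in step 3; the weighted version pulls CC towards AA, to roughly 9.5 °C and 582 mm. The difference is small here but grows as groups become larger and more unequal in size.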

It is also important to note that once a group has been made, it cannot be unmade. Its constituent points disappear from the analysis, to be replaced by the new point representative of the new group: e.g. at step 1, Ca and F disappear into group AA.


Fig. 4

This shows a part-complete line graph which plots the rising trend of link length (i.e. grouping points of increasing dissimilarity) as successive links are made. This can be useful as a guide as to which stage in the process - or what number of groupings - gives the most useful classification (see below).


Fig. 5


This is a part-complete LINKAGE TREE or DENDROGRAM. As with Fig. 4 it shows the length of each link made, but it also has the advantage of giving a visual record of the groupings made. Taking the point representative of any group (e.g. GG) and tracing the lines backwards as they diverge allows you to identify the sequence of groupings made, and gives what is (hopefully!) a clear indication of which individuals make up which groups at any stage in the process - information missing from Fig. 4. The linkage tree is constructed from the linkage table (Table 4); if you cannot see how this is done, ask! Note: the 16 places have been arranged along the x axis in a specific order which prevents crossings in the complete linkage tree.


'STOPPING RULE'?

The grouping process works so that the number of groups must be reduced to one. However, except in highly unusual circumstances, it is highly unlikely that a single grouping will give the most useful classification - particularly in a subject such as geography, which relies for its existence on contrasts in phenomena arranged in space! Too much information tends to be lost by forcing diverse individuals into a single group. Conversely, too many groups leaves an abundance of detail which can make it difficult to identify the key problems required to guide further study. It is clear that a 'middling' number of groups usually offers the best solution. The question is: "How do we know which number of groups - which step in the grouping procedure - gives the best classification?" The answer requires the use of some kind of 'STOPPING RULE' which decides on the most useful classification.

You can decide on your stopping rule in advance. It can be quite simple: e.g. "the grouping process ends when the initial number of groups has been reduced by half" (or two-thirds, or whatever fraction you choose), whereupon that number of groups forms the appropriate classification. Using this stopping rule, we would process our initial 16 groups to create a classification with 8 different classes. This stopping rule requires just one more link to be made after step 7 (see Fig. 3). You may wish to consider whether this last link (i.e. step 8) provides a satisfactory grouping of climate types.

Less rigid a priori stopping rules can be conceived - rules which do not pre-determine the number of groups which make up the final classification. E.g. "the grouping process ends when the distance required to make the next link exceeds a specific value - say 25 % - of the distance between the two data points in the initial data set which are farthest apart". In our example, the two most dissimilar places in the original data set are Lowestoft and Baltasound, separated by a distance of 96 mm. Using the 25 % criterion, no groupings would be accepted if they required a link in excess of 24 mm to make them. Step 7 creates GG with a link distance of 17 mm, which means that by this particular stopping rule, group GG is acceptable. If, however, the next step required a link of 28 mm to make the next group (28 mm > critical value of 24 mm) group HH would not be acceptable, and the final choice of groupings to form our classification would be that represented by step 7/Fig. 3. Again, you may wish to consider whether Fig. 3 represents a useful reduction of the information contained in the initial data set. Too many groupings? Too few? What would be the effect of using a 50 % maximum distance stopping rule instead?
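The arithmetic of this rule is simple enough to set out as a short Python sketch. The function names are invented for this illustration; the distances are those quoted in the paragraph above, and the 28 mm link is the hypothetical value used there, not a measured result.

```python
def max_link_allowed(farthest_distance_mm, fraction):
    """Longest link accepted under a fractional-distance stopping rule."""
    return farthest_distance_mm * fraction


def link_is_acceptable(link_mm, farthest_distance_mm, fraction=0.25):
    """True if the link does not exceed the critical distance."""
    return link_mm <= max_link_allowed(farthest_distance_mm, fraction)


FARTHEST_MM = 96.0  # distance between the two most dissimilar places (from the text)

print(max_link_allowed(FARTHEST_MM, 0.25))    # critical distance in mm
print(link_is_acceptable(17.0, FARTHEST_MM))  # step 7, group GG
print(link_is_acceptable(28.0, FARTHEST_MM))  # the hypothetical 28 mm link
```

The fraction is passed as a parameter so that other criteria can be tried without changing the rule itself.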

Alternatively, the stopping rule can be chosen in retrospect, once the grouping procedure has been carried through to the final step which creates a single group. In this case the sequence of linkages must be examined to identify a suitable stop. This usually means finding a marked break in the pattern of linkages which is taken to indicate the most useful number of groupings, and hence the appropriate classification. This is where the line graph of link number vs. link distance (Fig. 4) and/or the linkage tree (Fig. 5) come in useful. Both provide a visual record of the distance involved in making successive groupings. Any sharp jump in link distance (a break of slope in the case of Fig. 4) suggests a possible point to place the stop. The classification accepted is then the number of groups which immediately precedes this jump. E.g. if the jump in link distance occurs between steps 9 and 10, the groups accepted as the final classification are those which exist after step 9 is carried out (which, for the present case, would be 7 groups left).
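One way to make 'a sharp jump' concrete is to look for the largest increase between successive link distances. The Python sketch below does this; it is only one possible reading of a 'marked break', the function name is invented for this illustration, and the link lengths are HYPOTHETICAL values made up for a six-point data set, not results from this practical.

```python
def groups_before_largest_jump(link_lengths, n_initial):
    """Find the step just before the largest jump in link length.

    link_lengths[i] is the length of link i + 1.
    Returns (last step kept, number of groups left after that step).
    """
    jumps = [later - earlier
             for earlier, later in zip(link_lengths, link_lengths[1:])]
    biggest = max(range(len(jumps)), key=jumps.__getitem__)
    last_step = biggest + 1  # jumps[0] lies between steps 1 and 2
    return last_step, n_initial - last_step


# HYPOTHETICAL link lengths (mm) for a 6-point data set, invented for illustration.
example_lengths = [4.0, 6.0, 7.0, 20.0, 23.0]
print(groups_before_largest_jump(example_lengths, n_initial=6))
```

In this invented example the largest jump falls between steps 3 and 4, so the classification accepted would be the three groups which exist after step 3. A rule of this kind still needs judgement: two jumps of similar size, or a steadily rising curve with no clear break, leave the choice open.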

It is important to emphasise that whereas the grouping procedure is objective (once you decide on a certain set of grouping rules, you must link points according to those rules, not by what you think should link) the choice of stopping rule is entirely SUBJECTIVE. What you choose as your stop is your decision. The only general rule available to guide your choice of stopping rule is a 'rule of thumb' regarding 'practical adequacy': choose your stopping rule to give what you think will be/is a workable classification. (A "workable classification" decides upon a number of groups which gives a reasonable compromise between too much information, and too little.)


REFERENCE

Johnston, R. J. 1976. Classification in Geography. [Concepts and Techniques in Modern Geography (CATMOG) 6]. Norwich: Geo Abstracts. 910.0182 Con 6 HD Ref


WHAT YOU HAVE TO DO (i.e. work for assessment)
1. Complete the hierarchical grouping procedure by making suitable, successive changes to Fig. 3. As you progress, fill in the relevant details on Table 4, and complete Figs 4 and 5. Reduce your data to a single grouping regardless of whether or not you choose an a priori stopping rule.

2.a In your judgement, which step/number of groups gives the most useful classification of the British Isles' climate? Justify your choice, including a statement of the criterion/criteria you use as a 'stopping rule'. (N.B. 'Groups' which contain just the one individual can form perfectly acceptable divisions of a classification.)

2.b How would you describe each individual grouping of your classification? This requires an appropriate title and/or summary phrase which encapsulates the key characteristics of each of your chosen climate groupings: e.g. "Type A. South-east sector. Continental-type climate. Extreme. Cold and dry."

2.c Map your choice of classification by making suitable additions to Fig. 1. This requires you to depict the spatial distribution of the different climate groupings which you identify by drawing the relevant boundaries on Fig. 1. Remember to include a title, key (if appropriate) and suitable shading and/or annotations which make the distinction between different groupings clear.

3.a How would you characterise the climate of the British Isles in general? (Briefly) What key factors does this general climate reflect?

3.b How would you account for the contrasts in climate shown in your chosen classification?

4. Using your map of climate classification, your answer to Qu. 3b and the information contained in Tables 1 and 2, and Figs 2 and 6, comment on the relationship between spatial (real world) distance and 'climate distance'. Hint: imagine a line of best fit (trendline) through the data scatter, and consider the significance of 'outlying' points such as (109, 64) and (335, 11).

[Note: CORRELATION COEFFICIENT. This puts a numerical value on the strength of the (linear) relationship between data sets of paired variables. The idea of 'strength of relationship' is best grasped by thinking of the data set out as a scatterplot, as in Fig. 6. If the data conform closely to a straight line, we can talk of a strong correlation; if the data are widely scattered, making it difficult to fit a straight line to the data, the correlation is weak. For variables which are positively related (i.e. y increases as x increases) values of the correlation coefficient fall between 0.0 and 1.0. A correlation coefficient of 1.0 indicates that the data fall exactly on a straight line; a correlation coefficient of 0.0 indicates that the data show perfectly random scatter, and no linear relationship exists between the two variables.]

5. What practical problems can you envisage if you were to use this hierarchical grouping procedure to classify a data set which contained substantially more than 16 data pairs? How might you modify the grouping rules to reduce this difficulty? What would be the costs in terms of the potential loss of information if you were to introduce these modified rules?

6.a How useful do you consider the two measures of climate used as the basis of this classification to be? For instance: do you have sufficient information here to assess the impact of global warming on the climate of the British Isles?

6.b Can you suggest how you might include a third measure of climate in the grouping procedure? Hint: you might like to consider graphical or mathematical tricks to do this. Thinking triangles should help in either case!

Please submit attempted answers to each of the questions above, along with appropriate figures. Each question, or part-question, should require no more than a couple of paragraphs at most. Work should be deposited in the post-box within one week of the practical date. Please remember to attach and correctly complete a cover sheet.