Saturday, 14 June 2014

Powerful percentages

Numbers are powerful, statistics are powerful, but they must be used correctly and responsibly. Leaders need to use data to help take decisions and measure progress, but leaders also need to make sure that they know where limitations creep into data, particularly when it's processed into summary figures.

This links quite closely to this post by David Didau (@Learningspy) where he discusses availability bias - i.e. being biased because you're using the data that is available rather than thinking about it more deeply.

As part of this there is an important misuse of percentages that as a maths teacher I feel the need to highlight... basically when you turn raw numbers into percentages it can add weight to them, but sometimes this weight is undeserved...

Percentages can end up being discrete measures dressed up as continuous
Quick reminder of GCSE data types - discrete data comes in chunks; it can't take values between particular points. Classic examples might be shoe sizes, where there is no measure between size 9 and size 10, or favourite flavours of crisps, where there is no midpoint between Cheese & Onion and Smoky Bacon.

Continuous data can have subdivisions inserted between values; for example a measure of height could be in metres, centimetres, millimetres and so on - it can keep on being divided.

The problem with percentages is that they look continuous - you can quote 27%, 34.5%, 93.2453%. However the data used to calculate the percentage actually imposes discrete limits on the possible outcomes. A sample of 1 can only have a result of 0% or 100%, a sample of 2 can only result in 0%, 50% or 100%, a sample of 3 can only give 0%, 33.3%, 66.7% or 100%, and so on. Even with 200 data points you can only have 201 separate percentage values - it's not really continuous unless you get to massive samples.

It LOOKS continuous and is talked about like a continuous measure, but it is actually often discrete and determined by the sample that you are working with.
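To make the point concrete, here's a quick sketch (my illustration in Python, not anything from the original argument) that enumerates every percentage a given sample size can actually produce:

```python
def possible_percentages(n):
    """Return every percentage value a sample of n data points can produce."""
    return [round(100 * k / n, 1) for k in range(n + 1)]

for n in (1, 2, 3, 200):
    values = possible_percentages(n)
    print(f"sample of {n}: {len(values)} possible values, e.g. {values[:4]}")
```

A sample of 3 can only ever give 0.0, 33.3, 66.7 or 100.0 - nothing in between, however many decimal places you quote.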

Percentages as discrete data makes setting targets difficult for small groups
Picture a school that sets an overall target that at least 80% of students in a particular category (receipt of pupil premium, SEN needs, whatever else) are expected to meet or exceed expected progress.

In this hypothetical school there are three equivalent classes, let's call them A, B and C. In class A we can calculate that 50% of these students are making expected progress; in class B it's 100%, and in class C it's 0%. On face value Class A is 30 percentage points behind target, B is 20 ahead and C is 80 behind, but that's completely misleading...

Class A has two students in this category, one is making expected progress, the other isn't. As such it's impossible to meet the 80% target in this class - the only options are 0%, 50% or 100%. If the whole school target at 80% accepts that some students may not reach expected progress then by definition you have to accept that 50% might be on target for this specific class. You might argue that 80% is closer to 100% so that should be the target for this class, but that means that this teacher has to achieve 100% where the whole school is only aiming at 80%! The school has room for error but this class doesn't! To suggest that this teacher is underperforming because they haven't hit 100% is unfair. Here the percentage has completely confused the issue, when what's really important is whether these two individuals are learning as well as they can.
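To illustrate the bind this puts the teacher of class A in, here's a small hypothetical helper (mine, not any school's actual tool) that finds the nearest achievable result to a target for a given group size:

```python
def nearest_achievable(target_pct, n):
    """Return the achievable percentage closest to a target for a group of n students."""
    options = [100 * k / n for k in range(n + 1)]
    return min(options, key=lambda p: abs(p - target_pct))

print(nearest_achievable(80, 2))    # 100.0 - a class of 2 can only hit 0, 50 or 100
print(nearest_achievable(80, 200))  # 80.0 - a big cohort can land on the target exactly
```

For the group of two, the closest thing to "80%" is 100% - which is exactly the harsher-than-whole-school target described above.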

Class B and C might each have only one student in this category. But it doesn't mean that the teacher of class B is better than that of class C. In class B the student's category happens to have no significant impact on their learning in that subject, they progress alongside the rest of the class with no issues, with no specific extra input from the teacher. In class C the student is also a young carer and misses extended periods from school; when present they work well but there are gaps in their knowledge due to absences that even the best teacher will struggle to fill. To suggest that either teacher is more successful than the other on the basis of this data is completely misleading as the detailed status of individual students is far more significant.

What this is intended to illustrate is that taking a target for a large population of students and applying it to much smaller subsets can cause real issues. Maybe the 80% works at a whole school level, but surely it makes much more sense at a class level to talk about the individual students rather than reducing them to a misleading percentage?

Percentage amplifies small populations into large ones
Simply because percent means "per hundred" we start to picture large numbers. When we state that 67% of books reviewed have been marked in the last two weeks it conjures up images of 67 books out of 100. However that statistic could have been arrived at having only reviewed 3 books, 2 of which had been marked recently. The percentage gives no indication of the true sample size, and therefore 67% could hide the fact that the very next step up would be 100%!

If the following month the same measure is quoted as having jumped to 75% it looks like a big improvement, but it could simply be 9 out of 12 this time, compared to 8 out of 12 the previous month. Arithmetically the percentages are correct (given rounding), but the apparent step change from 67% to 75% is actually far less impressive when described as 8/12 vs 9/12. As a percentage it suggests a big move in the population; as a fraction it means only one more meeting the measure.

You can get a similar issue if a school is grading lessons/teaching and reports 72% good or better in one round of reviews, and then sees 84% in the next. (Many schools are still doing this type of grading and summary, I'm not going to debate the rights and wrongs here - there are other places for that). However the 72% is the result of 18 good or better out of 25 seen, and the 84% is the result of 21 out of 25. So the 12 percentage point jump is due to just 3 teachers flipping from one grade to the next.

Basically, when your population is below 100, an individual piece of data is worth more than 1 percentage point, and it's vital not to forget this. Quoting a small population as a percentage amplifies any apparent changes, and the smaller the population, the bigger the amplification. So with a small population a positive change looks more positive as a percentage, and a negative change looks more negative.
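As a rough rule of thumb (my framing, not a standard measure), one individual in a group of n is worth 100/n percentage points, which is where the amplification comes from:

```python
def points_per_person(n):
    """Percentage points that one individual contributes in a group of n."""
    return 100 / n

for n in (200, 100, 25, 12, 2):
    print(f"group of {n:3}: one person moves the figure by {points_per_person(n):.1f} points")
```

In the lesson-grading example above, each teacher in the group of 25 is worth 4 points, so 3 teachers changing grade produces the whole 12 point jump.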

Being able to calculate a percentage doesn't mean you should
I guess to some extent I'm talking about an aspect of numeracy that gets overlooked. The view could be that if you know the arithmetic method for calculating a percentage then, so long as you do that calculation correctly, the numbers are right. The logic follows that if the numbers are right then any decisions based on them must be right too. But this doesn't work.

The numbers might be correct but the decision may be flawed. Comparing this to a literacy example might help. I can write a sentence that is correct grammatically, but that does not mean the sentence must be true. The words can be spelled correctly, in the correct order and punctuation might be flawless. However the meaning of the sentence could be completely incorrect. (I appreciate that there might be some irony in that I may have made unwitting errors in this sentence about grammar - corrections welcome!)

For percentage calculations then the numbers may well be correct arithmetically but we always need to check the nature of the data that was used to generate these numbers and be aware of the limitations to the data. Taking decisions while ignoring these limitations significantly harms the quality of the decision.

Other sources of confusion
None of the above deals with variability or reliability in the measures used as part of your sample, but that's important too. If your survey of books could have given a slightly different result had you chosen different books, different students or different teachers, then there is an inherent lack of repeatability to the data. If you're reporting a change between two tests then anything within test-to-test variation simply can't be assumed to be a real difference. Apparent movements of 50% or more could be statistically insignificant if the process used to collect the data is unreliable. Again the numbers may be arithmetically sound, but the statistical conclusion may not be.
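A quick simulation makes the repeatability point visible (this is a hypothetical sketch with an assumed "true" marking rate, not real survey data):

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

def survey(sample_size, true_rate):
    """Simulate one book survey; return the observed percentage marked."""
    marked = sum(random.random() < true_rate for _ in range(sample_size))
    return 100 * marked / sample_size

# Run the same 12-book survey 1000 times with a true marking rate of 2 in 3.
results = sorted(survey(12, 2 / 3) for _ in range(1000))
print(f"middle 90% of surveys landed between {results[50]:.0f}% and {results[949]:.0f}%")
```

Even though nothing real changes between runs, individual 12-book surveys routinely disagree by 20 points or more - so a month-on-month movement of that size may be pure noise.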

Draw conclusions with caution
So what I'm really trying to say is that the next time someone starts talking about percentages try to look past the data and make sure that it makes sense to summarise it as a percentage. Make sure you understand what discrete limitations the population size has imposed, and try to get a feel for how sensitive the percentage figures are to small changes in the results.

By all means use percentages, but use them consciously with knowledge of their limitations.


As always - all thoughts/comments welcome...

Saturday, 7 June 2014

RAG123 is not the same as traffic lights

I've written regularly about RAG123 since November 2013 and since starting it as an initial trial in November I still view it as the single most important thing I've discovered as a teacher. It's now absolutely central to my teaching practice, but I do fear that at times people misunderstand what RAG123 is all about. They see the colours and they decide it is just another version of traffic lighting or thumbs up/across/down AFL. I'm sure it gets dismissed as "lazy marking", but the reality is that it is much, much more than marking.

As an example of this - judging RAG123 at a surface level without really understanding the depth - I was recently directed to the Ofsted document "Mathematics made to measure" found here. I'd read this document some time ago and it is certainly a worthwhile read for anyone in a maths department, particularly those leading/managing the subject, but it may well provide useful thoughts to those with other specialisms. There is a section (paragraphs 88-99) presented under the subheading "Marking: the importance of getting it right" - it was suggested to me that RAG123 fell foul of the good practice recommended in these paragraphs, was even explicitly criticised as traffic lighting, and as such isn't a good approach to follow.

Having read the document again I actually see RAG123 as fully in line with the recommendations of good practice in the Ofsted document and I'd like to try and explain why....

The paragraphs below (incl paragraph numbers) are cut & pasted directly from the Ofsted document (italics), my responses are shown in bold:

88. Inconsistency in the quality, frequency and usefulness of teachers’ marking is a 
perennial concern. The best marking noted during the survey gave pupils 
insight into their errors, distinguishing between slips and misunderstanding, and 
pupils took notice of and learnt from the feedback. Where work was all correct, 
a further question or challenge was occasionally presented and, in the best 
examples, this developed into a dialogue between teacher and pupil. 
RAG123 gives a consistent quality, and frequency to marking. Errors and misunderstandings seen in a RAG123 review can be addressed either in marking or through adjustments to the planning for the next lesson. The speed of turnaround between work done, marking done/feedback given, pupil response, follow up review by teacher means that real dialogue can happen in marking.

89. More commonly, comments written in pupils’ books by teachers related either 
to the quantity of work completed or its presentation. Too little marking 
indicated the way forward or provided useful pointers for improvement. The 
weakest practice was generally in secondary schools where cursory ticks on 
most pages showed that the work had been seen by the teacher. This was 
occasionally in line with a department’s marking policy, but it implied that work 
was correct when that was not always the case. In some instances, pupils’ 
classwork was never marked or checked by the teacher. As a result, pupils can 
develop very bad habits of presentation and be unclear about which work is 
correct.
With RAG123 ALL work is seen by the teacher - there is no space for bad habits to develop or persist. While it can be that the effort grading could be linked to quantity or presentation it should also be shaped by the effort that the teacher observed in the lesson. Written comments/corrections may not be present in all books but corrections can be applied in the next lesson without the need for the teacher to write loads down. This can be achieved in various ways, from 1:1 discussion to changing the whole lesson plan.

90. A similar concern emerged around the frequent use of online software which 
requires pupils to input answers only. Although teachers were able to keep 
track of classwork and homework completed and had information about 
stronger and weaker areas of pupils’ work, no attention was given to how well 
the work was set out, or whether correct methods and notation were used.
Irrelevant to RAG123

91. Teachers may have 30 or more sets of homework to mark, so looking at the 
detail and writing helpful comments or pointers for the way forward is time 
consuming. However, the most valuable marking enables pupils to overcome 
errors or difficulties, and deepen their understanding.
Combining RAG123 with targeted follow up/DIRT does exactly this in an efficient way.


Paragraphs 92 & 93 simply refer to examples given in the report and aren't relevant here.

94. Some marking did not distinguish between types of errors and, occasionally, 
correct work was marked as wrong.
Always a risk in all marking - RAG123 is not immune, but neither is any other approach. However, given that RAG123 only focuses on a single lesson's work, the quantity is smaller so there is a greater chance that variations in students' work will be seen and addressed.

95. At other times, teachers gave insufficient attention to correcting pupils’ 
mathematical presentation, for instance, when 6 ÷ 54 was written incorrectly 
instead of 54 ÷ 6, or the incorrect use of the equals sign in the solution of an 
equation.
Again a risk in all marking and RAG123 is not immune, but it does give the opportunity for frequent and repeated corrections/highlighting of these errors so that they don't become habits.

96. Most marking by pupils of their own work was done when the teacher read out 
the answers to exercises or took answers from other members of the class. 
Sometimes, pupils were expected to check their answers against those in the 
back of the text book. In each of these circumstances, attention was rarely paid 
to the source of any errors, for example when a pupil made a sign error while 
expanding brackets and another omitted to write down the ‘0’ place holder in a 
long multiplication calculation. When classwork was not marked by the teacher 
or pupil, mistakes were unnoticed.
With RAG123 ALL work is seen by the teacher - they can look at incorrect work and determine what the error was, either addressing it directly with the student or if it is widespread taking action at whole class level.

97. The involvement of pupils in self-assessment was a strong feature of the most 
effective assessment practice. For instance, in one school, Year 4 pupils 
completed their self-assessments using ‘I can …’ statements and selected their 
own curricular targets such as ‘add and subtract two-digit numbers mentally’ 
and ‘solve 1 and 2 step problems’. Subsequent work provided opportunities for 
pupils to work on these aspects. 
The best use of RAG123 asks students to self assess with a reason for their rating. Teachers can review/respond and shape these self assessments in a very dynamic way due to the speed of turnaround. It also gives a direct chance to follow up by linking to DIRT.

98. An unhelpful reliance on self-assessment of learning by pupils was prevalent in 
some of the schools. In plenary sessions at the end of lessons, teachers 
typically revisited the learning objectives, and asked pupils to assess their own 
understanding, often through ‘thumbs’, ‘smiley faces’ or traffic lights. However, 
such assessment was often superficial and may be unreliable.
Assessment of EFFORT as well as understanding in RAG123 is very different to these single dimension assessments. I agree that sometimes the understanding bit is unreliable. However with RAG123 the teacher reviews and changes the pupil's RAG123 rating based on the work done/seen in class. As such it becomes more accurate once reviewed. Also the reliability is often improved by asking students to explain why they deserve that rating. The effort bit is vital though... If a student is trying as hard as they can (G) then it is the teacher's responsibility to ensure that they gain understanding. If a student is only partially trying (A) then the teacher's impact will be limited. If a student is not trying at all (R) then even the most awesome teacher will not be able to improve their understanding. By highlighting and taking action on the effort side it emphasises the student's key input to the learning process. While traffic lights may very well be ineffective as a single shot self assessment of understanding, when used as a metaphor for likely progress given RAG effort levels then Green certainly is Go, and Red certainly is Stop.

99. Rather than asking pupils at the end of the lesson to indicate how well they had 
met learning objectives, some effective teachers set a problem which would 
confirm pupils’ learning if solved correctly or pick up any remaining lack of 
understanding. One teacher, having discussed briefly what had been learnt with 
the class, gave each pupil a couple of questions on pre-prepared cards. She 
took the cards in as the pupils left the room and used their answers to inform 
the next day’s lesson planning. Very occasionally, a teacher used the plenary 
imaginatively to set a challenging problem with the intention that pupils should 
think about it ready for the start of new learning in the next lesson. 
This is an aspect of good practice that can be applied completely alongside RAG123, in fact the "use to inform the next day's lesson planning" is something that is baked in with daily RAG123 - by knowing exactly the written output from one lesson you are MUCH more likely to take account of it in the next one.

So there you have it - I see RAG123 as entirely in line with all the aspects of best practice identified here. Don't let the traffic light wording confuse you - RAG123 as deployed properly isn't anything like a single dimension traffic light self assessment - it just might share the colours. If you don't like the colours and can't get past that bit then define it as ABC123 instead - it'll still be just as effective and it'll still be the best thing you've done in teaching!

All comments welcome as ever!

Reflecting on reflections

Reflecting is hard, really hard! It requires an honesty with yourself, an ability to take a step back from what you've done (that you have a personal attachment to) and to think deeply about how successful you've been. Ideally it should also involve some diagnosis on why you have/haven't been successful, and what you might do differently the next time you face a similar situation.

Good reflection is really high order thinking
If you consider where the skills and type of thinking required for reflection lie in Bloom's taxonomy then it's the top end, high order thinking. You have to analyse and evaluate your performance, and then create ideas on how to improve.
Picture from http://en.wikipedia.org/wiki/Bloom's_taxonomy
Some people don't particularly like Bloom's and might want to lob rocks at anything that refers to it. If you'd prefer to use an alternative taxonomy like SOLO (see here) then we're still talking about the higher end Relational and Extended Abstract types of thinking. Anyone involved in reflection needs to make links between various areas of understanding, and ideally extend this into a 'what if' situation for the future. Basically, use whatever taxonomy of thinking you like and reflection/metacognition is right at the top in terms of difficulty.

The reason I am talking about this is that one of the things I keep seeing on twitter, and also in observation feedback, work scrutiny evaluations and so on, is comments about poor quality self assessment & reflections from students.

Sometimes this is a teacher getting frustrated when students asked to reflect just end up writing comments like "I understood it," "I didn't get it" or "I did ok." Other times it is someone reviewing books that might suggest that the student's reflections don't indicate that they know what they need to do to improve.

It often crops up, and one of the ways I most often hear about it is when someone is first trying out RAG123 marking (Not heard of RAG123? - see here, here and then any of these). This structure for marking gives so many opportunities for self assessment and dialogue that the teacher sees lots of relatively poor reflective comments in one go and finds it frustrating.

Now having thought about the type of thinking required for good reflection is it a real surprise that a lot of students struggle? To ask a group to reflect is pushing them to a really high level of thought. Asking the question is completely valid, it's good to pose high order questions, but we really shouldn't be surprised if we get low order answers even from very able students, and particularly from weaker students. Some may not yet have the cognitive capacity to make a high quality response, for others it might be a straight vocabulary/literacy issue - students can't talk about something coherently unless they have the appropriate words at their disposal.

Is it just students?
The truth is that many adults struggle to reflect well. Some people struggle to see how good things actually were because they get hung up on the bad things. Others struggle to see the bad bits because they are distracted by the good bits. Even then many will struggle to do the diagnosis side and look for ways to improve. It's difficult to recognise flaws in yourself, and often even harder to come up with an alternative method that will improve things. If we all found it easy then the role of coaches and mentors would be redundant.

As part of thinking about how well our students are reflecting perhaps we should all take a little time to think about how good we are at reflecting on our own practice? How honest are we with ourselves? How objective are we? How constructive are we in terms of making and applying changes as a result of our reflections?

Don't stop just because it's difficult
Vitally, just because students struggle to reflect in a coherent or high order way doesn't mean we should stop asking them to reflect. But we shouldn't be foolish enough to expect a spectacularly insightful self assessment from students the first time they try it. As with any cognitive process we should give them support to help them structure their reflections. This support is the same kind of scaffolding that may be needed for any other learning:

Model it: Show them some examples of good reflection. Perhaps even demonstrate it in front of the class by reflecting on the lesson you've just taught?
Give a foothold: Sentences are easier to finish than to start - perhaps give them a sentence starter, or a choice of sentence starters - the improvement in quality is massive (See this post for some ideas on this)
Give feedback on the reflections: As part of responding to the reflections in marking dialogue give guidance on how they could improve their reflections and not just their work.
Give time for them to improve: A given group of students that have never self assessed before shouldn't be expected to do it perfectly, but we should expect them to get better at it given time and guidance.

As ever I'd be keen to know your thoughts, your experiences and if you've got any other suggestions....