Saturday, 1 March 2014

Effect size and #RAG123

I first decided to do a trial of RAG123 in November, as described here. My thinking was that if it worked I'd share it more widely. The results were so positive that I've actually never stopped using this approach since - it's become a vital part of my marking and feedback approach, but also forms a massive part of my planning too. If you're new to it all and need a fuller description of RAG123 then see this post.

The downside of just giving it a go for a week in November is that I have really struggled to find clean, objective data to compare before and after and so assess the impact of RAG123 on pupil progress. While I am confident that students like it, I like it, and it feels beneficial, it would be nice to have some proof that it's actually beneficial.

A prompt from Mr Benney...
I had been fairly content to just carry on using RAG123 and accept that a clean data set was lost to me. However, Damian Benney (@benneypenyrheol) then not only picked up RAG123 but also did this fantastic bit of work on a 5min Research plan for RAG123 (here), including crunching some data to work out an effect size of 0.73. (Yes, effect sizes need to be used with caution, but it's an indicator to go along with all the other feel-good positives.)

With this great data and apparently massively positive outcome to inspire me, I revisited our departmental spreadsheets to see if I could find some clean comparable data. With a bit of searching I found what I was looking for: 2 parallel year 8 classes, nominally of similar ability, one taught by a teacher who started to use RAG123 early, and the other by a teacher who started later. This meant there was one full assessment for which RAG123 was pretty much the only difference.

Results for comparison
The two groups have done 3 assessment tests so far this year - they sit identical tests during the same week of term. Scores are out of 50, and the mean scores for the two groups are as follows:
As can be seen, in tests 1 and 2 the groups achieved broadly similar scores in or around the low 30s, but in test 3 group 1 takes a step up to 37.1.

Same teacher with/without RAG123
Group 1 had "normal" feedback, marking and planning up to test 2, but their books were marked and lessons planned using RAG123 for the content of test 3. Taking the step change of +4.7 marks from the average of tests 1 and 2 to the test 3 level, and dividing by the standard deviation calculated across all 3 tests (8.1) we get an effect size of +0.58. A really pleasing figure.
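For anyone who wants to repeat the sum, here is a minimal sketch of that calculation in Python. The function and variable names are my own choices for illustration, and it only uses the summary figures quoted in this post (32.4 for the average of tests 1 and 2, 37.1 for test 3, 8.1 for the standard deviation), not the raw marks.

```python
def effect_size(mean_change: float, standard_deviation: float) -> float:
    """Change in mean score divided by the spread of the scores."""
    if standard_deviation <= 0:
        raise ValueError("standard deviation must be positive")
    return mean_change / standard_deviation


# Summary figures quoted in the post for group 1 (marks out of 50).
baseline_mean = 32.4  # mean of tests 1 and 2, before RAG123
test3_mean = 37.1     # mean of test 3, with RAG123
sd_all_tests = 8.1    # standard deviation across all 3 tests

step_change = test3_mean - baseline_mean
print(round(step_change, 1))                             # 4.7
print(round(effect_size(step_change, sd_all_tests), 2))  # 0.58
```

Dividing by the standard deviation is what lets a result on a 50 mark test be compared with results from other tests and other studies.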

It's also worth noting that the standard deviation for test 3 on its own is just 6.4, compared with 7.4 and 9.3 for tests 1 and 2 respectively. So the average score is better and it is also less variable. In fact the lowest score on test 3 is 5 marks higher than on test 1, and 10 higher than on test 2! This lower variability in scores is echoed in the interquartile range, which is 9 marks for test 3 compared to 12.5 and 11.75 for tests 1 and 2.
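If you want to pull the same measures of spread out of your own class marks, a rough sketch using Python's standard library is below. The marks in the example are made up purely for illustration and are NOT the real class data. The choice of the sample standard deviation and the "inclusive" quartile method is my assumption about how a typical spreadsheet would calculate them.

```python
from statistics import quantiles, stdev


def spread_summary(scores):
    """Return the standard deviation, interquartile range and lowest score."""
    # quantiles(..., n=4) returns the three quartile cut points.
    q1, _median, q3 = quantiles(scores, n=4, method="inclusive")
    return {
        "standard_deviation": stdev(scores),  # sample standard deviation
        "interquartile_range": q3 - q1,
        "lowest_score": min(scores),
    }


# Hypothetical marks out of 50, for illustration only.
example_scores = [20, 25, 30, 35, 40]
summary = spread_summary(example_scores)
print(summary)
```

Running this once per test gives the figures needed to see whether scores have become more consistent as well as higher.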

In summary the results for test 3 suggest that RAG123 has had the following effects when group 1's performance is compared with its own prior results:

  • Raised the average mark by almost 5 marks, which is almost 10% of the 50 marks available (note the median mark rose by almost 6 marks)
  • Made the group's marks more consistent as demonstrated by a reduced standard deviation and interquartile range.
  • Had an estimated effect size of +0.58
Different teacher with parallel groups with/without RAG123
This one is probably slightly sketchier, as there are more variables at play given 2 different teachers and different classes. However, I'll have a go...

Group 2 was in receipt of "normal" feedback, marking and planning through all tests 1, 2 and 3.

For group 1 the mean score for the first 2 tests is 32.4, with a standard deviation of 8.4. For group 2 the equivalent mean for the first 2 tests is 31.8, with a standard deviation of 6.6. As such, without RAG123 I'm broadly taking these two classes as performing at a comparable level. (Note: I fully acknowledge that there is a relatively large discrepancy between tests 1 and 2 for group 2, which could make this bit of analysis a bit uncertain. What can I say, it's real data and all I've got for this analysis. If you want to ignore this data then feel free! I believe what may have happened is that group 2 were shocked by their poor scores in test 1, put more effort into test 2 and took their foot back off the gas in test 3.)

Group 1 scored on average 4.1 marks higher than Group 2 in Test 3. The standard deviation for both groups combined in test 3 was 6.9, giving an apparent effect size of 0.59.
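With the figures above the sum is simply 4.1 / 6.9, which rounds to 0.59. If you have the raw marks for two parallel groups, the same comparison can be sketched in Python as below. The function name and the example marks are hypothetical (they are not the real class data), and using the sample standard deviation of both groups combined is my assumption about how to mirror the calculation described here.

```python
from statistics import mean, stdev


def effect_size_between_groups(rag123_scores, comparison_scores):
    """Difference in group means divided by the SD of both groups combined."""
    combined = list(rag123_scores) + list(comparison_scores)
    combined_sd = stdev(combined)  # sample standard deviation of all marks
    if combined_sd == 0:
        raise ValueError("all marks are identical, so the effect size is undefined")
    return (mean(rag123_scores) - mean(comparison_scores)) / combined_sd


# Hypothetical marks out of 50, for illustration only.
group_with_rag123 = [38, 40, 42]
group_without_rag123 = [32, 34, 36]
print(effect_size_between_groups(group_with_rag123, group_without_rag123))

# The summary figures quoted in the post give the same kind of ratio.
print(round(4.1 / 6.9, 2))  # 0.59
```

A positive value means the first group scored higher on average. It says nothing about why, which is where the caveats about different teachers and classes come in.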

In summary the results suggest that RAG123 may have contributed to the following effects when group 1's results are compared with those of group 2, which is nominally a parallel set, and has comparable prior results:
  • Raised mean mark by about 4 marks, approx 8% (note median mark was actually 7 marks higher)
  • Given an estimated effect size of +0.59
  • However, given the greater variability in group 2's marks from one test to the next, there is a level of uncertainty in this data.
Visible progress
Perhaps more powerful than all of these cold hard stats, though, are the colour progress charts we use for these groups (colours indicate progress to target). I've tweeted these kinds of things before but it's quite stark when you compare the two groups in this particular study...

Group 1 looks like this: (red, yellow & orange are below target, green, blue and purple are on or above)

Group 2 looks like this: (red, yellow & orange are below target, green, blue and purple are on or above)

In terms of performance to target it is clear that group 1 are doing MUCH better in test 3 than either group has ever done.

Overall effects
As an overview, all the data analysed so far supports the feeling that RAG123 is effective. I'm sure the data has holes in it, but ALL data collected in education has them, because it deals with individual students and individual teachers and it is impossible to completely eliminate other factors from the experiment.

The data shown here clearly supports the view that RAG123 improves progress, both when the teacher is kept constant and the marking changed, and when the test & type of class are kept constant. Taking Mr Benney's study into account, that's 3 analyses that show an effect size a good step bigger than 0.5. Whether the effect size is really 0.58, 0.73 or another value is probably beside the point...

The bottom line is it's demonstrably beneficial AND it's better for overall workload!!

Once you get started with RAG123 it just feels right. The students respond really positively and REALLY value the little comments and dialogue that develop with the teacher. We've even had one student choose to make a child protection related disclosure by writing it down in their maths book. In later discussions it became clear that they chose this route because they knew their maths teacher reviewed their books daily and would respond to it! Clearly this is a sad situation for the child, but they're now getting the help they need and it's a powerful endorsement of the relationship that RAG123 helps to build.

As a further example of the kind of thing RAG123 can do, here is one student's self assessment comment from Thursday this week:
He knew he'd not worked hard enough and was honest enough to both say so and apologise. (In truth though he had done more than quite a few others, and I'd have probably given him an amber for effort, but I very rarely raise a pupil's self assessed effort level - if they think they could have worked harder then who am I to argue!)

On Friday he arrived in lesson with a completely different attitude, completed masses of work and left me this comment:
I didn't give him a sanction for lack of effort - he self assessed and knew he needed to improve on yesterday... He then delivered on it! This kind of self reflection and reaction is commonplace for RAG123 marking.

What's stopping you?
Frankly if you're still skeptical that RAG123 is worthwhile then I have to wonder what more evidence is needed!!

If you've not tried it yet I strongly urge you to give it a go. It's 4 months since the #RAG123 hashtag was created and I'm losing track of the number of people using this approach. However I've still NEVER found anyone who has tried it who says they want to stop. In fact I've NEVER had anyone who has tried it say it is anything other than beneficial on all levels!!

For those of you that have tried it - if you've got data of before/after that could be analysed it would be really interesting to see that too!

As always all thoughts & comments appreciated....
