Point to Cover: Did The Don finish with an average of 100 plus?

(This post by the author is also available on Cricketcountry.com)

Consider a Test series with Batsman A coming in at No 3 and rattling up scores of 166, 23, 1, 40, 3, 53 not out and 20 – a total of 306.

Now think of the No 11 batsman B of the same team in the same series, who scores 8 not out, 23 not out, 11, 7 not out, 2 not out - a total of 51.

If the series has not been a particularly tall-scoring run-feast, batsman A and tailender B end up sharing the top spot in the batting charts with an average of 51.00 apiece.
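The arithmetic behind that shared figure can be sketched in a few lines. The helper name `batting_average` and the `(runs, dismissed)` representation are chosen here purely for illustration.

```python
def batting_average(innings):
    """Traditional average: total runs divided by the number of dismissals.

    innings is a list of (runs, dismissed) pairs; a not out has dismissed=False.
    A batsman who was never dismissed has no defined average (division by zero).
    """
    runs = sum(r for r, _ in innings)
    dismissals = sum(1 for _, dismissed in innings if dismissed)
    return runs / dismissals


# Scores from the example above; False marks a not out innings.
batsman_a = [(166, True), (23, True), (1, True), (40, True),
             (3, True), (53, False), (20, True)]
batsman_b = [(8, False), (23, False), (11, True), (7, False), (2, False)]

print(batting_average(batsman_a))  # 306 runs / 6 dismissals = 51.0
print(batting_average(batsman_b))  # 51 runs / 1 dismissal  = 51.0
```

Batsman B gets to 51.00 by being dismissed only once, which is exactly the quirk the rest of this article is about.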

However, can their performances be considered equivalent?

This discrepancy has made many a mathematically minded follower of the game frown, and one of the more severe of the knitted brows belongs to the otherwise amiable visage of Dr. Shubhabrata Das, a statistician trained at the Indian Statistical Institute and currently a professor at IIM Bangalore.

After his first presentation on the topic at a departmental seminar of the University of Nebraska in 1993, his research on cricket averages was often sidetracked because of other pressing academic topics, but whenever he pursued it, questions were raised about the averages of cricketers with curious caveats.

If averages were adjusted to eliminate the fallacy generated by the treatment of not outs, what would the historical ranking of batsmen look like? Would Sobers and Sutcliffe exchange places in the honour roll?
Would Sir Don Bradman end up with an average of 100-plus in the modified set up?

At Durban, during the 57th session of the International Statistical Institute in 2009, Dr. Das presented his thoughts on the topic to a lot of acclaim, and found a number of similar-minded followers of the game in the academic circles of England and South Africa. This egged him on and at present, he is in the process of finishing the final bits of data analysis before submitting the work to reputed academic journals. In a couple of weeks, the detailed analysis, results and ranking will be available as a working paper of IIM Bangalore.

Let me clarify at the outset – this is not an article which scoffs at the uselessness of the currently accepted numbers in the game of cricket. Quite the contrary, it argues that figures are essential indicators of quality left in the wake of on-field exploits. There is, however, a long-standing gap in the analysis. And this hole needs to be plugged because, unlike in any other sport, every moment on the cricket field somehow manages to go down in the record books as a numerical footprint of a career.

Even a ball passing harmlessly outside the off stump into the gloves of the ‘keeper impacts the history that is etched out by numbers. It takes the batsman one ball closer to Rahul Dravid's record of the maximum number faced, adjusts his strike-rate with a minute quiver to the right of the decimal point and makes the slightest of alterations to the economy and strike rates of the bowler. If there is no action in the middle, the captain and bowler converse, the field moves around, a fielder is dispatched to the deep – even these boil down to more minutes spent in the middle, chipping away at Hanif Mohammad's record 970-minute marathon at Bridgetown in 1958.

There are sceptics who turn their backs to cricketing numbers. Some consider statistics to be a poor reflection of the glorious game played out in the middle. Some find the superlative digital traces left by a masterly career difficult to acknowledge while indulging in fault-finding outbursts. And finally there are the ones whose worshipped idols are themselves numerically handicapped, not possessing figures fantastic enough to scale up to the heroic adulation.

While the record books admittedly have no way of capturing the elegance of an innings, there is also no documented occasion of a match being awarded to the lower scoring side based on the beauty of a late cut. At the end of the day, runs scored, wickets taken and catches held on the field go on to win a match, and by corollary, the more prolifically a batsman scores and the more frequently a bowler takes wickets, the more valuable they are to the team. Of two batsmen, if one scores consistently more than the other, the average computed for him will also end up being the higher of the two. One can take cover behind Mark Twain's largely incidental and light-hearted remark about lies, damned lies and statistics, but it is impossible to deny that averages and aggregates form excellent indicators of the quality of cricketers.

It is not surprising, therefore, that the batting average has long been the standard by which a batsman is measured against the willow wielders down the ages. And ever since the second quarter of the 20th century, the indicator has demonstrated remarkable consistency, which vindicates the faith of the cricket follower.

Leaving aside the astronomical 99.94 of The Don, the remaining super greats, greats, the very good and the good can be more or less identified by the range in which their batting averages respectively lie. This yardstick has transcended generations. An average between 55 and 60 signifies the pinnacle and 45-50 the very good, irrespective of era. Wally Hammond can be bucketed in the same group as Sachin Tendulkar and Denis Compton with Javed Miandad, much as expected, across the ripples of time.

However, happy though the general followers as well as the mathematically inclined have been with the computation of the average, some of the latter have looked askance at the way not outs are considered. As shown in the example at the beginning of the article, sometimes the results of the analysis are too glaringly counterintuitive.

The primary problem arises from the treatment of not outs as incomplete innings, with the accumulated runs simply added to the total aggregate. What this implies is that irrespective of whether the score of the unbeaten innings is 2* or 227*, it is dealt with in the same way. However, is a batsman, say Brian Lara, batting on two expected to score the same number of additional runs as when batting on 227? Should the two not be treated differently?

Let us look at the problem with the help of an analogy. Suppose in a population of five people, three pass away at their respective ages of 80, 78 and 84. The two others are still alive at 76 and 82. Is the expected longevity of the population around 80 or 133?

The second answer is ridiculously wrong, but if we compute the average longevity by applying the not out rule, we get 133.33.
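For anyone who wants to check the 133.33, here is the not out rule applied to the five ages. The variable names are made up for this sketch.

```python
# Ages from the analogy: three completed lifetimes, two people still alive.
ages_at_death = [80, 78, 84]
ages_still_alive = [76, 82]

# "Not out rule": add up every age, but divide only by the number of deaths.
not_out_rule = (sum(ages_at_death) + sum(ages_still_alive)) / len(ages_at_death)

# For comparison: the plain mean of the three completed lifetimes.
completed_only = sum(ages_at_death) / len(ages_at_death)

print(round(not_out_rule, 2))    # 400 / 3 = 133.33
print(round(completed_only, 2))  # 242 / 3 = 80.67
```

The survivors' years inflate the numerator while contributing nothing to the denominator, which is the same thing a string of not outs does to a batting average.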

The two surviving venerable gentlemen would have to be extremely lucky to dodge the doosras and googlies bowled by the agents of mortality for another 50 years. Besides, medical practitioners or insurance consultants predicting 133 would probably lose their jobs and not get them back even if they themselves lived to be 133.

Survival Analysis

Here is where statistical training provides a solution. Dr. Shubhabrata Das turned to a branch of the subject known as Survival Analysis, and used a technique called the Kaplan-Meier estimator.

The K-M estimator, widely used in medical research and economics, tackles the above problem by arriving at the best prediction of survival in the population based on ages of both the surviving and the expired.

The answer provided by K-M for the longevity problem is an expected survival age of about 81.5, which does seem a lot more believable than 133.

Using the same K-M estimation in the cricketing scenario, now analogously considering the out scores to be ages at death of a population and the not-out scores to be ages of the surviving members, we can revisit the example of Batsman A and tailender B provided at the very beginning of this article.

According to the K-M estimates, Batsman A averages 59.86 and batsman B 17.50. This indeed seems a lot better than slotting both the players together with an average of 51.
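Batsman A's figure can be reproduced with a textbook product-limit calculation. What follows is a minimal sketch of the standard Kaplan-Meier mean, not Dr. Das's own code; the function name `km_average` is invented here. One caveat: when a batsman's highest score is a not out (as with batsman B), the estimate depends on a convention for the tail that this article does not spell out, so the sketch declines to guess in that case.

```python
def km_average(innings):
    """Kaplan-Meier (product-limit) estimate of the mean score.

    innings is a list of (runs, dismissed) pairs. Dismissals play the role of
    deaths and not outs the role of censored observations.
    """
    # Sort by score; at equal scores, dismissals are processed before not outs.
    ordered = sorted(innings, key=lambda inn: (inn[0], not inn[1]))
    if not ordered[-1][1]:
        # Survival never reaches zero, so the mean needs an extra tail
        # assumption that this sketch deliberately does not make.
        raise ValueError("highest score is a not out; tail convention needed")

    at_risk = len(ordered)   # innings that have reached the current score
    survival = 1.0           # estimated probability of scoring beyond it
    mean = 0.0
    previous = 0
    for runs, dismissed in ordered:
        # Area under the survival curve between consecutive scores.
        mean += survival * (runs - previous)
        previous = runs
        if dismissed:
            survival *= (at_risk - 1) / at_risk
        at_risk -= 1
    return mean


batsman_a = [(166, True), (23, True), (1, True), (40, True),
             (3, True), (53, False), (20, True)]
print(round(km_average(batsman_a), 2))  # 59.86
```

In effect, the unbeaten 53 hands its share of the weight to the only innings that went past it, the 166, which is why the estimate rises from 51.00 to 59.86.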

Dr. Das has computed the K-M estimates for all the batsmen with averages over 50 in Test cricket. The results are documented below (minimum qualification: 20 Tests and a 50-plus average; updated to November 26, 2011).

Batsman	Traditional Avg	K-M Modified Avg	Rank by Traditional Avg	Rank by K-M Modified Avg
DG Bradman (Aus)	99.94	98.98	1	1
RG Pollock (SA)	60.97	62.14	2	2
GA Headley (WI)	60.83	60.89	3	5
H Sutcliffe (Eng)	60.73	59.50	4	6
E Paynter (Eng)	59.23	58.56	5	8
KF Barrington (Eng)	58.67	57.75	6	10
ED Weekes (WI)	58.61	58.56	7	7
WR Hammond (Eng)	58.45	61.07	8	3
IJL Trott (Eng)	57.79	58.23	9	9
GS Sobers (WI)	57.78	61.03	10	4
JB Hobbs (Eng)	56.94	56.53	11	14
KC Sangakkara (SL)	56.93	57.43	12	11
JH Kallis (ICC/SA)	56.89	55.61	13	16
CL Walcott (WI)	56.68	56.67	14	13
L Hutton (Eng)	56.67	57.39	15	12
SR Tendulkar (India)	56.02	55.91	16	15
GS Chappell (Aus)	53.86	54.01	17	18
AD Nourse (SA)	53.81	54.04	18	17
R Dravid (ICC/India)	53.22	53.68	19	20
BC Lara (ICC/WI)	52.88	53.04	20	23
TT Samaraweera (SL)	52.61	53.22	21	22
Javed Miandad (Pak)	52.57	53.87	22	19
RT Ponting (Aus)	52.53	52.86	23	24
Mohammad Yousuf (Pak)	52.29	51.16	24	26
V Sehwag (ICC/India)	52.15	52.65	25	25
MEK Hussey (Aus)	51.73	51.16	26	31
J Ryder (Aus)	51.62	52.09	27	28
A Flower (Zim)	51.54	53.28	28	21
DPMD Jayawardene (SL)	51.30	51.90	29	29
Younis Khan (Pak)	51.20	52.19	30	27
SM Gavaskar (India)	51.12	51.05	31	32
SR Waugh (Aus)	51.06	51.20	32	30
ML Hayden (Aus)	50.73	50.68	33	33
AR Border (Aus)	50.56	50.15	34	37
KP Pietersen (Eng)	50.48	50.51	35	35
IVA Richards (WI)	50.23	50.60	36	34
DCS Compton (Eng)	50.06	50.48	37	36

From the table, we see that the averages of certain batsmen like Wally Hammond and Sir Garfield Sobers undergo a sharp increase, while in the case of Jacques Kallis there is a moderate drop. There is a significant reshuffling in the average-based ranking for people like Len Hutton, Allan Border and a few others.

Overall, however, it seems that during long careers, the discrepancies due to not out scores get evened out and the difference between the traditional and K-M average is not too spectacular. At the same time, we have already seen the potency of average correction in a short series. A real life example is given below.

India vs Pak in India 2005	Anil Kumble	Virender Sehwag
Runs scored	1, 21, 14, 22, 37	173, 36, 81, 15, 201, 38
Traditional Average	97.00	90.66
Kaplan Meier Estimated Average	30.00	90.66

It does also seem that the major difference and adjustment take place only for lower order batsmen who end up with a lot of low not out scores.

However, when we take a look at the One-Day Internationals (ODIs), the differences are quite drastic for some big batting names as well.

Top ODI batsmen	Traditional Avg	KM Average
HM Amla	55.17	53.09
MG Bevan	53.58	46.13
IJL Trott	51.37	49.19
MEK Hussey	51.18	44.93
MS Dhoni	51.16	51.02
Z Abbas	47.63	45.83
IVA Richards	47.00	46.95
V Kohli	46.78	43.39
MJ Clarke	45.50	43.16
J Kallis	45.49	43.40
SR Tendulkar	45.16	44.64
RT Ponting	42.64	41.68
SC Ganguly	41.02	40.59
BC Lara	40.49	39.99

As we observe, the vaunted averages of Mike Hussey and Michael Bevan do indeed take major hits when we apply the K-M adjustment. Considering that these two men have been finishers with plenty of not out innings, the results provide some food for thought.

The Kaplan-Meier estimator has been used for adjusted batting averages before, notably in a paper by Kimber and Hansford in the Journal of the Royal Statistical Society, 1993. Dr. Das brings in further innovation through the interesting approach of modelling the scores using something called the Generalised Geometric Distribution.

The professor argues that batsmen have a variable risk of getting dismissed based on their current score in the innings. In other words, before opening his account a batsman may be very vulnerable, growing steadier between 10 and 89, then vulnerable again in the 90s and so on. If we look at scorebooks, VVS Laxman is more prone to get out in the 20s, and David Gower was very susceptible in the 70s.

Based on the available data, one can try to fit the risk of dismissal for each batsman at different phases of the innings. Depending on the assumptions, one can fit as rigorous a model as required, and try estimating the career average with the resulting Geometric distribution. The table below gives us some options for models along with Don Bradman's average as computed by each. As we go down the rows, the models get more and more refined.

Model Number	Assumption on risk of getting out at different phases of an innings	Bradman’s average based on fitted Geometric model
1	At all times, irrespective of the score, the chances of getting out are the same (traditional average)	99.94
2	At zero there is a particular degree of vulnerability. At 1-9 there is a different degree of vulnerability. 10 onwards the vulnerability is something else.	100.02
3	A batsman has different vulnerability at 0, 1-9, 10-99, just after 100 (100-105) and more than 105	99.95
4	The vulnerability is different for each different score (Generalised method – equivalent to Kaplan Meier)	98.98

Indeed, for some of these models, the average of The Don does pass the three figure mark.
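To get a feel for how such a ladder of models behaves, here is a rough sketch of a piecewise-constant risk model. It rests on assumptions of this sketch rather than on the working paper: the risk of dismissal is taken as constant within each score band, it is estimated as dismissals divided by the number of scores "faced" in that band, and a not out innings is assumed not to have faced the risk at its final score. The name `band_model_average` and the band layout are hypothetical, and the data is Batsman A from the opening example, since Bradman's innings list is not reproduced in this article.

```python
import math


def band_model_average(innings, band_starts):
    """Expected score when the dismissal risk is constant within each band.

    innings: list of (runs, dismissed) pairs.
    band_starts: increasing score thresholds beginning at 0; the last band
    is open-ended (it runs from its start to infinity).
    """
    bounds = list(band_starts) + [math.inf]
    reach = 1.0   # probability of surviving every score considered so far
    mean = 0.0    # running sum of P(score > x) over x = 0, 1, 2, ...
    for lo, hi in zip(bounds, bounds[1:]):
        dismissals = sum(1 for runs, out in innings if out and lo <= runs < hi)
        # Scores faced inside this band. A dismissed innings faced 0..runs;
        # a not out innings is assumed to have faced only 0..runs-1.
        exposures = sum(
            max(0, min(runs + 1 if out else runs, hi) - lo)
            for runs, out in innings
        )
        hazard = dismissals / exposures if exposures else 0.0
        stay = 1.0 - hazard
        if hi == math.inf:
            if hazard > 0:
                mean += reach * stay / hazard   # geometric tail in closed form
            elif reach > 0:
                return math.inf                 # no dismissal ever ends the innings
            break
        for _ in range(hi - lo):
            reach *= stay
            mean += reach
    return mean


batsman_a = [(166, True), (23, True), (1, True), (40, True),
             (3, True), (53, False), (20, True)]

# One band: the same risk at every score, as in Model 1.
print(round(band_model_average(batsman_a, [0]), 2))         # 51.0
# Bands at 0, 1-9 and 10 onwards, as in Model 2.
print(round(band_model_average(batsman_a, [0, 1, 10]), 2))
# A separate band for every score up to his highest, as in Model 4.
print(round(band_model_average(batsman_a, range(167)), 2))  # 59.86
```

With a single band the sketch lands on the traditional 51.00, and with a band per score it lands on the K-M figure of 59.86 quoted earlier, which mirrors the first and last rows of the table.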

As a final note of interest, even though historically Bradman required just four in the last innings to end with a career average of 100, if the averages had been computed using K-M, he would have required 89. With this striking revelation, the great man now seated in the far pavilions can perhaps forgive himself for getting out when the landmark had seemed just a stroke away.
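The "just four" is easy to verify on the traditional scale. The career figures used below (6,996 runs and 69 dismissals before his final innings) come from the standard record rather than from this article, and the helper `average_after` is made up for the illustration. The K-M requirement of 89 cannot be checked the same way, since it needs Bradman's full innings-by-innings list.

```python
# Bradman's Test record going into his final innings (standard record,
# not quoted in this article): 6,996 runs for 69 dismissals.
RUNS_BEFORE = 6996
DISMISSALS_BEFORE = 69


def average_after(final_score, dismissed=True):
    """Traditional career average once the final innings is added."""
    dismissals = DISMISSALS_BEFORE + (1 if dismissed else 0)
    return (RUNS_BEFORE + final_score) / dismissals


print(round(average_after(0), 2))  # the duck he actually made: 99.94
print(average_after(4))            # out for four instead: 100.0
```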

Tuesday, December 6, 2011

Did The Don finish with an average of 100 plus? - A relook at batting averages
