Construct validity of NIRF 2019 scores
There are nearly 900 Universities, more than 40,000 Colleges, and another 10,000 Stand Alone Institutions in the higher education sector in India. The sector is so fragmented among these roughly 50,000 units of assessment that few institutions have the critical mass to deliver high-end research along with high-quality education. The National Institutional Ranking Framework (NIRF) of India has been attempting to bring some order to the evaluation of this sector and since 2016 has been ranking institutions that volunteer to participate in the exercise. The latest 2019 rankings have 1,479 entries (out of 4,867 applications for ranking made by 3,127 unique applicant institutions under various categories/domains) grouped into nine categories: an Overall rank, category-specific ranks for Universities and Colleges, and domain-specific ranks in Engineering, Management, Pharmacy, Medical, Law and Architecture.
NIRF identifies 16–18 parameters in five major groups. Four of these groups resemble those used in various university ranking exercises internationally. However, an interesting India-centric parameter, which in our convergent validity paradigm we count as a non-overlapping factor, is the ‘Outreach and inclusivity (OI)’ parameter, reflecting the regional diversity of a sub-continental region, outreach, gender equity and inclusion of disadvantaged sections of society.
A more detailed description of the NIRF and the two scores we will use in the construct validity exercise, namely the NIRF score and the PERCEPTION score, is given in the next section. This is followed by a description of the methodology to independently obtain a second-order X-score by considering one of the parameters as an input term and two other parameters as two-dimensional output terms. One section is devoted to describing the relevance of construct validity, because the NIRF is in fact an assessment of a very complex social system, leading to a three-way comparison of scores. The final sections discuss the results and make the concluding remarks.
The National Institutional Ranking Framework (NIRF) of India
The National Institutional Ranking Framework (NIRF) of India reduces the vast complexity of higher education into a single score for each of its participating institutions. Now in its fourth year, NIRF has just released its 2019 rankings of higher educational institutions across the country (https://www.nirfindia.org/2019/Ranking2019.html). NIRF goes beyond other international university ranking schemes, which are based on educational and research excellence, by adding socially desirable indicators. Five broad generic groups are combined, covering aspects classified under the heads ‘Teaching, learning and resources (TLR),’ ‘Research and professional practices (RPC),’ ‘Graduation outcomes (GO),’ ‘Outreach and inclusivity (OI)’ and ‘Perception’. This five-dimensional picture is further elaborated through sub-heads, with weights assigned to each broad head, and further weights assigned to the sub-heads within each head. For each sub-head, a score is generated using suitably proposed metrics, and the sub-head scores are then added to obtain a score for each individual head. The overall score is computed on the basis of the weights allotted to each head and can take a maximum value of 100. Thus, a hugely multi-dimensional input and output problem is compressed into a single score on the basis of which institutions, irrespective of size or resources, are finally rank-ordered.
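As a minimal sketch of this weighted aggregation, the snippet below combines hypothetical head scores with hypothetical weights; the actual NIRF weights differ by category and are further split across sub-heads, so these numbers are assumptions for illustration only.

```python
# Sketch of NIRF-style weighted aggregation of head scores into one overall score.
# Both the weights and the scores below are hypothetical, for illustration only.
def overall_score(head_scores, weights):
    """Combine per-head scores (each out of 100) into a single score out of 100."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(head_scores[h] * weights[h] for h in weights)

# Hypothetical weights for the five broad heads (assumed values).
weights = {"TLR": 0.30, "RPC": 0.30, "GO": 0.20, "OI": 0.10, "PERCEPTION": 0.10}
# Hypothetical head scores for one institution (assumed values).
scores = {"TLR": 70.0, "RPC": 50.0, "GO": 80.0, "OI": 60.0, "PERCEPTION": 40.0}

overall = overall_score(scores, weights)  # 70*0.3 + 50*0.3 + 80*0.2 + 60*0.1 + 40*0.1
```

Because each head score is capped at 100 and the weights sum to 1, the overall score is likewise capped at 100.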
When we address the issue of validity in complex social systems, we find that there is no such thing as an independent ground truth; the closest we have here is the PERCEPTION score, which is an observed variable and not a latent variable emerging from a mathematical model. In the NIRF operationalization, this is included as an input to the final NIRF score, although with a very low weighting of 10%. From the construct validity point of view (an elaboration of which follows in the next section), it is more meaningful to use the PERCEPTION score as a baseline (Bornmann et al. 2019) against the NIRF score to get a better appreciation of the various biases, prejudices and mis-measures involved. Further, using an argument based on separating input scores from output and outcome scores, and size-independent from size-dependent scores (Prathap 2017b), a second-order X-score is obtained as a measure of performance that can be compared in a three-way construct validity exercise against the NIRF score and the PERCEPTION score. This is discussed in a separate section below.
There are many concerns of an epistemological and philosophical nature that the NIRF exercise overlooks. While the input data for NIRF supplies evidence, there is no theory to defend the linear addition of weighted scores of the heads and sub-heads. It is also questionable whether the PERCEPTION score should be incorporated within the NIRF score or instead be used to validate it. The X-score (see section below) is, however, based on a thermodynamic theory of performance (Prathap 2011) that separates input from output and defines a size-independent quality or excellence proxy as a ratio of output to input.
The PERCEPTION score is an observed variable and can be interpreted as a “ground truth” (however faulty it may be; this will become evident when the construct validity maps are drawn) against which the X-score and the NIRF score will be validated. Indeed, if one so desires, the X-scores and the NIRF scores, which are latent variables emerging from the mathematical models, can also be chosen as the ground truth.
Methodology of the X-score
For each institution, category-wise, NIRF makes available a set of five parameters. Thus, as an example, for the Indian Institute of Technology Madras in the Engineering category, we have TLR=93.55, RPC=92.39, GO=84.36, OI=63.99 and PERCEPTION=100. These are based on a possible maximum score of 100 for each indicator. Note that from an econometric and scientometric evaluation protocol (Prathap 2017c), TLR is an input score related to teaching and learning resources while RPC and GO are output or outcome scores related to research and teaching performance. OI is only of sociological relevance and is not considered here for computing the X-score.
We look at the data for the top 100 institutions in three categories: Overall, University and Engineering. In each case, we treat the TLR term as a single input and RPC and GO as two-dimensional output terms. We use a totalization strategy (Prathap 2018), so that for the i-th institution we get an input term I(i) = TLR(i)/Σ TLR(i) and an output term O(i) = (RPC(i)/Σ RPC(i) + GO(i)/Σ GO(i))/2, where each sum runs over i = 1 to 100. The quality term in each case is q(i) = O(i)/I(i). This implies that q = 1 is the norm, or average performance, of all the top 100 institutions in the category. The second-order X-score is simply X(i) = q(i)O(i) = q(i)²I(i). This serves as a single-valued scalar measure of the research and teaching performance of each institution and is a second-order exergy term (Prathap 2011) combining the input, output and quality (excellence) indicators: X = q²I = qO. This exercise is done for the top 100 institutions in the NIRF 2019 rankings in three categories: Overall, Universities and Engineering.
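The totalization above can be sketched in a few lines of NumPy. The TLR, RPC and GO arrays below are synthetic stand-ins for the actual top-100 data (their ranges loosely mimic the published scores); only the arithmetic follows the definitions in the text.

```python
import numpy as np

# Synthetic stand-ins for the top-100 TLR, RPC and GO scores in one category.
rng = np.random.default_rng(0)
TLR = rng.uniform(40, 95, 100)   # input proxy: teaching and learning resources
RPC = rng.uniform(0, 95, 100)    # output proxy: research and professional practices
GO = rng.uniform(40, 100, 100)   # output proxy: graduation outcomes

I = TLR / TLR.sum()                         # totalized input share, I(i)
O = (RPC / RPC.sum() + GO / GO.sum()) / 2   # mean of totalized output shares, O(i)
q = O / I                                   # size-independent quality; q = 1 is the group norm
X = q * O                                   # second-order X-score, X = q*O = q^2 * I
```

By construction, both I and O sum to 1 over the 100 institutions, which is what makes q = 1 the average performance for the group.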
Construct validity and the NIRF assessment
Construct validity can be defined as the extent to which an operationalization measures the concept it is supposed to measure (Cronbach & Meehl 1955; Cook & Campbell 1979). The NIRF scores are intended to measure, and comparatively rank, the performance of higher educational institutions through a complex operationalization. For each institution, it reduces multi-faceted input and output terms into a single number; there is always the danger of a single story (Prathap 2017b). Convergent validity obtains when the measure is associated with things it should be associated with; discriminant validity obtains when the measure is not associated with things it should not be associated with.
We thus have three measures of constructs to evaluate scientific excellence. One is based on peer review (PERCEPTION) and the others (NIRF and X) are based on publication and citation data (RPC) and graduation outcomes (GO). Note that NIRF, in a self-referential way, incorporates PERCEPTION into itself, and also incorporates a sociologically relevant but non-overlapping construct, namely OI. Convergent validity (Bornmann & Daniel 2007; Bornmann et al. 2019) can be established if any two similar constructs correspond with one another. Bornmann et al. (2019) use peer assessment as the baseline, but here we shall proceed with a three-way comparison in each category: NIRF score vs. PERCEPTION, NIRF score vs. X-score, and X-score vs. PERCEPTION. For each category we shall also draw what we call convergent validity maps in addition to using correlation coefficients. Departure from the y=x line is taken as absence of evidence of convergent validity.
Results and Discussion
Table 1 is an extract from the rankings for the Overall category showing the top 10 institutions ranked by NIRF score. TLR is shown as a single input term while RPC and GO are shown as output or outcome indicators. Note that for our argument here on construct validity, OI is a sociologically relevant (especially in the Indian context) but non-overlapping construct as far as teaching and research performance is concerned. If all 100 in the Top 100 rankings are considered, there is a significant variation in the RPC score, from 3.16 for the Datta Meghe Institute of Medical Sciences to 89.24 for the Indian Institute of Science. The variations in TLR (46.51 to 84.56), GO (45.09 to 99.87) and OI (39.42 to 75.87) among the Top 100 are more modest. However, the scores for PERCEPTION range all the way from 0 to 100. Note that we interpret this as a “ground truth” based on peer review from the convergent validity point of view (Bornmann & Daniel 2007; Bornmann et al. 2019). The variations seem to indicate that while RPC and PERCEPTION are allowed to range freely from 0 to 100, this is not so for the TLR, GO and OI scores, which have been telescoped into a narrow band, with a corresponding compounding effect on the final NIRF scores, a feature seen and reported earlier (Prathap 2017a). The Pearson's correlations for the Top 100 in the three categories, shown in Table 2, reveal some interesting insights. The correlations between the input term (TLR) and the output terms (RPC and GO) are very small in the Overall and University categories. The high correlation between RPC and PERCEPTION in all three categories (0.77, 0.86 and 0.87) seems to indicate that the perception scores of peers and employers are biased in anticipation of the institutions' research capability.
In all three categories, there is little or no correlation (0.00, 0.04 and 0.15) between OI and PERCEPTION – this suggests that OI can be presumed to be a non-overlapping factor as far as performance evaluation is concerned. In all three categories, there is little or no correlation between OI and RPC (-0.08, -0.01 and 0.03) – again justifying our assumption that OI is a non-overlapping factor as far as research practice evaluation is concerned.
We have seen earlier that there are three scores that can be “validated” against each other, leading to a three-way representation. In Figures 1 to 3, we see the three possibilities for the institutions in the Overall category. One commonly used approach to studying the construct validity of competing measures is to use correlation coefficients. Table 3 shows the Pearson's correlations for the three construct validity measures of the Top 100 in the three categories. Excellent agreement is seen (correlation coefficients range from 0.76 to 0.88), but this is deceptive. Instead of relying only on correlation coefficients, we can get a good visual appreciation by marking on each map the y=x line (represented by the green dotted lines). The red dotted lines represent the mean scores in each case, and these help divide the map into four quadrants. The quadrant division allows one to identify type I and type II errors in a convergent validity exercise if this is indeed needed (Bornmann & Daniel 2007; Bornmann et al. 2019). The three maps show that the NIRF scores are telescoped into the 35-85 range while the PERCEPTION scores and X-scores span the whole range from 0 to 100. In Fig. 1, we can take either the PERCEPTION score or the NIRF score as the ground truth and validate one against the other. We find that the PERCEPTION scores considerably underestimate performance at the low end of the spectrum, where most of the institutions are. Conversely, one can also argue that the NIRF scores considerably overestimate performance at this end of the spectrum. This is also the picture that emerges from Fig. 3; the NIRF scores favour the low-performing institutions by assigning higher scores than those computed through the X-score model. In Fig. 2, two institutions are prominent outliers in a type I or type II sense; the PERCEPTION score favours them, in an egregiously false-positive way, as compared to an X-score evaluation.
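The quadrant reading of the maps can be sketched as follows. The scores here are synthetic (their ranges loosely mimic the telescoping noted above); the point is only the mechanics: split each axis at its mean, then count concordant and discordant quadrants.

```python
import numpy as np

# Synthetic stand-ins for two scores being cross-validated for 100 institutions.
rng = np.random.default_rng(1)
nirf = rng.uniform(35, 85, 100)        # telescoped into a narrow band
perception = rng.uniform(0, 100, 100)  # ranges freely from 0 to 100

# Pearson correlation between the two constructs.
r = np.corrcoef(nirf, perception)[0, 1]

# Mean lines divide the map into four quadrants.
above_n = nirf > nirf.mean()
above_p = perception > perception.mean()

agree = int(np.sum(above_n == above_p))     # concordant quadrants: evidence of convergence
type_i = int(np.sum(above_p & ~above_n))    # high on one score, low on the other (false positives)
type_ii = int(np.sum(~above_p & above_n))   # low on one score, high on the other (false negatives)
```

A high correlation coefficient can coexist with sizeable off-diagonal counts, which is why the maps add information that r alone conceals.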
Nearly identical maps emerge for the other two categories, and all nine maps are displayed as a collage in Fig. 4. In the University category we can see the Indian Institute of Science prominently ahead of the rest. In the Engineering category, we see a cluster of the five older Indian Institutes of Technology ahead of the rest. The biases are clearly noticeable. In all three categories, the PERCEPTION scores considerably underestimate most institutions and the NIRF scores generally overestimate most institutions.
We looked at the 2019 scores from the National Institutional Ranking Framework (NIRF) for the top 100 institutions from India in three categories from the construct validity point of view. The NIRF exercise provides a final score (the NIRF score) from five broad parameters for participating institutions. One parameter, the peer review-based PERCEPTION score, is an observed variable. Using the TLR parameter as a proxy for teaching and learning resources (input) and the RPC and GO parameters as proxies for teaching and research outputs or outcomes, we independently computed a second-order X-score. The NIRF scores and the X-scores are latent variables that emerge from mathematical models. The three scores are compared in the context of construct validity, and the weaknesses and biases involved in such multi-dimensional evaluation exercises can thereby be recognized.
Bornmann, L., & Daniel, H. D. (2007). Convergent validation of peer review decisions using the h index: Extent of and reasons for type I and type II errors. Journal of Informetrics, 1, 204–213.
Bornmann, L., Tekles, A. & Leydesdorff, L. (2019). How well does I3 perform for impact measurement compared to other bibliometric indicators? The convergent validity of several (field-normalized) indicators. Scientometrics, https://doi.org/10.1007/s11192-019-03071-6.
Cook, T. D., & Campbell, D. T. (1979). Quasi-Experimentation: Design & Analysis Issues in Field Settings. Boston: Houghton Mifflin.
Cronbach, L. J., & Meehl, P.E. (1955). Construct Validity in Psychological Tests. Psychological Bulletin, 52 (4), 281–302. doi:10.1037/h0040957.
Prathap, G. (2011). The Energy–Exergy–Entropy (or EEE) sequences in bibliometric assessment. Scientometrics, 87(3), 515-524.
Prathap, G. (2017a). Making scientometric sense out of NIRF scores. Current Science, 112(6), 1240-1242.
Prathap, G. (2017b). Danger of a single score: NIRF rankings of colleges. Current Science, 113(4), 551-553.
Prathap, G. (2017c). Making scientometric and econometric sense out of NIRF 2017 data. Current Science, 113(7), 1420-1423.
Prathap, G. (2018). Totalized input-output assessment of research productivity of nations using multi-dimensional input and output. Scientometrics, 115(1), 577-583.