G. Abramo
National Research Council of Italy
C.A. D’Angelo
University of Rome “Tor Vergata”, Italy
Measuring academic research productivity is a formidable task, mainly because of the lack of data on inputs. Most of the bibliometric approaches proposed to get around the problem are questionable since they are based on assumptions that invalidate them as supports for policy or management decisions. This essay presents a proxy bibliometric indicator of research productivity that overcomes most of the assumptions and limits that affect the more popular indicators.
In today’s knowledge-based economy, governments strive to continuously improve the effectiveness and efficiency of scientific systems to support competitiveness and socio-economic development. Countries are therefore increasingly moving to strengthen competitive mechanisms in public research, mainly through selective funding and merit-based access to resources (Hicks, 2012). Of the members of the (former) EU28, for example, 16 countries use some form of “performance-based research funding”, or PBRF (Zacharewicz et al., 2019). PBRF systems are generally associated with national research assessment exercises. These resort more or less extensively to evaluative bibliometrics for measuring research performance and ranking universities and public research organisations.
The next section discusses the most popular bibliometric indicators used to assess research performance. The essay then presents what is arguably, to date, the most accurate bibliometric indicator of research performance. It stresses the need for governments to provide bibliometricians with data on the inputs (labour and capital) of research institutions, the lack of which hinders precise measurement. It also presents the first results of a longitudinal analysis of academic research productivity at the national level, showing that productivity is increasing over time for Italian academics in most research fields and overall.
Evaluative bibliometrics builds on two pillars of information: 1) publications indexed in bibliographic repertories, as a measure of research output; and 2) citations received, as a measure of their value, known as “scholarly impact”. The underlying rationale is that for research results to have an impact, they must be used and citations must certify their use. The intrinsic limits of evaluative bibliometrics are apparent: 1) publications are not representative of all knowledge produced; 2) bibliographic repertories do not cover all publications;1 and 3) citations are not always a certification of real use and need not reflect all use.
The past two decades have seen a proliferation of research performance indicators and their variants. This has disoriented decision makers and practitioners, who struggle to weigh their relative pros and cons. The next subsection analyses the most popular categories of these indicators.
One indicator of research productivity is simply the number of publications per researcher. This would be an acceptable metric if the resources used for all research were the same and if all papers, once published, were to have the same impact. However, these assumptions could not be further from the truth.
Another category consists of “citation size-independent indicators”, which are based on a ratio of citations to publications. The most popular representative of this type of indicator is the “mean normalised citation score”, or MNCS. The MNCS measures the average number of (normalised)2 citations of the publications of an individual or institution (Waltman et al., 2011). Within the MNCS category, another indicator of research performance is the share of publications belonging to the top X% of highly cited articles (HCAs).
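To make the two definitions concrete, the sketch below shows how an MNCS and a top-10% HCA share might be computed from field- and year-normalised citation scores. The citation values and the top-10% threshold are purely hypothetical; this is an illustration, not the procedure of any particular database.

```python
# Minimal sketch: MNCS and a top-10% HCA share from field- and year-normalised
# citation scores (1.0 = world average for publications of the same field and year).
# All values and the top-10% threshold are hypothetical.

normalised_citations = [0.4, 1.2, 3.5, 0.0, 2.1, 0.9, 5.8, 1.0]

# MNCS: mean of the normalised citation scores.
mncs = sum(normalised_citations) / len(normalised_citations)

# HCA share: fraction of publications at or above the (assumed known)
# world top-10% threshold for their field and year.
world_top10_threshold = 3.0  # hypothetical value
hca_share = sum(c >= world_top10_threshold for c in normalised_citations) / len(normalised_citations)

print(f"MNCS = {mncs:.2f}, top-10% HCA share = {hca_share:.1%}")
```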
Such “size-independent” indicators were probably devised to get around the lack of data on inputs to the research process, in particular the names and affiliations of research staff. While relatively easy to measure, both indicators – MNCS and HCAs – are invalid for practical uses. Imagine two universities of precisely the same size, resources and research fields. Two simple questions can be asked:
Which one performs better: the first university with 100 articles each earning 10 citations (1 000 total), or a second university with 200 articles, of which 100 have 10 citations, and the other 100 have 5 citations (1 500 total)?
Which performs better: the first university with 10 HCAs out of 100 publications (10% of the total) or a second university with 15 HCAs out of 200 (7.5% of the total)?
In the first example – using MNCS – the second university performs worse than the first (its mean citation count is 7.5 against 10, i.e. 25% lower). However, using common sense, the second university is the better performer because its higher number of total citations has been produced using the same research resources available to the first university.
In the second case, the first university performs better, as the rate at which it produces HCAs is higher. However, again, using common sense, the second is the better performer as it produces a 50% higher number of HCAs from the same research spending.
This category of indicators violates the self-evident fact that if output increases under equal inputs, performance cannot be considered to have diminished. Paradoxically, an organisation (or individual) will receive a worsened MNCS should it produce an additional publication with a normalised impact even slightly below the previous value for the MNCS.
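The paradox can be checked with a few lines of arithmetic reproducing the hypothetical two-university example above:

```python
# The two hypothetical universities from the example above, with equal inputs.
uni_a = [10] * 100                  # 100 articles with 10 citations each
uni_b = [10] * 100 + [5] * 100      # 200 articles: 100 with 10 and 100 with 5 citations

print(sum(uni_a), sum(uni_b))                            # 1000 vs 1500 total citations
print(sum(uni_a) / len(uni_a), sum(uni_b) / len(uni_b))  # means: 10.0 vs 7.5
# With identical inputs, B produces 50% more total impact,
# yet a size-independent indicator such as MNCS ranks it below A.
```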
Another well-known performance indicator is the h-index. In the words of the originator, the h-index “represents the maximum number h of works by a scientist that have at least h citations each” (Hirsch, 2005). Hirsch’s intuitive breakthrough was to represent, with a single whole number, a synthesis of both the quantity and impact of the entire portfolio of a scientist’s published work.
However, the h-index also has drawbacks. First, it ignores the impact of works with fewer than h citations, as well as all citations beyond h received by the works in the h core – often a very considerable share. Second, it fails to field-normalise citations, favouring publications in citation-intensive fields. Third, it fails to account for the years of life of publications, favouring older ones. Fourth, it does not adjust for the number of co-authors or their order in the byline. Lastly, because the intensity of publication differs across research fields, comparing h-indexes for researchers across fields can lead to wrong conclusions. Each of the proposed h-variant indicators tackles one of the many drawbacks of the h-index while leaving the others unsolved. Therefore, none can be considered entirely satisfactory (Iglesias and Pecharromán, 2007; Bornmann et al., 2008).
More fundamentally, all of the above performance indicators share a common problem: they focus on outputs and ignore the inputs to research.
Research performance evaluations based on the above indicators are, at best, of little value. Indeed, they could be dangerous due to the distortions embedded in the information provided to decision makers.3 Some years ago, to overcome the limitations of these indicators, the authors conceived, operationalised and applied a proxy indicator of research productivity derived from the microeconomic theory of production: “Fractional Scientific Strength”, or FSS (Abramo and D’Angelo, 2014).
In simple terms, the FSS of a researcher is the ratio of the value of research output, in a given period, to the cost of the inputs used to produce it. The output consists of researchers’ contributions to their publications indexed in bibliographic repertories. Citation-based metrics measure the value of each publication.4 The cost of inputs consists of the researcher’s wage (labour) and other resources (capital) used to carry out the research.5
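The sketch below illustrates this ratio in a deliberately simplified form. The variable names, cost components and weights are hypothetical; the authors’ actual operationalisation is given in Abramo and D’Angelo (2014).

```python
# Simplified sketch of a researcher-level FSS: value of output divided by
# cost of inputs over a given period. The "value" of each publication is
# taken here as its normalised citation score weighted by the researcher's
# fractional contribution; the "cost" is wage plus other research resources.
# All figures are hypothetical; the actual formula is in Abramo and D'Angelo (2014).

publications = [
    # (normalised citation score, researcher's fractional contribution)
    (1.8, 0.50),
    (0.6, 0.25),
    (3.2, 1.00),
]

output_value = sum(score * fraction for score, fraction in publications)

wage = 60_000             # labour cost over the period (hypothetical)
other_resources = 20_000  # capital and other resources (hypothetical)

fss = output_value / (wage + other_resources)
print(f"FSS = {fss:.6f}")  # meaningful only relative to colleagues in the same field
```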
Unlike the most popular indicators mentioned above, the FSS accounts for input data in addition to output. Nevertheless, all the usual limits of evaluative bibliometrics apply here too. First, publications are not representative of all knowledge produced. Second, bibliographic repertories do not cover all publications. Finally, citations are not always a certification of real use or representative of all use. Furthermore, results are sensitive to the classification schemes adopted for both publications and professors.
Because the intensity of publication varies across research fields, researchers’ productivity is compared to that of others in the same fields.6 For the same reason, productivity at the aggregate level (university, department, research group, discipline or field) cannot be measured by simply averaging the productivity of individual researchers (of each university, department, etc.). A three-step procedure is required: measuring the productivity of each researcher in a field; normalising the individual’s productivity by the average in the field (for instance, an FSS value of 1.10 means the researcher’s productivity is 10% above average); and finally, averaging the normalised productivities.
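A minimal sketch of this three-step procedure follows, with hypothetical researchers, FSS values and field averages; in practice, field averages are computed over the whole national population of each field.

```python
from statistics import mean

# Three-step sketch: (1) FSS per researcher, (2) normalise by the national
# field average, (3) average the normalised values over the unit assessed.
# Researchers, fields, FSS values and field averages are all hypothetical.

department = {
    "r1": ("organic chemistry", 0.012),
    "r2": ("organic chemistry", 0.020),
    "r3": ("experimental physics", 0.030),
}

field_average = {"organic chemistry": 0.016, "experimental physics": 0.025}

normalised = [fss / field_average[field] for field, fss in department.values()]
print(f"Department normalised productivity = {mean(normalised):.2f}")
# A value of 1.10 would mean productivity 10% above the average of the unit's fields.
```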
The FSS can be applied to the Italian academic context thanks to access to input data not readily available in other countries.7
The authors assessed variation in research productivity of all Italian professors in the sciences between two consecutive periods: 2009-12 and 2013-16. The analysis was restricted to Italian professors because data were lacking on inputs for public research organisations other than universities in Italy, and for universities and public research organisations in the rest of the world. The choice of four-year observation periods helps assure the robustness of the results. Input data refer to the two periods, while output data refer to a period one year later. This assumes it takes, on average, a year from knowledge production to its publication.
In the Italian academic system, professors are classified as working in “scientific disciplinary sectors” (SDSs) – e.g. experimental physics, physics of matter, analytical chemistry, organic chemistry, etc. SDSs, in turn, are grouped into “university disciplinary areas” (UDAs), e.g. physics, chemistry, etc. Analysis was limited to the UDAs where bibliometrics can be applied (10 in all, containing 215 SDSs). The analyses were carried out at the SDS level and then aggregated to the UDA and overall levels.
Figure 1. Number of professors and total net revenues of Italian public universities, 2009-2016
Note: The 2009 and 2010 values are inferred.
Source: Ministry of University and Research, http://cercauniversita.cineca.it/php5/docenti/cerca.php, for number of professors; ba.miur.it, for total net revenues.
Figure 2. Change in research productivity by UDA between 2009-12 and 2013-16
Source: Data elaborations by the authors based on Web of Science Italian publications.
With respect to research inputs, Figure 1 shows that between 2009 and 2016 the total number of professors decreased by over 10%. In the same period, the overall net revenues of public universities decreased significantly until 2013, with a slight recovery after that. This implies the U-shaped plot of the yearly resources per capita used for research (dotted line in the right panel of Figure 1).
Figure 2 reports by UDA the change in research productivity in the later four-year period, compared to the earlier four-year period. The overall average variation is +46.6%, led by civil engineering (+112.8%), psychology (+79.5%), and agricultural and veterinary sciences (+78.4%). The lowest average increases occurred in biology (+21.4%), physics (+22.9%) and chemistry (+22.9%).
Only 12 of the 215 SDSs registered a decrease in productivity, with an average fall of 7.6%. Conversely, the remaining 203 SDSs registered an increase, averaging +49.0%.
Bibliometrics can inform large-scale assessments of academic research productivity in the sciences. However, the lack of input data in most countries has led bibliometricians to develop indicators based on assumptions that largely invalidate them as a support to policy or management decisions. Governments and research institutions may well expect precise and reliable performance evaluations in support of decisions and policy making. If so, they must be prepared to provide bibliometricians, wherever possible, with the data necessary for the assessments (i.e. name and affiliation of scientists, field of research, wage or academic rank, other resources allocated, etc.).
Such data are largely available in Italy, which has permitted conception and application of an output-to-input indicator of research performance at individual and aggregate levels. An intertemporal analysis between two consecutive periods, 2009-12 and 2013-16, showed that Italian academics in science increased research productivity overall and in about 95% of research fields.
The reasons behind these widespread and noticeable increases in productivity are arguably to be found in the competitive mechanisms introduced by the Italian government in the public research sector. These comprise selective funding, merit-based access to public resources and performance-based access to academia. In particular, the government set up a national agency to evaluate research. It completed the first national research assessment exercise in July 2013 with the publication of the university performance ranking lists. Finally, it set up the national scientific accreditation scheme for professorships, based on bibliometric performance indicators and relevant thresholds.
A key question for all countries is how to conduct national or international research productivity assessments without relevant data on research inputs. There are two possible research trajectories in evaluative bibliometrics. The classic one, followed by most, continues to ignore input data and tries to improve the old output-based indicators (or to propose new ones). Conversely, a new paradigm tries to find ways to identify and account for input data. Within the latter, one stream of research attempts to trace the research personnel of institutions indirectly, through their publications, using bibliographic repertories that embed the possible affiliations of each author.
The authors are currently investigating the extent of bias in the productivity rankings of research organisations when assessments are based on research personnel identified through such “indirect” methods. The aim is to inform policy makers’ decisions on whether to invest in building national research staff databases or to settle for indirect methods, which carry their own measurement biases.
Abramo, G. and C.A. D’Angelo (2018), “A comparison of university performance scores and ranks by MNCS and FSS”, Journal of Informetrics, Vol. 10/4, pp. 889-901, https://arxiv.org/abs/1810.12661.
Abramo, G. and C.A. D’Angelo (2014), “How do you define and measure research productivity?” Scientometrics, Vol. 101/2, pp. 1129-1144, http://doi.org/10.1007/s11192-014-1269-8.
Abramo, G. et al. (2020), “Comparison of research productivity of Italian and Norwegian professors and universities”, Journal of Informetrics, Vol. 14/2, 101023, https://doi.org/10.1016/j.joi.2020.101023.
Abramo, G. et al. (2019), “Predicting long-term publication impact through a combination of early citations and journal impact factor”, Journal of Informetrics, Vol. 13/1, pp. 32-49, https://doi.org/10.1016/j.joi.2018.11.003.
Archambault, É. et al. (2006), “Benchmarking scientific output in the social sciences and humanities: The limits of existing databases”, Scientometrics, Vol. 68/3, pp. 329-342, https://doi.org/10.1007/s11192-006-0115-z.
Bornmann, L. et al. (2008), “Are there better indices for evaluation purposes than the h-index? A comparison of nine different variants of the h-index using data from biomedicine”, Journal of the American Society for Information Science and Technology, Vol. 59/5, pp. 830-837, https://doi.org/10.1002/asi.20806.
Hicks, D. (2012), “Performance-based university research funding systems”, Research Policy, Vol. 41/2, pp. 251-261, https://doi.org/10.1016/j.respol.2011.09.007.
Hirsch, J.E. (2005), “An index to quantify an individual’s scientific research output”, in Proceedings of the National Academy of Sciences, Vol. 102/46, pp. 16569-16572, https://doi.org/10.1073/pnas.0507655102.
Iglesias, J.E. and C. Pecharromán (2007), “Scaling the h-index for different scientific ISI fields”, Scientometrics, Vol. 73/3, pp. 303-320, https://doi.org/10.1007/s11192-007-1805-x.
Waltman, L. et al. (2011), “Towards a new crown indicator: Some theoretical considerations”, Journal of Informetrics, Vol. 5/1, pp. 37-47, https://doi.org/10.1016/j.joi.2010.08.001.
Zacharewicz, T. et al. (2019), “Performance-based research funding in EU member states – a comparative assessment”, Science and Public Policy, Vol. 46/1, pp. 105-115, https://doi.org/10.1093/scipol/scy041.
1. A corollary is that evaluative bibliometrics should not be applied to the arts and humanities, due to the scarce coverage of these fields in bibliographic repertories (Archambault et al., 2006).
2. Citations are normalised to the average citations of all world publications of the same year and field. This aims to avoid favouring older publications, which would accumulate more citations simply because there has been more time for them to be cited, or publications falling in fields with a high intensity of citation.
3. To appreciate the magnitude of such distortions, see Abramo and D’Angelo (2018).
4. Weighted combinations of normalised citations and normalised impact factor (i.e. the prestige) of the hosting journal are used. These are the best predictors of publications’ future total citations (see Abramo et al., 2019).
5. Abramo et al. (2020) explains the limits and assumptions embedded in the operationalisation of the measurement.
6. In Italy, all academics are officially classified in one and only one field. In countries where this classification is missing, the field of research might be identified as the field in which the scientist’s publications are most frequent.
7. Nevertheless, it is still necessary to make several assumptions that limit the final results. Abramo and D’Angelo (2014) describe the data, FSS formula and methods used to assess research productivity in Italy.