Assessing the robustness of composite indicators: the case of the Global Innovation Index

This research paper introduces a methodology to assess the robustness of the Global Innovation Index (GII) by comparing its rankings with those obtained using alternative data-driven methodologies such as data envelopment analysis (DEA) and principal component analysis (PCA). In doing so, the paper aims to reduce the level of subjectivity in the construction of composite indicators regarding weight generation and indicator aggregation. The paper relies on PCA as a weighting-aggregation scheme to reproduce the 21 sub-pillars of the GII before applying DEA to calculate the relative efficiency score of every country. The PCA-DEA model then produces a final ranking for all countries. Random forests (RF) classification is used to examine the robustness of the new ranking. The comparison between the new ranking and that of the GII suggests that the countries positioned at the top or the bottom of the GII ranking are less sensitive to the modification than those in the middle, whose rank is not robust to changes in the construction method. The PCA-DEA model introduced in this paper provides policymakers with an effective tool to monitor the performance of national innovation policies from the perspective of their relative efficiency. Ultimately, the contribution made in this paper could be instrumental in enhancing the effectiveness and efficiency of innovation management at the national level.


Introduction
For a long time, continuous efforts have been deployed to improve composite indicators (CIs) as a tool for measuring national innovation systems' performance (Bandura, 2011;Dutta & Lanvin, 2013;Edquist et al., 2018;Corrente et al., 2021;Alnafrah, 2021).Innovation, according to the Oslo Manual (OECD/Eurostat, 2018, p. 20) is defined as 'new or improved product or business process (or combination thereof ) that differs significantly from the firm's previous products or business processes and that has been introduced on the market or brought into use by the firm' .In this regard, national innovation systems represent the harmonious combination of several facilitators and parties such as institutional structures, infrastructures, or any supporting activities and policies that orchestrate together to facilitate and create an appropriate environment to foster innovation (Lundvall, 2007).The need for innovation and knowledge-driven growth is no longer exclusively for developed countries, as developing countries are also acting to design policies aiming to enhance their innovation performance to achieve better economic and environmental growth (Broughel & Thierer, 2019;Nikolaidis et al., 2013).
CIs represent the aggregation of a set of individual indicators into a single index that aims to measure a multi-dimensional concept (OECD, 2004).In the last decades, a growing trend has emerged among policymakers, media, and all areas of research regarding CIs and the rankings derived from them-examples of CIs include education, health, government quality, innovation- (Bandura, 2011;Barbero et al., 2021;Crespo & Crespo, 2016;Greco et al., 2019;Hatefi & Torabi, 2010).These rankings made a significant impact on the perceived image of how decision-making units (e.g., countries, regions, cities, universities) are relatively performing (Cherchye et al., 2008;Zabala-Iturriagagoitia et al., 2007a).Not only do they influence people's desire to visit or to invest in these countries, but also to legitimize the policies adopted in these countries to support the development of certain activities (i.e., education, economy, health, etc.).In the case of innovation systems, increasing efforts have been devoted to developing tools and indicators to assess economic and political potential, in order to provide countries with scientific evidence that helps to develop policies and activities towards the development, diffusion, and adoption of innovation (Edquist, 2011).CIs play a fundamental role in such endeavors.Nevertheless, several concerns emerge from this general context, namely: how much does the ranking provided by a CI represent the actual performance of a particular decision-making unit?And, how robust are the results against the methodology used in such CIs?This paper aims to contribute to clarifying these research questions, by focusing on the analysis of the statistical information provided by the 2019 edition of the Global Innovation Index (GII).
The literature has evidenced how the underlying methodology followed in the construction of CIs has a direct influence on its final results and rankings (Edquist et al., 2018;Greco et al., 2019;Grupp & Schubert, 2010;OECD, 2008;Zabala-Iturriagagoitia et al., 2007a).However, the CIs designed to measure such issues like innovation, education, health, etc., have remained blind to this evidence.The examination of more methodologies to assess whether the results of CIs change depending on the way in which data are processed, weighted, or aggregated is referred to as the robustness of the CI (OECD, 2008).Out of the multiple CIs that have been developed to assess innovation potential, this paper examines the robustness of the 2019 edition of the GII.To do that, this paper applies alternative data-driven techniques such as data envelopment analysis (DEA), principal component analysis (PCA), and random forests (RF) to produce a new rank for the GII countries to examine the robustness of the GII.
The structure of this paper can be summarized as follows. Section The Global Innovation Index and the need for different perspectives presents the relevance of the GII and its main characteristics, as well as the main concerns about its structure and methodology. Moreover, it highlights the need to combine different perspectives to provide a sound methodological analysis of the GII. Section Methodological challenges in the construction of composite indicators presents the main methodological challenges in the construction of CIs. Section Methodology discusses the methodology of the new model (PCA-DEA) introduced in this paper. The subsequent section presents the results obtained by the PCA-DEA model, as well as the explanation of the outcomes. It also provides a comparison between the original GII rank and the rank obtained with the new model. The final section discusses the main conclusions and the contribution of the paper.

The Global Innovation Index and the need for different perspectives
The GII is a global comparative study that involves 129 countries using detailed evidence of 80 indicators (i.e., in its 2019 edition), accounting for around 95% of the global Gross Domestic Product.It provides a tool to measure national innovation performance using two sub-indices, innovation input, and innovation output.There are five pillars under the input sub-index, which consists of Institutions, Human Capital and Research, Infrastructure, Market sophistication, and Business sophistication.In turn, the innovation output sub-index consists of two pillars, Knowledge and Technology Output and Creative Output.Under each pillar, there are several sub-pillars, and under each sub-pillar, there are several indicators (Dutta et al., 2019, p. 307).
The GII represents a significant source of insight in relation to national level sustainable innovation and aims to evaluate the environment supporting innovation at the national level.At the same time, it also helps to determine conditions for the diffusion of innovation and its importance for a country's development (Dutta et al., 2019).The GII is distinguished from the other composite innovation indices, as it emphasizes various components related to intangible assets, such as trademarks, global brand value, and industrial designs (Dutta et al., 2022).Owing to this feature, the GII can be used as a leading reference for researchers, business executives, and policymakers toward creative intangible asset-related innovation.
The GII is not designed to be the ultimate tool to reinforce rankings with respect to economies and countries. Rather, it provides a foundation for continually evaluating the factors of innovation, which can provide much insight for economies. In its 2019 edition, the GII provided a rich database of detailed metrics to redefine innovation policies, which are imperative for policymakers. There are many layers in measuring and understanding innovation, due to its multi-dimensional nature, which can help identify core and best practices and focus on holistic policies (Decancq & Lugo, 2013; Edquist et al., 2018; Zabala-Iturriagagoitia et al., 2007a). The detailed data allow economies to monitor performance over time and thus benchmark developments against other economies, allowing comparisons and the identification of best practices, which further supports the aim of the GII. Additionally, the 2019 edition of the GII highlights that it should be "regarded as a sound attempt, matured over 12 years of constant refinements, to pave the way for better and more informed innovation policies worldwide" (Dutta et al., 2019, p. 387). With this in mind, it is important to recognize that the GII is not the be-all and end-all of innovation rankings, but a resource that can be used to assist countries in policymaking.
The GII has persistently maintained a similar methodology in all the editions published over the last 10 years, with very slight changes in the weights attributed to its indicators. As a result, its results have been consistent, yielding similar rankings with very small variations across countries. In particular, the aggregation methods followed by the GII are the arithmetic average and the weighted average. Also, it justifies the assignment of weights only by "statistical coherence" and "highest correlation" between indicators, sub-pillars, and pillars (Dutta et al., 2019, pp. 371-373). In the construction of every CI, assigning weights is an imperative part of the process, and the level of subjectivity should be treated with caution since it has a bearing on the final results (Greco et al., 2019). In the case of the GII, the allocation and production of weights raise several concerns. First, the GII reveals how weights have been generated at the level of the seven pillars, but not how they have been generated for its indicators and sub-pillars. Second, the 2019 edition of the GII specifically adjusted the weights for three indicators to provide "statistical coherence". Third, five of the indicators included in the 2019 edition were collected by subjective means such as qualitative data, e.g., stakeholder interviews, which might introduce multiple layers of subjectivity (Dutta et al., 2019). Fourth, in the process of data aggregation, the arithmetic average has been used in multiple layers, which is an unreliable statistical tool when combining large amounts of data, since it disregards the greater significance of certain variables over others (Ćudić et al., 2022). Interestingly, the GII does not indicate any methodological rationale for using the arithmetic average in the process of aggregation. Furthermore, when aggregating 80 indicators by using averages, there is a high probability that two or more indicators are correlated, creating an additional bias in the results. In effect, this double counting of correlated indicators may distort the soundness of the outcomes. Considering these methodological concerns, the application of methodologies such as PCA is highly recommended to produce these relative weights (Adler & Yazhemsky, 2010), as it can help to solve this problem by grouping indicators into a set of consistent components, each being given a relative weight.
As previously argued, the GII has presented stability in its rankings for a long period of time, which may not aid in dynamically redefining policies, as the challenges facing innovation systems evolve over time. Additionally, it is important to recognize that some countries that follow the direction of the GII may reap the benefits of its ranking, but this does not necessarily aid the economic growth of that country, as the GII paints all countries with the same brush rather than assessing each country on its own merits (Crespo & Crespo, 2016; Jankowska et al., 2017). Therefore, the literature has already tested new methods for data processing within the GII. For instance, Cui et al. (2020) used RF and artificial neural networks to recalculate the GII scores for the years 2019 and 2020 by using 14 indicators. Along the same lines, Pence et al. (2019) estimated the GII countries' scores using artificial neural networks, choosing 27 indicators of the GII 2016. Consequently, the possibility of approximating the GII countries' scores by using data-driven approaches with fewer parameters may help countries to monitor the change in their innovation performance in a shorter cycle.
The use of the unweighted average of the GII sub-indices to calculate the general index raises several issues. Firstly, the output sub-index disregards the amount of innovation inputs utilized. Secondly, it does not seem appropriate to weight the contribution of the output sub-index equally to the contribution of the input sub-index, since the number of GII pillars on the two sides is not equal. Furthermore, the use of the unweighted average to aggregate the data at the level of pillars and sub-indices implies, without justification, the same significance for all of them within all countries. Accordingly, implying that all countries can be clustered in one big group measured in the same way may yield an unfair scale (Stavbunik & Pelucha, 2019). Additionally, the GII has numerous missing data values, and hence, estimating or imputing these data points will lead to higher precision in results (Cui et al., 2020; Omer et al., 2020).
The GII produces efficiency scores. However, considering the GII data structure (input and output), the efficiency of an innovation system can be paired with productivity, i.e., the amount of output a system can generate using a certain amount of input (Edquist et al., 2018; Jankowska et al., 2017). Hence, in order to compare the efficiency of two systems, it is necessary to measure the output that can be generated by utilizing the same input or less (Dutta et al., 2019). In line with this notion, there is always room to improve the GII methodology by using data-driven techniques such as DEA, which seems a very appropriate technique to measure the efficiency of a national innovation system (Alnafrah, 2021; Barbero et al., 2021; Hatefi & Torabi, 2010; Omrani et al., 2019; Zabala-Iturriagagoitia et al., 2007b), given that DEA generates the weights without prior intervention and the efficiency score for each country is relative to all the other countries.

Methodological challenges in the construction of composite indicators
CIs play an essential role in formulating the shape of the innovation policies and the awareness of them (Edquist et al., 2018).Meanwhile, it is crucial to set the characteristics of the policies that can fulfill the needs of the system (Zabala-Iturriagagoitia et al., 2007a).Therefore, it is alarming to accept and use them simplistically without any scrutiny and discussion (Grupp & Schubert, 2010).Accordingly, there are at least three fundamental challenges that need to be considered when creating CIs: (1) the relative weights associated with the indicators included in it, (2) the methodology used in the aggregation of these indicators, and (3) the robustness of its results (Freudenberg, 2003;Greco et al., 2019;Munda, 2012;OECD, 2008;Saisana et al., 2005).The misuse of the first two factors has a direct impact on the robustness of the final outcomes.

Composite indicator weighting
In the process of constructing a CI, selecting or developing the most sensible weighting scheme is critical, due to the strong effect of the weights on the final ranking (Greco et al., 2019;Yang et al., 2018).For example, in the case of the Technological Achievement Index introduced by Desai et al. (2002), changing the weights of some indicators reflected noticeable changes in the overall ranking with its consequent (political) implications (Cherchye et al., 2008).According to the OECD (2018), a "weight" represents a coefficient associated with a certain variable in the construction of CIs.In other words, weights represent the contribution of a variable to the overall CI, or to the sub-indices that constitute that overall CI.Therefore, the selection of the weights represents a major challenge in the process of constructing CIs.This challenge is frequently referred to as the 'Index problem' (Cox et al., 1992;Freudenberg, 2003).
To address this challenge, a range of weighting schemes has been developed in the literature. For instance, (1) "no or equal weights" means that equal weights are assigned to all indicators, which is equivalent to saying that no weights are assigned. This is often labeled in the literature as an Attributes-Based Weighting System (Freudenberg, 2003). This approach, among others, has been used by the European Innovation Scoreboard (European Commission, 2019). However, this technique neglects the variation in relative importance among indicators. Another solution is to adopt (2) budget allocation processes. In this scheme, a set of 'n' points is given to a group of experts so that they can distribute them across a group of indicators (Moldan & Billharz, 1997). For example, this scheme was used to estimate the weights of the e-Business Readiness Index introduced by Pennoni et al. (2005). However, a problem of inconsistency might occur if the number of indicators is larger than 10, or if the group of experts is not carefully selected (Saisana & Tarantola, 2002). Also, a critical aspect of this scheme is that it involves a high level of subjectivity. Another alternative is to (3) adopt data-driven weight-assigning techniques. In this case, and contrary to the previously mentioned weighting schemes, which involve a high level of subjectivity and arbitrariness in the selection of weights, statistical techniques such as PCA and DEA are claimed to be desirable as they entail a high level of objectivity in the decision-making (Decancq & Lugo, 2013; OECD, 2008).

Composite indicator aggregation
As argued above, characterizing and measuring complex phenomena through a 'simplistic' CI may lead to flawed results and conclusions.An alternative would be to develop several indices that would measure the same phenomenon from different perspectives.However, this alternative would increase the difficulty to interpret the results, particularly for non-specialized audiences such as the media, or policymakers.They would rather rely on a single number that includes all the indicators, which provides a simpler understanding of a complex phenomenon, even if this conclusion may be biased and hence lead to wrong (policy) implications (Saltelli, 2007).As a result, the use of CIs as a measurement tool is subject to debate.Moreover, the extent to which a CI represents such a phenomenon is also contested (Greco et al., 2019).
An obvious debate emerges in the literature between aggregators versus non-aggregators.Aggregators support building synthetic indices to describe a whole (complex) phenomenon, by combining indicators using a certain aspect to produce a bottom line, which will result in a meaningful outcome (Cobb et al., 1995;Gadrey & Jany, 2003;Osberg & Sharpe, 2002).In turn, non-aggregators consider that the previous aggregation results will be statistically meaningless, and the process must stop at the level of having a set of indicators without combining them to a single outcome, their objection to aggregation being the arbitrariness of the process of weighting and combining (Atkinson et al., 2002;Henderson, 1974).Nevertheless, it is worth stating that most of the widespread indices such as the Human Development Index (UNDP, 2019) adopt a methodological framework that uses aggregation.
Aggregation is the last step in the process of constructing a CI.According to the OECD (2008) aggregation methods can be broken down into three different categories: linear, geometric, and multi-criteria.Moreover, there is yet another categorization of aggregation to be considered, namely, 'compensatory' or 'non-compensatory' (Munda, 2005).Compensatory approaches occur when there is a trade-off between two perceptions of weights (Paruolo et al., 2013).Understandably, this trade-off might cause a fixed compensability between pairs of dimensions.This could happen when one of the dimensions might cause a loss for another (OECD, 2008).In this context, Munda (2012) argues that in a hypothetical sustainability index, the dimension of economic growth could compensate for the loss in the environmental dimension.To conclude, the Linear or Additive utility-based approach is the most frequently used among the approaches of compensatory aggregation (Saisana & Tarantola, 2002).
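To make the compensability point concrete, the following minimal sketch contrasts linear (fully compensatory) and geometric (partially compensatory) aggregation. The two units, their scores, the equal weights, and the function names are hypothetical illustrations, not values or procedures taken from the GII.

```python
from math import prod

def linear_aggregate(values, weights):
    """Weighted arithmetic mean: a low score can be fully offset by a high one."""
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total

def geometric_aggregate(values, weights):
    """Weighted geometric mean: penalises unbalanced profiles (values must be > 0)."""
    total = sum(weights)
    return prod(v ** (w / total) for w, v in zip(weights, values))

# Hypothetical normalised scores on (economic, environmental) dimensions.
balanced = [0.5, 0.5]
lopsided = [0.9, 0.1]
weights = [1.0, 1.0]  # equal weights, assumed for illustration

for name, scores in [("balanced", balanced), ("lopsided", lopsided)]:
    print(name,
          round(linear_aggregate(scores, weights), 3),
          round(geometric_aggregate(scores, weights), 3))
```

Under linear aggregation the two units are indistinguishable, whereas the geometric mean ranks the unbalanced unit lower, which is the sense in which the economic dimension can "compensate" for the environmental one only in the linear case.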
The non-compensatory multi-criteria approach is less frequently used, due to the simplicity of applying compensatory approaches (Greco et al., 2019).Nevertheless, the Condorcet-Kemeny-Young-Levenglick non-compensatory approach is regarded as a plausible alternative to the frequently practiced linear aggregation.This approach has been applied to the Environmental Sustainability Index introduced by Esty et al. (2005), and it has produced remarkable differences in the rankings reported by the original compensatory approach (Munda & Nardo, 2005).

Robustness of composite indicator
The robustness analysis aims to guarantee the quality and authenticity of a CI during the different levels of its construction, such as the theoretical and methodological framework development (Corrente et al., 2021;Saisana et al., 2005).Therefore, in the absence of robustness analysis, CIs may draw conclusions that would deliver a misleading message to the audience (Barbero et al., 2021;Billaut et al., 2010;Corrente et al., 2021;Saisana et al., 2005).However, the significance of robustness analysis is often neglected by most widespread CIs (OECD, 2008).To assure a high level of robustness in the image projected by the CI, some techniques such as uncertainty analysis or sensitivity analysis should be used (OECD, 2008).
Uncertainty analysis refers to the magnitude of changes that may occur in the final outputs (i.e., conclusions) of a CI as a result of changes in the construction stages, such as weights assigning or aggregation (Greco et al., 2019;Grupp & Schubert, 2010;OECD, 2008).In turn, sensitivity analysis deals with how much variance of the final outcomes is yielded due to these uncertainties (Grupp & Mogee, 2004;Saisana et al., 2005).Hereby, it can be concluded that there is paramount demand to assess the influence of the methodological issues amid the construction of CIs.One of the methods that can be beneficial to examine the robustness of CIs, is the classification of the RF.Such classification can be an indicator of the level of a unit (country) ranking.

Methodology
The dataset
This research uses the dataset provided by the 2019 edition of the GII (Dutta et al., 2019). This is the last report before the COVID-19 pandemic, so the data reflecting the innovation performance of countries may have changed since then, due to the need to reallocate some resources to other areas of public action (e.g., health) given the health emergency. The dataset consists of 80 indicators (53 inputs and 27 outputs) collected from 129 countries. The data are available only in PDF format, so we had to transfer them into an MS-Excel sheet manually. The reliability of the data is supported by the fact that they are published by well-known international agencies (i.e., WIPO, INSEAD, Cornell University) and recognized by institutes and established bodies such as the UN Economic and Social Council (WIPO, 2021). Furthermore, GII indicators are derived from distinguished socio-economic data such as government effectiveness, tertiary education enrollment, global R&D companies' expenditure, and ICT usage.
When analyzing the data of the 2019 GII, there are only 16 indicators without missing data for all countries.The remaining indicators have different percentages of missing data points, where the highest percentage of missing data for one indicator (High-tech net exports) is 51.2%.Among the 64 indicators with missing data, 27 indicators have less than 5% missing data points, 11 indicators have 5-10% missing data points, 12 indicators have between 10 and 20% missing data points, 7 indicators have 20-30% missing data points, and the remaining 7 indicators have more than 30% missing data points.
To deal with the missing data, several steps were taken.Firstly, the countries were divided into four groups based on the level of income: high-, upper-middle-, lower-middle-, and low-income according to the World Bank.Following this, the mean of each indicator was calculated for each group.Secondly, the countries were divided into four groups again based on the Human Development Index (UNDP, 2019).Following this, the mean of each indicator was calculated for each group.Thirdly, each country was grouped with the five nearest neighboring countries, then the mean of each indicator was calculated for each group.Finally, the mean of the three means for each indicator was taken as an estimation for the missing data point.However, the previous steps did not solve the problem entirely, leaving the data with 40 missing data points, which were finally imputed by linear regression modeling.
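The imputation steps described above can be sketched as follows for a single indicator. This is an illustration under stated assumptions: the function names, the dictionary-based data layout, and the toy values are hypothetical, and the way the "nearest neighboring countries" are determined is not specified in the text, so the neighbour lists are simply passed in.

```python
from statistics import mean

def group_mean(values, members):
    """Mean of the observed (non-None) values among the given countries."""
    observed = [values[m] for m in members if values.get(m) is not None]
    return mean(observed) if observed else None

def impute(values, income_group, hdi_group, neighbours):
    """Fill each missing value with the mean of three group means.

    values       : {country: value or None} for ONE indicator
    income_group : {country: label}, e.g. World Bank income class
    hdi_group    : {country: label}, e.g. Human Development Index tier
    neighbours   : {country: [nearest countries]} (definition assumed)
    """
    filled = dict(values)
    for country, value in values.items():
        if value is not None:
            continue
        same_income = [c for c in values if income_group[c] == income_group[country]]
        same_hdi = [c for c in values if hdi_group[c] == hdi_group[country]]
        means = [
            group_mean(values, same_income),
            group_mean(values, same_hdi),
            group_mean(values, neighbours[country]),
        ]
        # If any group has no observed value, the point stays missing;
        # such residual gaps would need another method (e.g. regression).
        if all(m is not None for m in means):
            filled[country] = mean(means)
    return filled

# Toy data: country "C" is missing; two neighbours are used instead of five.
values = {"A": 10.0, "B": 20.0, "C": None, "D": 40.0}
income = {"A": "high", "B": "high", "C": "high", "D": "low"}
hdi = {"A": "very high", "B": "high", "C": "high", "D": "high"}
near = {"C": ["A", "D"]}

filled = impute(values, income, hdi, near)
print(filled["C"])  # mean of the three group means 15, 30 and 25
```

Observed values are never overwritten, and a point is only filled when all three group means exist, which is one plausible reading of why some data points remained missing after this stage.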

Data envelopment analysis
DEA is a non-parametric linear programming method developed by Charnes et al. (1978) in the field of operations research, which is often referred to as the CCR model. The idea behind this model is to evaluate the relative efficiency of homogeneous decision-making units (DMUs) (e.g., companies, banks, universities, countries) by giving them scores between 0 and 1. Specifically, DMUs with a score = 1 are considered "efficient" and DMUs with a score < 1 are considered "inefficient". In the 1980s, Banker et al. (1984) extended the method to allow for variable returns to scale, which the literature refers to as the BCC model. DEA relies on a frontier created from the observed DMUs by utilizing the so-called best practice, based on the minimum extrapolation principle (Thanassoulis, 2001). Shen et al. (2013) conclude that DEA provides several features to the field of CIs: (1) it is a means to combine multiple indicators for countries without any prior knowledge of the trade-offs (i.e., the weights); (2) each country obtains its own best possible indicator weights; (3) if a country is underperforming compared to other countries, this cannot be attributed to an unfair weighting scheme, since every country has been put in its most beneficial position vis-à-vis all the other countries; and (4) any other weighting scheme would have generated a lower score for that particular country. Additionally, DEA evaluates the relative efficiency of every country, taking into consideration the performance of all other countries (Cherchye et al., 2008). For these reasons, DEA has been broadly utilized to examine CIs, to name but a few: the Technology Achievement Index (Cherchye et al., 2008), the Macro-economic Performance Index (Ramanathan, 2006), the Human Development Index (Despotis, 2005) and the Knowledge Economy composite indicator (Guaita Martínez et al., 2021).
In the context of CI construction, the literature has broadly suggested an adjustment to the classic DEA formulation in which all the indicators are treated as outputs (Cherchye et al., 2008; Guaita Martínez et al., 2021; Hermans et al., 2008; Martin et al., 2017). This adjustment is known as the "benefit of doubt" approach (Cherchye et al., 2007): it shifts all input variables to become outputs and replaces the inputs with a dummy variable equal to one. It was initially adopted by Melyn and Moesen (1991) as a method to construct CIs to evaluate macroeconomic performance. This approach is to be considered if the underlying structure of the evaluated composite phenomenon is not definitive, if there is disagreement regarding the construction methodology, or if the input indicators are considered to be "achievements" (Cherchye et al., 2007). All these concerns are valid for any CI that endeavors to measure innovation performance. For example, Crespo and Crespo (2016), by applying a fuzzy-set qualitative comparative analysis, conclude that none of the GII input pillars is a necessary condition for anticipating high innovation performance. Meanwhile, in the high-income countries, only two of the pillars (i.e., Infrastructure and Human capital and research) are sufficient to secure better innovation performance. Moreover, Jankowska et al. (2017), Edquist et al. (2018), and Barbero et al. (2021), among others, show that the common assumption that the higher the GII input indicators, the higher the GII output indicators, is not confirmed.
For the reasons stated above, this paper relies on DEA using the benefit of doubt approach (see Eq. 1), with one dummy input variable equal to 1 and 21 output variables. In particular, these 21 output variables are generated as the linear combination of the indicators under each sub-pillar of the GII, i.e., $V_q = u_1 I_1 + u_2 I_2 + \dots + u_n I_n$, where $I_1, \dots, I_n$ are the indicators under sub-pillar $V_q$, $q = 1, \dots, 21$, $n$ is the number of indicators under each sub-pillar, and $u_n$ is the weight generated for indicator $n$ by using the PCA one-component loadings for the sub-pillar that the indicator belongs to. Thus, the linear combination of indicators under the sub-pillar "Political environment" will produce variable 1, the linear combination of indicators under the sub-pillar "Regulatory environment" will produce variable 2, etc. (see Table 1). This PCA-DEA approach was introduced by Adler and Yazhemsky (2010).
(1) (2)
where $I_{qc}$ is the normalized value of the $q$th individual variable ($q = 1, \dots, Q$) for country $c$ ($c = 1, \dots, M$) and $w_{qc}$ is the corresponding weight (Cherchye et al., 2004), while $I^{*}$ is the "benchmark performance" (i.e., the hypothetical country that maximizes the overall performance (OECD, 2008)). However, due to the nature of DEA, all efficient countries (i.e., DMUs that lie on the frontier) obtain the same efficiency score equal to one. Consequently, at least for these countries, the resulting ranking will not be fully discriminating. This limitation of DEA is known in the literature as the "discrimination power problem" (Adler & Yazhemsky, 2010; Barbero et al., 2021; Hatefi & Torabi, 2010). To address this problem, a sequence of sub-DEAs is executed over the efficient countries only, by dividing the output variables for these countries into subsets according to the GII pillars. For example, the first sub-DEA is performed over the output variables "Political environment", "Regulatory environment", and "Business environment", with a dummy input equal to 1. This is repeated seven times for the seven pillars. Finally, the total efficiency score for each of these countries is calculated as the average of the seven sub-DEA scores.
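For reference, a commonly used statement of the benefit of doubt model, following the formulation in the OECD (2008) handbook, is sketched here. It is assumed, not confirmed, to correspond to Eqs. 1 and 2, and the notation is matched to the definitions of $I_{qc}$, $w_{qc}$, and $I^{*}$ given above.

```latex
% Composite score as the ratio of actual to benchmark performance,
% and the benchmark as the best observed performance under the same weights.
\begin{align}
CI_c &= \frac{\sum_{q=1}^{Q} I_{qc}\, w_{qc}}{\sum_{q=1}^{Q} I^{*}_{q}\, w_{qc}}, \\
I^{*} &= \arg\max_{I_k,\; k \in \{1,\dots,M\}} \sum_{q=1}^{Q} I_{qk}\, w_{qc}.
\end{align}

% Equivalent linear program solved once per country c
% (single dummy input equal to 1, all indicators treated as outputs):
\begin{equation*}
CI_c^{*} = \max_{w_{qc} \ge 0} \sum_{q=1}^{Q} I_{qc}\, w_{qc}
\quad \text{s.t.} \quad
\sum_{q=1}^{Q} I_{qk}\, w_{qc} \le 1, \qquad k = 1, \dots, M .
\end{equation*}
```

In this form each country chooses the weights most favourable to itself, subject to no country scoring above 1 under those same weights, which is why every country on the frontier obtains a score of exactly 1.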

Random forests
RF is a non-parametric supervised learning method introduced by Breiman (2001). It has been proven to be a reliable method for classification problems (Hamidi & Berrado, 2018; Hastie et al., 2009). RF draws random bootstrap samples from a dataset and fits a decision tree to each sample using the identified features (variables); the trees are then combined by so-called 'bagging', with each tree voting for the final classification (Hastie et al., 2009). The relationship between CIs and RF emerged recently in the fields of data mining and machine learning, to examine the robustness of the classifiers (Setiawan et al., 2019). In this paper, the idea behind the use of RF is to check the robustness of the PCA-DEA results, by using the 21 variables in Table 1 to classify the countries and see to what extent this classification matches the PCA-DEA results. Another quality of RF is its ability to assess the 'importance' of every variable in producing the classification (i.e., which variables played an effective role in the classification of the countries?). Before running the RF, all countries were divided into three groups: (1) countries in the top quartile of the PCA-DEA ranking, labeled as 'Efficient'; (2) countries in the lowest quartile of the PCA-DEA ranking, labeled as 'Highly inefficient'; and (3) countries in the two remaining middle quartiles, labeled as 'Inefficient'.
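The labelling step that precedes the RF run can be sketched as below. The function names, the rounding rule for quartile sizes, and the toy scores are assumptions made for illustration. The classifier itself is not shown; it could be trained on these labels with a library implementation such as scikit-learn's `RandomForestClassifier`, and `agreement` would then compare its predictions with the PCA-DEA labels.

```python
def label_by_quartile(scores):
    """Map {country: efficiency score} to the three classes used before RF."""
    ranked = sorted(scores, key=scores.get, reverse=True)  # best score first
    n = len(ranked)
    top_cut = n // 4          # size of the top quartile (rounding is an assumption)
    bottom_cut = n - n // 4   # first position of the lowest quartile
    labels = {}
    for position, country in enumerate(ranked):
        if position < top_cut:
            labels[country] = "Efficient"
        elif position >= bottom_cut:
            labels[country] = "Highly inefficient"
        else:
            labels[country] = "Inefficient"
    return labels

def agreement(reference, predicted):
    """Share of countries whose predicted class equals the reference class."""
    matches = sum(reference[c] == predicted[c] for c in reference)
    return matches / len(reference)

# Hypothetical PCA-DEA scores for eight countries.
scores = {"A": 1.0, "B": 0.95, "C": 0.9, "D": 0.8,
          "E": 0.7, "F": 0.6, "G": 0.5, "H": 0.4}
labels = label_by_quartile(scores)
print(labels)
```

With 129 countries this rule would place 32 countries in each outer class; a different rounding convention would shift the boundaries by one country.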

Principal component analysis
The purpose of PCA is to generate the weights for the linear combination V q. Thus, a one-component PCA was performed for every GII sub-pillar separately. The loading of every indicator on that component is taken as the weight of that indicator in the linear combination. The results show that the indicators in the sub-pillar Political environment have each gained an equal weight of 0.97. This case of "equal weights" among all the indicators in a given sub-pillar has occurred in several sub-pillars, such as Business environment, Research & development, Information & communication technologies, Investment, and Online creativity. These sub-pillars span both the input and the output side of the GII. Meanwhile, indicators such as GERD financed by abroad, Growth rate of PPP$ GDP/worker, and Creative goods exports have gained considerably low weights of less than 0.1. Table 1 provides the weights associated with each indicator.
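As an illustration of how loadings on a single component can serve as weights, the sketch below computes first-component loadings for one hypothetical sub-pillar. It assumes a PCA on the correlation matrix and uses plain power iteration; the data, the function names and the choice of correlation rather than covariance are assumptions of this sketch, not details confirmed by the paper.

```python
import math


def correlation_matrix(columns):
    """columns: list of equal-length lists, one list per indicator."""
    n = len(columns[0])
    standardized = []
    for col in columns:
        mean = sum(col) / n
        sd = math.sqrt(sum((x - mean) ** 2 for x in col) / (n - 1))
        standardized.append([(x - mean) / sd for x in col])
    k = len(columns)
    return [
        [sum(a * b for a, b in zip(standardized[i], standardized[j])) / (n - 1)
         for j in range(k)]
        for i in range(k)
    ]


def first_component_loadings(corr, iterations=500):
    """Leading eigenpair by power iteration; loading = eigenvector * sqrt(eigenvalue).

    Starts from a uniform vector, which is adequate when the indicators
    are positively correlated (an assumption of this sketch).
    """
    k = len(corr)
    v = [1.0 / math.sqrt(k)] * k
    eigenvalue = 0.0
    for _ in range(iterations):
        w = [sum(corr[i][j] * v[j] for j in range(k)) for i in range(k)]
        eigenvalue = math.sqrt(sum(x * x for x in w))
        v = [x / eigenvalue for x in w]
    return [x * math.sqrt(eigenvalue) for x in v]


# Three hypothetical indicators observed for five hypothetical countries.
indicators = [
    [0.2, 0.4, 0.5, 0.7, 0.9],
    [0.1, 0.5, 0.4, 0.8, 0.9],
    [0.3, 0.3, 0.6, 0.6, 1.0],
]
loadings = first_component_loadings(correlation_matrix(indicators))
```

For a sub-pillar with only two indicators, this correlation-based sketch returns the same loading for both by construction.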

Data envelopment analysis
After reproducing the 21 variables (sub-pillars), we applied the benefit of doubt DEA. The efficiency scores yielded by the PCA-DEA model are presented in Table 2. It shows that 31 countries obtain a relative efficiency score = 1 and the rest of the countries obtain relative efficiency scores < 1. This suggests that, if all the GII indicators are considered as outputs (i.e., the benefit of doubt), 31 out of 129 countries perform efficiently with regard to national innovation performance. On the other hand, the national innovation systems in the remaining 98 countries, which obtained a score < 1, are considered to be performing inefficiently to different degrees: the closer the score is to 1, the better the country is performing. The equal scores of the 31 countries result from the discrimination power problem of the DEA (see Sect. "Composite indicator weighting").
However, to reach the final ranking for all the countries, the case of 31 equal efficiency scores must be resolved. To do that, for the top 31 countries only, we examine the efficiency of each country under each GII pillar separately. In practice, this implies running the previous benefit of doubt DEA model for every country seven times (once for every pillar). In each pillar, we consider the output variables to be the variables from Table 1 under that specific pillar, with the input equal to 1. This step evaluates the efficiency of every country with regard to every GII pillar, one by one. The average of these seven sub-DEAs for every country is then the discriminating value used to form the final ranking among these 31 countries. As far as the GII pillars are individually concerned, the results show that the performance of the national innovation system in Switzerland is the best, followed by the USA and the Republic of Korea (see Table 3). Lastly, by replacing the score values = 1 in Table 2 with the average value of the seven sub-DEAs for these 31 efficient countries, a final ranking for all countries can be generated (see Table 4).
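The tie-breaking step can be sketched as follows. This is a minimal illustration under stated assumptions: the scores are hypothetical, only three pillars are shown instead of seven, and the sketch assumes that efficient countries keep the top positions and are ordered among themselves by their average sub-DEA score. The function name `final_ranking` is introduced here.

```python
from statistics import mean


def final_ranking(overall, sub_scores):
    """Order countries by overall score, breaking ties among efficient ones.

    overall: dict country -> overall efficiency score (1.0 means efficient)
    sub_scores: dict country -> list of pillar-level sub-DEA scores,
                given only for the efficient countries
    """
    def sort_key(country):
        # Assumption: the average sub-DEA score only breaks ties at score 1.
        tie_breaker = mean(sub_scores[country]) if country in sub_scores else 0.0
        return (overall[country], tie_breaker)

    return sorted(overall, key=sort_key, reverse=True)


# Hypothetical scores for four countries; A and B are tied at 1.0.
overall = {"A": 1.0, "B": 1.0, "C": 0.93, "D": 0.71}
sub_scores = {"A": [0.9, 1.0, 0.8], "B": [1.0, 1.0, 0.95]}
ranking = final_ranking(overall, sub_scores)
```

Here B is placed ahead of A because its average sub-DEA score is higher, while C and D keep the order given by their overall scores.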
For the purpose of comparison, Table 4 shows the PCA-DEA ranking of all countries compared to the GII 2019, where a negative difference indicates moving backward and a positive difference indicates moving forward in the new ranking. The comparison considers two aspects: (1) the absolute value of the difference between the country's position in the GII rank and the PCA-DEA rank; (2) the distribution of the absolute value of the difference over the GII rank.
The comparison shows a noticeable difference in the ranking between the GII 2019 and the PCA-DEA score. Specifically, the average absolute value of the difference between the two ranks is 10.9 positions, whereas some countries, such as India and Macedonia, change their position by more than 40 places. Other countries, such as Switzerland, maintain the same position. Moreover, the results show that countries that lie in the middle of the GII ranking display the greatest absolute difference in their positions (i.e., the greatest change in the ranking in both directions, forward and backward), while countries that lie at the top or bottom of the GII show smaller changes in ranking (see Fig. 1).
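The two comparison measures can be sketched as follows. The ranks below are hypothetical and the function name `compare_ranks` is introduced here for illustration; the sign convention follows the text, with a positive difference meaning the country moves forward in the new ranking.

```python
def compare_ranks(gii_rank, new_rank):
    """Return per-country (diff, abs diff) and the mean absolute difference.

    diff = GII position - PCA-DEA position, so a positive value means
    the country moves forward (to a better position) in the new ranking.
    """
    rows = {}
    for country, gii_position in gii_rank.items():
        diff = gii_position - new_rank[country]
        rows[country] = (diff, abs(diff))
    mean_abs = sum(abs_diff for _, abs_diff in rows.values()) / len(rows)
    return rows, mean_abs


# Hypothetical positions for four countries.
gii_rank = {"A": 1, "B": 2, "C": 3, "D": 4}
new_rank = {"A": 1, "B": 4, "C": 2, "D": 3}
rows, mean_abs = compare_ranks(gii_rank, new_rank)
```

Plotting the absolute differences against the GII position gives the kind of distribution summarized in Fig. 1.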

Random forests
To check the robustness of the PCA-DEA results used in this paper, the RF technique was applied. All countries were divided into three groups. The first group consists of the 31 efficient countries in the PCA-DEA rank and is labeled "Efficient". The second group consists of the lowest 30 countries in the PCA-DEA rank and is labeled "Highly inefficient". The third group consists of the remaining 68 countries in the middle of the PCA-DEA rank and is labeled "Inefficient". The RF classification matches the PCA-DEA grouping for 88% of the countries. In detail, RF classified 26 of the 31 efficient countries in the PCA-DEA ranking as efficient, while 5 countries shifted to inefficient. Of the lowest 30 countries in the PCA-DEA ranking, 26 were classified as highly inefficient by the RF, while 4 shifted to inefficient. Finally, RF classified 62 of the 68 countries in the middle of the PCA-DEA ranking as inefficient, while 3 countries shifted to efficient and 3 shifted to highly inefficient (see Table 5).
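The 88% agreement follows directly from the counts reported above, as the short check below shows. The dictionary layout is a choice made for this illustration; the zero entries are implied by the text, since no efficient country shifted to highly inefficient and vice versa.

```python
# Rows: PCA-DEA group; columns: group assigned by the RF (counts from the text).
confusion = {
    "Efficient":          {"Efficient": 26, "Inefficient": 5,  "Highly inefficient": 0},
    "Inefficient":        {"Efficient": 3,  "Inefficient": 62, "Highly inefficient": 3},
    "Highly inefficient": {"Efficient": 0,  "Inefficient": 4,  "Highly inefficient": 26},
}

total = sum(sum(row.values()) for row in confusion.values())
correct = sum(confusion[label][label] for label in confusion)
accuracy = correct / total

print(f"{correct}/{total} = {accuracy:.1%}")  # prints "114/129 = 88.4%"
```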
Additionally, RF highlights the importance of each variable in the classification. For example, the input sub-pillars "Political environment", "Education", and "Research & development" play a significant role in determining the rank of the country, while the sub-pillars "Investment" and "Trade, competition, and market scale" play a minimal role. Likewise, output-side sub-pillars such as "Knowledge impact" and "Creative goods and services" are more important than "Knowledge diffusion" and "Intangible assets" (see Table 6).

Conclusions
Innovation is one of the economic factors leading to societal progress, technological development, and economic growth. As a result, the measurement of innovation performance and capacity has received increasing attention, not only at the academic level, but also at the policy and societal levels. Composite Indicators (CIs) such as the Global Innovation Index (GII) and the European Innovation Scoreboard are well-recognized and accepted instruments for this task. Although they serve as a vehicle of communication and a means of raising awareness of innovation, they also play a significant role in shaping and directing innovation policies (Edquist et al., 2018).
The literature dealing with the construction of CIs has evolved considerably in recent years, with new contributions emphasizing that the methodologies underlying these CIs should be considered carefully, in terms of weighting, aggregation, and robustness. In this paper, we introduce a data envelopment analysis (DEA) model that relies on the weights provided by the application of principal component analysis (PCA). This PCA-DEA model is applied to the 2019 edition of the GII, which measures innovation at the national level in 129 countries worldwide. The paper aims to examine the sensitivity of the results provided by the GII to variations in the methodology. In addition, the robustness of the results provided by this alternative methodology is tested using the random forest (RF) methodology. The rationale for using data-driven techniques such as PCA, DEA, and RF is that they reduce subjective intervention during the construction of the CI.
The PCA-DEA model used in the paper provides the relative efficiency of innovation of the countries considered in the GII, which has enabled us to draw a final ranking. This PCA-DEA model relies on the "Benefit of doubt" technique, which measures the effectiveness of the national innovation systems, since all the countries are attributed a single input with a value equal to one (i.e., no minimization function for the input side).
The comparison between the GII rank and the PCA-DEA rank shows that the countries that lie in the middle of the GII rank have changed their positions considerably in terms of the absolute value of the difference. Meanwhile, countries at the top or bottom of the GII rank have changed only minimally, although some of the top countries, such as Singapore, dropped 10 positions under the PCA-DEA model. This is explained by the attribute-based weighting system of the GII (i.e., no weights or equal weights): the GII uses the unweighted average of the GII sub-indices to calculate the general index. By contrast, this paper relies on the objective weights provided by the PCA for each of the GII sub-indices.
One of the conclusions of the paper is that the GII rank appears to be robust for the countries located at the top and the bottom of the ranking, but not for the countries in the middle. In other words, the positioning of the countries in the middle of the GII rank is more sensitive to modifications of the construction method of the GII than that of the countries at the top or the bottom. This can be attributed to the influence of subjective means of generating and assigning the weights for the indicators, such as the Budget Allocation scheme, since there is no consensus about the importance of one indicator over the others. In addition, what seems to be important for a certain group of countries is not necessarily important for other countries (Stavbunik & Pelucha, 2019; Zabala-Iturriagagoitia et al., 2007a).
As regards the new ranking provided by the alternative PCA-DEA model, the results of the RF methodology applied in the paper indicate that the new ranking is statistically robust. As a matter of fact, the RF assigns 88% of the countries to the same subgroups as the PCA-DEA model. In addition, the RF classification also helped to assess the relative importance of each variable (sub-pillar) in determining whether each country is efficient, inefficient, or highly inefficient.
This paper represents, to the best of our knowledge, the first contribution in which the robustness of the GII is assessed using alternative methodologies such as PCA and DEA. Consistent with the extant literature, particularly that related to the European Innovation Scoreboard (Barbero et al., 2021; Edquist et al., 2018; Zabala-Iturriagagoitia et al., 2007a, 2007b), our results also reveal that the choice of methodology has a direct impact on the results provided by composite indices. Accordingly, further editions of the GII should seek to apply alternative methodologies to the existing data, in order to increase the robustness and reliability of the rankings provided by the GII. This would allow for emphasizing different dimensions of the innovation system, highlighting those areas in which each country may show relative strengths and weaknesses. We contend that these alternative methodologies would provide additional information to policymakers, so that effective policies can be adopted for each innovation system.
Finally, one of the limitations of this research paper is that we used only a single year, 2019, as the data point. Further research could thus explore the extent to which potential deviations may also emerge when longer periods of time are considered. This is an emerging field of research where limited evidence exists to date (e.g., Aparicio et al., 2020; Zabala-Iturriagagoitia et al., 2021), but which shows enormous potential to provide a dynamic assessment of innovation performance worldwide.

Table 1
Output variables

Table 2
PCA-DEA efficiency scores

Table 3
The results of the seven sub-DEAs used to discriminate among the 31 relatively efficient countries

Table 4
Comparison between the final PCA-DEA rank and GII 2019 rank for all the countries

Columns: Country, PCA-DEA, GII 2019, Diff, Abs. value

Fig. 1
Distribution of the absolute value of the difference over the GII 2019 rank

Alqararah, Journal of Innovation and Entrepreneurship (2023) 12:61

Table 5
Random forests classification results

Table 6
Random forests variable importance
*MDA is the Mean Decrease Accuracy. **MDG is the Mean Decrease in Gini