\documentclass[a4paper,10pt]{article} \usepackage[utf8]{inputenc} \usepackage{float} \usepackage{listings} %%\usepackage{inconsolata} \usepackage{hyperref} \usepackage{subfig} \renewcommand*{\familydefault}{\sfdefault} \addtolength{\oddsidemargin}{-.875in} \addtolength{\evensidemargin}{-.875in} \addtolength{\textwidth}{1.75in} \addtolength{\topmargin}{-.875in} \addtolength{\textheight}{1.75in} \setlength\parindent{0pt} <>= require(knitr) options(width=60) @ %opening \title{TopKLists: Analyzing multiple ranked lists} %\VignetteEngine{knitr::knitr} %\VignetteIndexEntry{Usage of TopKLists} \author{Michael G. Schimek, Eva Budinska, Jie Ding, Karl G. Kugler, \\ Vendula Svendova and Shili Lin} \begin{document} %%\opts_chunk$set(concordance=TRUE) \maketitle \tableofcontents \newpage \section{Introduction} \label{sec:introduction} The ranking of distinct items has become mainstream in recent years. Examples include Web search engine results for query terms, institution league tables in higher education, preference rankings of brands, betting results in sports, results from microarray platforms in biotechnology, and meta-analysis of multiple study findings in medicine, among many others. The rank position of an object or institution might be the result of measuring the strength of evidence or of assessment based on expert knowledge or preference. The assessors can be a person or a technical device. Typically, such lists are between several thousand and several tens of thousands of items in length. However, only a comparably small subset of \emph{k} top-ranked items is usually informative. These are characterized by a strong overlap of their rank positions when they are ranked by several independent assessors. There are two basic statistical tasks: (i) Identification of the top-\emph{k} most conforming items. For this task, two lists are analyzed together (multiple lists in a pairwise manner). (ii) Calculation of a consolidated top-\emph{k} sublist with a new optimized ordering of the conforming items from two or more lists. For long ranked lists, (i) is a prerequisite of (ii) because for any kind of rank aggregation the top-$k$ list lengths of the individual top-$k$ lists need to be specified. For these two tasks this package offers several options which can be selected from three modules that are provided. Various options of high practical value are supported by a graphical user interface (GUI): \begin{enumerate} \item \texttt{TopKInference} provides exploratory nonparametric inference for the estimation of the top \emph{k} list length of paired rankings; \item \texttt{TopKSpace} provides several rank aggregation techniques (Borda, Markov Chain, and Cross Entropy Monte Carlo) which allow for the combination of input lists of different lengths, that may come from different underlying sets (spaces); \item \texttt{TopKGraphics} provides a collection of graphical tools for the visualization of the inputs to, and the outputs from, the other modules. \end{enumerate} \textbf{Whenever you refer to the package \texttt{TopKLists}, please cite \cite{haschi}, \cite{linding} and \cite{schimetal}}. In the following sections, we illustrate the usage of these modules. \subsection{Input data} The input data should be either lists of ranked object names (see the breast cancer example below), or lists of their actual rank positions. Data should be prepared in a \texttt{data.frame} format (for the truncation functions \texttt{compute.stream}, \texttt{j0.multi}, \texttt{TopKListsGUI}, \texttt{deltaplot},...) or a \texttt{list} format (for the aggregation functions \texttt{Borda}, \texttt{MC} and \texttt{CEMC}). The truncation functions allow for incomplete data sets (NA values). After loading the data, user can continue working in the console using all the functions directly, or start the graphical user interface (GUI) by running the \texttt{TopKListsGUI} function. \subsection{The breast cancer data} \label{subsec:breastdata} For illustrating the usage of this package, we selected three breast cancer microarray datasets (MDCC, TransBig and Pusztai - each list of the length 917) that include information on estrogen receptor positivity (ER+ vs ER-), which is one of the prognostic markers for breast cancer (for the accession numbers see Table \ref{tab:data}): \begin{table}[h!] \begin{center} \label{tab:data} \begin{tabular}{|c|c|c|c|} \hline % after \\: \hline or \cline{col1-col2} \cline{col3-col4} ... NAME & PubMed ID & GEO & Reference\\ \hline MDCC & 20676074 & \href{http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE20194}{GSE20194} & \cite{shi}\\ TransBig & 17545524 & \href{http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE7390}{GSE7390} & \cite{desmedt} \\ Pusztai & 20829329 & \href{http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE20271}{GSE20271} & \cite{tabchy}\\ \hline \end{tabular} \caption{Example dataset: names, PubMed IDs, GEO accession numbers and references} \label{tab:data} \end{center} \end{table} The presence of an estrogene receptor on cancer cells discriminates between two groups of breast cancer and therefore we expected ESR1 (estrogen receptor 1 gene) to be in the first place among the differentially expressed genes. Given the study designs, we also expected a substantial overlap in the top ranked genes. We begin by loading the data set and displaying the first part of the data frame, as described below. <>= library(TopKLists) data(breast) head(breast) @ \section{Methods} \label{sec:methods} \subsection{TopKInference} \label{subsec:topkinference} The nonparametric inference method of \cite{haschi} for the truncation of paired ranked lists forms the core part of the \texttt{TopKInference} module. The associated iterative algorithm, as implemented here, allows the estimation of the length, $k$, of a top-$k$ list in the presence of irregular and missing assignments. Overlap of rank positions in two input lists is represented by a sequence of indicators, $I$, where $I_j=1$ is the ranking, given by the second assessor to the object ranked $j$ by the first assessor, is not more than $\delta$ index positions distant from $j$, and otherwise $I_j=0$. The vector of indicators is represented by the \texttt{Idata} variable. The variables $I_j$ are assumed to follow a Bernoulli random distribution. This implies independence, which is motivated by $k\ll N$ and a strong random contribution due to irregular assignments in real data. However, \cite{haschi} could prove that their theoretical results obtained under the assumption of complete independence also applies in the situation of moderate $m$-dependence. Simulation study evidence (unpublished material) is in full support of these theoretical findings: in fact, even under substantial list dependence, stable truncation results can be obtained from \texttt{TopKInference}. \\ The algorithmic solution employed by \texttt{TopKInference} in the event of there being more than two ranked lists is outlined in Figure \ref{fig:topkinference}. There, the principle for the calculation of an overall index $k^\ast$ (a function of the individual $k$'s from the $\ell$ lists $L_i$; the maximum \texttt{maxK} is the default) based on all pairwise comparisons is outlined. Having obtained such an overall index, we arrive at truncated lists $T_i$. They can either be aggregated by graphical means (the \texttt{aggmap} of the \texttt{TopKGraphics} module) or by stochastic rank aggregation (\texttt{TopKSpace} module). Details of the approach taken for multiple lists can be found in \cite{schimybu}. \begin{figure} \centering \includegraphics[width=0.35\textwidth]{TopKInferenceFlowmap.pdf}\\ \caption{The inference concept to obtain $\ell$ truncated consensual lists $T_i$ from $\ell$ full ranked lists $L_i$} \label{fig:topkinference} \end{figure} \subsubsection{Construction of a dataset and execution of the inference procedure} In order to run the examples in this vignette, the TopKLists package must first be loaded: <>= library(TopKLists) @ The truncation point $j_0$, where noise takes over for a pair of ranked lists (i.e., the first index position after the end of the top-$k$ list), can be estimated for any prespecified distance $\delta$ and the chosen tuning parameter values $C$ and $\nu$. It should be noted that in the R source code $\delta$ is denoted by d, $C$ by const and $\nu$ by v. In this example we simulate a dataset with an assumed top length of $k=30$, hence the truncation point of $j_0=31$. <>= k = 30 set.seed(123) x = c(rep(1,k), rbinom(100, 1, 0.2)) x @ Now let us estimate $j_0$ for different values of $\nu$ using the \texttt{compute.stream} function. This outputs a list of estimated $j_0$ values, as well as other details of the algorithm's convergency. <>= v.vect=seq(2,length(x), by=2) #setting up a vector of the nu values resF=c() for (v in v.vect) { res=compute.stream(x, const=0.5, v) resF=rbind(resF,c(v,paste(res))) } colnames(resF)=c("v", "j0_est", "k","reason.break", "Js", "v.vector") head(resF) table(resF[,2]) @ The \texttt{compute.stream} function outputs the index \texttt{j0est}, where the information between the lists degenerates into noise; index \texttt{k=j0est-1}; the sequence of estimated \texttt{j0}s in each run ($J_s$); the reason why computation has ended (\texttt{reason.break}); and the preselected value of the parameter $\nu$ (\texttt{v}). \subsubsection{Visualization of the truncation results} \label{different_nu} The following plot (Figure~\ref{fig:truncPlot}) summarizes the obtained estimation results $\hat{j}_0$ for the specified range of pilot sample sizes $\nu$ and the assumed point of degeneration $j_0=31$. Small values of $\nu$ are the most appropriate here. <>= plot(resF[,1], resF[,2], pch=19, ylim=c(25, 40), xlab=substitute(nu), ylab=substitute(paste(hat(j)[0]))) abline(a=31, b=0, col="red") lines(resF[,1], resF[,2]) @ \subsection{TopKSpace} \label{sec:topkspace} The principle of the \texttt{TopKSpace} module is to consolidate information from the $l$ top-\emph{k} lists to arrive at an aggregate list, $AL$. As shown in Figure \ref{fig:topkspace}, the top-\emph{k} lists ($L_1, L_2, \ldots, L_l$) may not only be of different lengths, they may also come from studies or assessments that consider different sets of objects, hence the underlying spaces ($S_1, S_2, \ldots, S_l$) from which the top-\emph{k} lists are derived may actually be different. The goal of the inference in \texttt{TopKSpace} is to therefore find the top-\emph{k} list, $AL$, from the aggregate new space ($\cup_{i=1}^l L_i$), such that the weighted sum of distances between each of the input lists and $AL$ will be the minimum among lists of the same length. Two distance measures, Kendall's $\tau$ and Spearman's footrule, are available in the package. Both take the differences in the underlying spaces into account \cite{lin2010}. There are three common assumptions about the underlying spaces: {\em common-space} (all top-\emph{k} lists come from a single common space), {\em underlying space-dependent}, i.e. {\em assessment-} or {\em platform-dependent} (using the known spaces from which the top-\emph{k} lists were generated), and {\em top-k-space} (treating each top-\emph{k} list as its own space). Since underlying space-dependent represents the true underlying scenario, this method is recommended if such information is available. There are three classes of algorithms implemented in \texttt{TopKSpace}, namely Borda's method, Markov Chain (MC) algorithms \cite{lin}, and a Cross Entropy Monte Carlo (CEMC) method taking advantage of the new Order Explicit Algorithm (OEA) as described by \cite{linding}. The Borda and Markov Chain methods consist of heuristic algorithms which do not directly optimize the objective function (i.e., minimizing the weighted distances), whereas the CEMC method employs a Monte Carlo search procedure for achieving this optimization. Borda and Markov Chain algorithms run much faster than the Cross Entropy Monte Carlo algorithm, however the latter usually achieves better results. Nevertheless, simulation studies indicate that taking the underlying space into consideration has a much greater impact than using different algorithms. These three algorithms are implemented in three top-level functions within \texttt{TopKSpace}: \texttt{Borda}, \texttt{MC}, and \texttt{CEMC}. There are also two plotting functions: \texttt{Borda.plot} and \texttt{MC.plot} that help the user to visualize and compare the results. \begin{figure}[h] \centering \includegraphics[width=0.5\textwidth]{TopKSpaceSchema.pdf}\\ \caption{The optimization concept for the aggregation of $l$ full ranked lists $L_i$ into one list $AL$ under space consideration} \label{fig:topkspace} \end{figure} \subsubsection{Construction of three input lists and underlying spaces} \label{sec:data} Let us first produce the following three ranked lists of different lengths and their underlying spaces. Although this example is contrived, it is realistic in terms of the lengths and the number of top-\emph{k} lists. <<>>= set.seed(1234) L1=paste("Obj",1:30,sep="") L2=paste("Obj",c(1:10,31:40,11:15),sep="") L3=paste("Obj",c(1:10,16:20,11:15),sep="") input=list(L1,L2,L3) space1=space2=space3=paste("Obj",1:40,sep="") space=list(space1,space2,space3) @ As can be seen from the compositions of the three lists, they are of different lengths (30, 25, and 20 for L1, L2, and L3, respectively). Further, the aggregate candidate set is objects Obj1-Obj40, although two of the underlying spaces are larger, indicating that objects Obj41-Obj50 were not selected in the top-\emph{k} of the two corresponding lists. Fifteen of these 40 objects appear in all 3 lists, with objects Obj1-Obj10 ranked 1-10, in that order, in all three lists; objects Obj11-Obj15, however, have different rankings (they are ranked 11-15 for L1, 20-25 for L2, and 16-20 for L3). Five objects, Obj16-Obj20, appear in two of the lists, with rankings of 16-20 for L1 and 11-15 for L3. The remaining 20 objects, Obj21-Obj40, appear on only one list (objects Obj21-Obj30 are ranked 21-30 in L1, while objects Obj31-Obj40 are ranked 11-20 in L2). Since the number of potential aggregate top-40 lists is more than $8 \times 10^{47}$, enumerating all of them to find the ``true'' answer is not possible, but it would be reasonable to make a fairly accurate guess. There should be overwhelming information for ranking objects Obj1-Obj10 in the top-10. Objects Obj11-Obj15 should come next, followed by Obj16-Obj20. For the remaining of the objects, Obj31-Obj40 should be ranked ahead of Obj21-Obj30. In the following subsection, we will demonstrate the use of the three classes of algorithms (\texttt{Borda}, \texttt{MC}, \texttt{CEMC}) for aggregation of these three ranked lists. We will also demonstrate the use of an objective evaluation criterion to compare the relative performances of different algorithms. \subsubsection{Function calling} \label{TopKSpace_function_calling} \paragraph{The Borda algorithm} Four Borda scores are implemented in the \texttt{Borda} function of \texttt{TopKSpace}, namely the arithmetic mean (ARM), median (MED), geometric mean (GEO), and L2-norm (L2N). The output is a list with two elements: \texttt{TopK} provides the aggregate rankings and \texttt{Scores} gives the corresponding Borda scores for each of the four functions. Of the three arguments in \texttt{Borda}, only the first one ({\em input}) is required. If argument {\em space} is not provided, then the underlying space is assumed to be common among all top-\emph{k} lists implicitly, and the union of the top-\emph{k} lists is treated as the common space. If argument {\em k} is not specified, then the full ranked list (all elements in the union) is outputted. <<>>= outBorda=Borda(input,space) # "space" is explicitly specified; underlying space-dependent @ <<>>= outBorda1=Borda(input) #"space" is not specified; all lists are assumed to come from the common space (objects Obj1-Obj40) @ <<>>= outBorda2=Borda(input,space=input) # "space = input" indicates that this is the top-k space @ <<>>= sum(outBorda$Scores-outBorda1$Scores) sum(outBorda$Scores-outBorda2$Scores) @ Comment 1: Objects Obj1-Obj40, union of the three top-\emph{k} lists, are contained in all three spaces. As such, the results for which the spaces were explicitly specified should be the same as those with the common space assumption, which was exactly what we saw above. On the other hand, the underlying space-dependent results were different from the results assuming top-\emph{k} spaces (i.e., all lists were full ranked lists). <<>>= as.list(outBorda$TopK) @ Comment 2: All four scoring functions ranked objects Obj1-Obj10, in that order, in the top-10. However, despite our expectation that objects Obj11-Obj15, in that order, should be ranked 11-15, followed by objects Obj16-Obj20, the results were more of a mix of these two groups, but with the correct relative orders preserved within each group. The results are quite similar among the four scoring functions. This is not surprising due to degradation of ranking information beyond the top-10. Such information degradation can be seen by plotting the Borda scores as described in the following. \\[2ex] Plotting the Borda scores against the ranking can frequently reveal when information for ranking starts to diminish. For example, Figure~\ref{fig:borda}(a) shows that for the underlying space-dependent approach, a large gap appears between the Borda scores at rankings 10--11, and that the changes of the scores also slow down after this point. This can be similarly observed for the scores under the top-\emph{k} space assumption (Figure~\ref{fig:borda}(b)). \\ <>= Borda.plot(outBorda, k=40) # plot scores from underlying space-dependent analysis Borda.plot(outBorda2, k=40) # plot scores from top-k space analysis @ Note that the first argument of the \texttt{Borda.plot} function is the output from the Borda function, which is mandatory. If the number of ranked objects (second argument) is not specified, the plot will show scores for all objects in the union set. \\ \quad\\ %% \begin{figure}[h] %% \begin{center} %% %@ %% \includegraphics[scale=0.6]{bordaPlot.pdf}\\ %% \end{center} %% \caption{Plot of Borda scores against ranking: (a) results from platform-dependent %% analysis; (b) results from top-k space analysis.} %% \label{borda} %% \end{figure} \paragraph{Three MC algorithms} Three Markov Chain (MC) algorithms are implemented in the \texttt{MC} function in \texttt{TopKSpace}: MC1 (spam sensitive), MC2 (majority rule), and MC3 (proportional). Among them, \texttt{MC3} may be more appropriate for multi-platform omics problems given the potential of unique features of each data type. The output is a list with two elements for each of the \texttt{MC} algorithms (aggregate top-\emph{k} list: \texttt{MC1.TopK}, \texttt{MC2.TopK}, or \texttt{MC3.TopK}; stationary probability distribution: \texttt{MC1.Prob}, \texttt{MC2.Prob}, or \texttt{MC3.Prob}). Of the arguments in \texttt{MC}, only the first one, {\em input}, is required. If argument {\em space} is not provided, it is assumed that the underlying space is common among all top-\emph{k} lists, and the union of the top-\emph{k} list is supplied as the common space. If argument {\em k} is not specified, then the full ranked list (all elements in the union) is outputted. The other two arguments ($a$ and $delta$; the latter different from $\delta$ in \texttt{TopKInference}) are tuning parameters, whose default values were shown to work well in a number of examples. <<>>= outMC=MC(input,space) # "space" is explicitly specified; underlying space-dependent @ <<>>= outMCa=MC(input,k=30) # "space" is not specified, so it is the same as common space (O1-O40) @ <<>>= outMCb=MC(input,space=input) # "space = input" indicates that this is the top-k space @ <<>>= sum(outMC$MC2.Prob-outMCa$MC2.Prob) @ Comment 1: Similar to the discussion for running the \texttt{Borda} function, the results for which the space was explicitly specified were expected to be the same as those obtained under the common space assumption, which was exactly what we saw above (the two stationary distributions were identical). On the other hand, the common-space results were different from those assuming top-\emph{k} space (there were large discrepancies in the aggregated top-\emph{k} lists). <<>>= list(outMC$MC1.TopK, outMC$MC2.TopK, outMC$MC3.TopK) @ Comment 2: All three \texttt{MC} algorithms ranked objects Obj1-Obj10, in that order, in the top-10. \texttt{MC2} also ranked the next two groups, Obj11-Obj15 and Obj16-Obj20 in the correct order. However, the remaining objects (Obj21-Obj40) were ranked in the reverse of the expected order. \texttt{MC3}, on the other hand, had all the correct ordering except a transposition between the order of objects Obj15 and (Obj16, Obj17). The stationary probabilities can be visualized applying the plotting function (described in the following), which can be used to detect information degradation heuristically. \\[2ex] The stationary probabilities can be plotted by <>= MC.plot(outMC) @ A plot of the ordered stationary probabilities versus the ranking using the above command can contain useful information regarding the relative rankings of objects. For example, Figure~\ref{fig:equil} shows that for \texttt{MC2}, there was considerable information for ranking objects in the top-20, but there was practically no information ranking any object after that position (all stationary probabilities are the same). This was the reason why the rankings for objects Obj21-Obj40 were completely wrong as we saw in the output. On the other hand, for \texttt{MC1}, there was less information for ranking objects beyond the top-10 compared to \texttt{MC2}, thus objects Obj11-Obj20 were not ranked correctly. \texttt{MC3} appears to be a compromise between \texttt{MC1} and \texttt{MC2}, leading to a better aggregate ranked list. %% \begin{figure}[h] %% \begin{center} %% \includegraphics[scale=0.8]{equil.pdf}\\ %% \end{center} %% \caption{Equilibrium probabilities} %% \label{equil} %% \end{figure} Note that the first argument of the \texttt{MC.plot} function is the output from the \texttt{MC} function, which is mandatory. If the number of ranked objects (second argument) is not specified, the plot will show probabilities for all objects in the union set.\\ \quad\\ \paragraph{The CEMC algorithm} The Order Explicit Algorithm (OEA) in \cite{linding} is implemented as the \texttt{CEMC} function in \texttt{TopKSpace}. Of the three main input arguments in \texttt{CEMC}, only the first one ({\em input}) is required. If argument {\em space} is not provided, then all the underlying spaces are assumed to be a common one, and the union of the input top-\emph{k} lists is supplied as the common space. If argument {\em k} is not specified, then the full aggregate ranked-list (all elements in the union) is outputted. The function also has a number of other arguments, all tuning parameters, that can be set to run the OEA more efficiently. The default specified in the \texttt{CEMC} function constitutes a set of values that seem to work well for the examples/data that we have analyzed in \cite{linding}, but it is always a good idea to run OEA with multiple sets of tuning parameters to increase the chance of finding the global maximum; see \cite{linding} for some recommendations on how to choose these parameters. In fact, the output of \texttt{CEMC} contains the set of tuning parameters used in the current run. If users are unsure of what to use, they are recommended to run \texttt{CEMC} without setting any parameter and use the parameter file in the output ({\em input.par}) to fine-tune the parameters and apply them as input for a more refined analysis. For the convenience of the user, after the parameter values are modified, the file can be directly supplied as an input argument without having to enter the value of each tuning parameter separately. Other than {\em input.par}, there are two more elements in the output list. One is the aggregate top-\emph{k} list (\texttt{TopK}). The other is the multinomial probability matrix (\texttt{ProbMatrix}; each column represents the probability vector of a multinomial distribution and thus sum to 1, albeit with small rounding errors), from which the aggregate top-\emph{k} list is determined. <<>>= set.seed(12345) outCEMC=CEMC(input,space,N=4000,N1=400) # "space" is explicitly specified; underlying space-dependent @ <<>>= list(outCEMC$TopK) outCEMC$ProbMatrix[1:5,1:5] @ Comment 1: From the submatrix shown above for \texttt{CEMC}, it can be seen that item one is ranked no. 1, since the probability vector in the first column for item one is almost 1, but very small for the rest of the items. Similarly, item 2 is ranked no. 2, and so on. <<>>= outCEMC$input.par @ Comment 2: This file contains all the tuning parameter values (defaults) for running \texttt{CEMC}. The user may modify this file and use it as the input to the {\em input.par} argument in the \texttt{CEMC} function. \paragraph{Comparison of performance of the different algorithms} In the example given above, we discussed the performances of the different algorithms and assumptions since the underlying truth can be guessed to a large extend since the example was contrived. However, in a real data application, the ground truth is typically completely unknown, therefore, it would be helpful to have some objective criterion to evaluate the relative performances of the various algorithms. To this end, we have implemented two functions based on the modified Kendall distance (MKD) \cite{lin} in the \texttt{TopKSpace} module. The first function is \texttt{Kendall}, which computes the weighted sum of MKD between an aggregate top-\emph{k} list with the input lists. Of the input arguments, the first two ({\em input, aggregate}) are required. If argument {\em space} is not supplied, all input lists are assumed to come from the same underlying space, the union of all elements in the input. If user has prior information on which list is more informative, then that information can be utilized by specifying the weight vector {\em w}. The output of \texttt{Kendall} is the \texttt{MKD}, a single number. However, it would be much more meaningful to compare \texttt{MKD}s accross results from different algorithms to assess their relative performances. This can be achieved by invoking the \texttt{Kendall.plot} function, which not only provides a vector of \texttt{MKD}s but also provides a plot of the \texttt{MKD}s for ease of comparison visually. There are also two mandatory input arguments, \texttt{input} and \texttt{all.aggregates}, where \texttt{all.aggregates} is a list comprising aggregate top-\emph{k} lists from different algorithms to be compared. \\ \vspace{2.5mm} The following command will compute the \texttt{MKD} for the aggregate list form the Borda algorithm \texttt{ARM}. <<>>= KendallMLists(input,space, outBorda$TopK[,1]) all.aggregates=list(outBorda$TopK[,1],outBorda$TopK[,2],outBorda$TopK[,3], outBorda$TopK[,4],outMC$MC1.TopK,outMC$MC2.TopK,outMC$MC3.TopK,outCEMC$TopK) @ <>= Kendall.plot(input,all.aggregates,space,algorithm=c("ARM","MED","GEO","L2N","MC1","MC2","MC3","CEMC")) @ Runing the \texttt{Kendall} function returns the \texttt{MKD} for the aggregate list from the Borda algorithm \texttt{ARM}, whereas the \texttt{Kendall.plot} function returns a vector of \texttt{MKD}s, with the first one matching the output from the \texttt{Kendall} function. Figure~\ref{fig:plotKendall} is the graphical component of the \texttt{Kendall.plot} output. As we can see, the range on the y-axis is very small, indicating similar performance of all algorithms, especially those from \texttt{Borda}. \subsection{TopKGraphics} \subsubsection{Deltaplot} The function \texttt{deltaplot} provides an exploratory graph (for examples see Figures \ref{fig:deltaplot_complete} and \ref{fig:deltaplot_subset200}), designed to help the user with selecting the distance parameter $\delta$ in the \texttt{TopKListsGUI} interface. The input for the moderate deviation-based inference procedure is a sequence of $I$'s, taking either zero or one, forming a data stream representing the concordance of the paired ranks of an object $o$. The data stream depends on some value for the distance $\delta$. The parameter $\delta$ is defined by the shift in index positions of a particular object~$o$ in one list, say $L_i$, with respect to the other list, say $L_j$. This means that we assume concordance (i.e. $I = 1$) for an arbitrary object characterized by rank positions in $L_i$ versus $L_j$ , maximal $\delta$ index values apart. For the identification of an appropriate $\delta$ in real data analysis, the following strategy is employed: we compute all data streams for $\delta \in [0,1,2,\ldots,N-1]$ and order the data stream vectors column-wise according to increasing $\delta$ values. In this way, we obtain a $N\times N$ matrix $\Delta$. The ordered sequence of column sums (i.e. the number of 0's) for $\delta \in [0,1,2,\ldots,N-1]$) is the information we take advantage of in the so-called \textit{deltaplot}. It represents the reduction of discordance as a function of $\delta$. When all column sums remain zero, complete concordance is attained. \subsubsection{Aggregation map} The aggregation map can only be accessed via the \texttt{TopKListsGUI} (see Section~\ref{GUI}). It can be characterized as follows: We define an index $p=1,2,\ldots$ and combine $\ell-1$ aggregation levels (groupings of truncated lists) in one display: For each group of $\ell-p$ truncated lists down to the smallest group consisting of just one pair of lists, we (i) select an arbitrary reference list $L^0$ under the condition that it comprises $\max_i(\hat{k}_i)$ items among all pairwise comparisons in the group of rankings, (ii) print the symbols of its $\max_i(\hat{k}_i)$ items vertically from the highest to the lowest rank position, and (iii) add the aggregation information for all remaining $\ell-p$ rankings (pairwise list combinations) in the group, ordered according to descending list length. The aggregation map of the breast cancer dataset can be seen in Figure \ref{aggmap}. \subsubsection{Venn diagram} The Venn diagram can only be accessed via the \texttt{TopKListsGUI} (see Section~\ref{GUI}). The Venn diagram, and the respective Venn table, show overlaps of objects in the top-\emph{k} lists for all analyzed input lists. Asterisk-tagged objects are those from the final aggregate list, produced by \texttt{TopKSpace}. \section{Parameter selection} \label{sec:parameterselection} For making inference on pairs of ranked lists, tuning parameters have to be specified. This is for the following reason: For a given set of $N$ objects, arbitrary ranked lists can be constructed by successive permutations. However, this fact does not help in practice when we have to analyze realizations of such lists because they comprise irregularities in terms of position shifts, inverted orderings, missing assignments, etc. As a consequence, a unique top-$k$ list or a complete set of top-conforming objects does not exist. This is the reason why the truncation algorithm (and any other algorithm) in \texttt{TopKLists} needs to be controlled by tuning parameters. We therefore adhere to the well established notion of top-$k$ list\underline{\textbf{s}}. By definition, a top-$k$ list consists of $k$ items. The next index value after $k$ is $j_0$, the point of overlap degradation ($k$=$j_0$-1).\\ \quad\newline There are two main parameters that need to be specified for the truncation algorithm: \begin{enumerate} \item distance parameter $\delta$ \item pilot sample size $\nu$ \end{enumerate} Besides the $\delta$ and $\nu$ parameters, the algorithm uses a constant $C$ which allows us to compensate for poor separability between the informative top parts and the remaining random parts of the input lists (the suggested interval is $0.25>= deltaplot.dir = paste0(tempdir(), "/deltaplot") dir.create(deltaplot.dir, showWarnings = FALSE) subplot.dir = paste0(tempdir(), "/subplot") dir.create(subplot.dir, showWarnings = FALSE) @ <>= library(TopKLists) data(breast) a=deltaplot(breast, deltas = seq(0,300, by=5), directory=deltaplot.dir) @ \begin{figure}[H] \centering \includegraphics[scale=0.35]{\Sexpr{gsub("\\\\", replacement="/", deltaplot.dir)}/deltaplotL2-3.pdf}\\ \caption{Deltaplot for lists MDCC vs. Pusztai and vice versa} \label{fig:deltaplot_complete} \end{figure} When observing the deltaplot for the first two breast cancer lists (Figure \ref{fig:deltaplot_complete}), it is not immediately apparent which value for $\delta$ should be used. This is because there is no visible point where the \texttt{deltaplot}'s decrease is changing its direction. \\ \vspace{2.5mm} It can also be observed from Figure~\ref{fig:deltaplot_complete} that switching the reference list from, for example \texttt{Putsztai} to \texttt{MDCC} or vice versa, has only little impact on the graph. Therefore, the order of the lists is not relevant. In order to choose a suitable value for $\delta$, we must limit the deltaplot calculation to a smaller subset of the investigated lists. There is no general rule how large this subset should be, but users are encouraged to try a subset of the first 20\% - 50\% objects. One should search for those points where the decrease suddenly slows down. If there are several such points, the user should consider the smallest one as the suitable $\delta$.\\ \vspace{2.5mm} In our breast cancer dataset (of length 917), we took approximately the first 20\%, specifically the first 200 objects. This subset can be specified directly in the \texttt{deltaplot} function as using the parameter \texttt{subset.lists}. Please note that this parameter takes the number of objects (not the percentage) as an input. Additionally, the \texttt{subplot} parameter can be used, which magnifies a selected range of small $\delta$-values in a subplot, defined by the \texttt{perc.subplot} parameter (the input is a percentage of the main graph, e.g. 50\% of the graph is defined as \verb"perc.subplot=50"). <>= a=deltaplot(breast, deltas = 1:50, subset.lists=200, subplot = TRUE, perc.subplot=50, directory=subplot.dir) @ \begin{figure}[H] \centering \includegraphics[scale=0.35]{\Sexpr{gsub("\\\\", replacement="/", subplot.dir)}/subplotL2-3.pdf}\\ \caption{Deltaplot for lists MDCC vs. Pusztai for first 200 objects} \label{fig:deltaplot_subset200} \end{figure} In Figure \ref{fig:deltaplot_subset200}, it can be seen that after $\delta=6$ the decrease of the deltaplot is not as pronounced as for the values $7,8,9$ and $10$. Therefore, $\delta=6$ would be the first reasonable choice. \subsection{$\nu$ selection} Another important tuning parameter is the pilot sample size $\nu$, which controls the degree of irregularity of the rankings. For this smoothing parameter any positive integer value is admissible. The effect of the choice of the $\nu$ parameter on an arbitrary dataset can be seen in section \ref{different_nu}. For most datasets values between $2$ and $10$ are appropriate. Very large datasets require larger values of $\nu$ in the order of tens. For our breast cancer data of length 917, we used $\nu=10$. \section{Data analysis} \label{sec:analysis} The purpose of the \texttt{TopKLists} package is to answer the following two main questions: \begin{enumerate} \item What is the top-$k$ range of highest rank overlap between lists? \item Which objects in which order form the (top-$k$) consolidated common list? \end{enumerate} The overall procedure leading to these answers is depicted in Figure \ref{fig:flowchart}. Red is connected to the search for the index top-$k$ and blue to the aggregation of lists. Both tasks can be targeted independently for arbitrary list lengths, but users should be warned, that the aggregation functions are computationally extensive and might take a significant amount of time. That is why we recommend, for lists longer than about 100, to truncate them at the estimated index max $\hat{k}$ to speed up the aggregation algorithms. As mentioned in the introduction, the user can achieve these results using the console or the GUI. \subsection{Via console} \subsubsection{Estimation of the index $k$} The calculation of the overall truncation index $k$ for pre-specified values of $\delta$ and $\nu$ (see section \ref{sec:parameterselection} about their choice) is performed using the \texttt{j0.multi} function. This function takes the input lists, calculates for all pairwise list combinations \texttt{Idata} vectors of 0's and 1's (see section \ref{subsec:topkinference}), and then estimates the $\hat{j}_0=\hat{k}+1$. Finally, the $k$ is calculated as the maximum of all pairwise $\hat{k}$ values, and outputted as \texttt{maxK}. Besides that, the function provides partial results for each pair of lists (under \texttt{L}), and the \texttt{Idata} vector for each pair of lists (under \texttt{Idata}). <<>>= library(TopKLists) data(breast) res = j0.multi(breast, d=6, v=10) sapply(res, head) @ We can see that the estimated \verb"maxK" is 14, i.e. there is a strong concordance between the first 14 objects of the input lists and any concordance beyond the index 14 is most likely due to random effects.\\ The function \texttt{j0.multi} itself uses the functions \texttt{prepare.idata} to prepare the \texttt{Idata} vector of 0's and 1's, and the function \texttt{compute.stream} to calculate the point of degradation $\hat{j}_0=\hat{k}+1$ (see the \href{http://cran.r-project.org/web/packages/TopKLists/TopKLists.pdf}{Reference Manual} for more information on these functions).\\ \quad\newline \subsubsection{Aggregation of lists} To aggregate multiple lists, here we use the three different aggregation methods (\texttt{Borda}, \texttt{MC}, and \texttt{CEMC}) for comparative purposes. First we truncate the lists to the length of $k=14$, the estimated index for the most overlapping genes of the three lists. <<>>= k = res$maxK TransBig=as.character(breast[1:k,1]) MDCC=as.character(breast[1:k,2]) Pusztai=as.character(breast[1:k,3]) input=list(TransBig,MDCC,Pusztai) @ Then we have to specify the underlying space, i.e. the set of objects the input lists come from. If space is not specified, all lists are treated as coming from a common space defined by the union of all input lists. The underlying space is defined by the parameter \texttt{space}, which should be in the \texttt{list} format - one set of objects for each input list. This is because each input list could generally come from a different set of objects (different space). In our case, the set of genes contained in each list is not identical and each list can contain a different set of 14 top genes. Since each of the top-14 gene lists come from the common set of 917 genes, we shall define the space as the set of all the genes present in these 3 truncated lists. In this case there are 22 unique genes (variable \texttt{common}). <<>>= common=unique(unlist(input)) space=list(common,common,common) @ Finally, we can run the aggregation functions. <<>>= outBorda=Borda(input,space) outMC=MC(input,space) outCEMC=CEMC(input,space,N=2000) outCEMC$TopK[1:k] @ The output printed gives the consolidated top-14 genes in \texttt{CEMC}-optimized rank order. The \texttt{Borda}, \texttt{MC} and \texttt{CEMC} results can be displayed in matrix form for comparison. As described in detail in section \ref{TopKSpace_function_calling}, the \texttt{Borda} function comprises four Borda algorithms - arithmetic mean (ARM), median (MED), geometric mean (GEO), and L2-norm(L2N). The \texttt{MC} function comprises three Markov chain algorithms - MC1 (spam sensitive), MC2 (majority rule), and MC3 (proportional). The \texttt{CEMC} function consist of solely one algorithm, Cross Entropy Monte Carlo. <<>>= agg=list(ARM=outBorda$TopK[,1],MED=outBorda$TopK[,2],GEO=outBorda$TopK[,3], L2N=outBorda$TopK[,4],MC1=outMC$MC1.TopK, MC2=outMC$MC2.TopK,MC3=outMC$MC3.TopK,CEMC=outCEMC$TopK) head(do.call(cbind, agg)) @ Additionally, we can plot the results from the different functions. <>= Kendall.plot(input,agg,space,algorithm=c("ARM","MED","GEO","L2N","MC1","MC2","MC3","CEMC")) @ Note: The results indicate that Borda with the median function (MED) and the CEMC algorithm perform the best in this example as they lead to the smallest distance between the aggregate and individual lists (Figure \ref{fig:aggplot}). \subsection{Via the GUI} \label{GUI} The \texttt{TopKLists} package comprises a graphical user interface \texttt{TopKListsGUI} that offers applied researchers easy and straightforward access to all the functionality of \texttt{TopKInference} as well as access to the aggregation technique \texttt{CEMC} of \texttt{TopKSpace}, and the \texttt{aggmap} graphical aggregation tool contained in \texttt{TopKGraphics}. Moreover, Venn-diagrams and Venn-tables for the summary of rank aggregation results are accessible. The suite of functions supported by the \texttt{TopKListsGUI} facilitates the exploratory analysis of multiple ranked lists. Interactive input facilities (such as a slider for the dynamic specification of the distance $\delta$) allow for the real-time analysis and visualization of results. These results correspond to the distance $\delta$ or the pilot sample size $\nu$ (note that, for large list lengths of $N$, many numbers of calculations must be performed and longer waiting times should be expected). \\ To open the GUI using the breast cancer data introduced in section \ref{subsec:breastdata}, run <>= TopKListsGUI(breast) @ The window shown in Figure\ref{topkguistart} appears, allowing the user to perform an analysis of two or more lists. The TopKListsGUI window is divided into four panels: arguments, analysis status, delta-slider and results. In the arguments panel (Figure \ref{topkguiarg}), the user is informed about the number of lists and objects in the dataset. Further, the user can set up variable values for the top-\emph{k} list computation. The user must select $\nu$ (default value is 10) and choose the $\delta$ range for which the top-\emph{k} lists should be calculated (the default is from 0 to 10, in steps of 1). A threshold for the minimum percentage of top lists comprising an object, used for gray-shading in the aggregation map, can also be selected (default is 50). The \texttt{TopKSpace} algorithm for the generation of the aggregate list is set by default to \texttt{CEMC}. Last but not least, this panel contains the calculate button, which starts the analysis. In the bottom section of \texttt{TopKListsGUI} the status of the analysis is shown. Once the analysis is complete, the delta-slider (Figure \ref{topkguislider}) allows the user to switch between the results, obtained using different values for $\delta$ interactively. The moving bar shows the current value of $\delta$, and the aggregation map updates accordingly. Moving the bar to the left or right allows the user to see results for smaller or larger $\delta$ values. \begin{figure} \centering \includegraphics[width=0.70\textwidth]{TopKLists_workflow.pdf}\\ \caption{Flowchart} \label{fig:flowchart} \end{figure} \begin{figure}[H] \centering \includegraphics[scale=0.35]{TopKListsGUI-1.png}\\ \caption{Main window of the \texttt{TopKListsGUI} \label{topkguistart}} \end{figure} \begin{figure}[H] \centering \includegraphics[scale=0.35]{TopKListsGUI-1-Arguments.png}\\ \caption{The arguments panel of the \texttt{TopKListsGUI} main window \label{topkguiarg}} \end{figure} \begin{figure}[H] \centering \includegraphics[scale=0.4]{TopKListsGUI-1-DeltaSlider.png}\\ \caption{The delta-slider section of the \texttt{TopKListsGUI} main window \label{topkguislider}} \end{figure} \section{Graphical representation} \label{sec:graphical} Once the analysis is complete, the main panel in the GUI serves the visualization of the results for a selected $\delta$ value (the position of the delta-slider). The results of the analysis are displayed in three tabs, described below. \subsection{Aggregation map} The aggregation map (Figure \ref{aggmap}) shows the top-$k$ list of overlapping genes and their respective distances. It displays aggregation groups, each with a different reference list $L^0$ (in our case two groups - first with \texttt{Pusztai} as the reference list, second with \texttt{MDCC} as the reference list). The aggregation information per symbol, item, and group consists of three measures represented by colored triangles and rectangles, respectively, outlined in an array format: \begin{description} \item[a)] The \textbf{membership} of an individual item in the top-\emph{k} lists. \textit{Yes} is denoted by the color `grey' and \textit{no} by the color `white'. \item[b)] The \textbf{distance} $d$ of the rank of an individual item $o\in L^0$ from its position in another list, is denoted by a triangle color scaled from `red' \textit{identical} to `yellow' \textit{far distant}. An additional integer value gives the numerical distance between the item's rank positions, a negative sign means ranked lower, and a positive sign means ranked higher in $L$ with respect to $L^0$. \item[c)] The rectangular of a symbol takes on the color `grey' when the \textbf{percentage} of $d \leq \delta$ across the columns of a group is above some prespecified threshold, and `white' otherwise. \end{description} As shown in Figure\ref{aggmap}, the best overlap was found when comparing the MDCC and TransBig lists against the Pusztai list, for $\delta = 6$ (see Section~\ref{sec:introduction} for a description of the example data). \vspace{2 cm} \begin{figure}[H] \centering \includegraphics[scale=0.35]{aggmap.png}\\ \caption{Aggregation map result of the \texttt{TopKListsGUI} \label{aggmap}} \end{figure} \begin{figure}[H] \centering \includegraphics[scale=0.35]{summarytable.png}\\ \caption{Summary table result of the \texttt{TopKListsGUI} \label{topkgui_summary}} \end{figure} \begin{figure}[H] \centering \includegraphics[scale=0.35]{venndiagram.png}\\ \caption{Venn-diagram and Venn-table results of the \texttt{TopKListsGUI} \label{venn}} \end{figure} \vspace{2 cm} \subsection{Summary table} The summary table displays a table of objects present in at least one of the top-\emph{k} lists, as shown in the aggregation map. The table describes each object name, position of the object in each of the lists, sum of ranks, frequency in the input lists, and frequency in the truncated lists. Please note, that the objects which have assigned the tag `YES' in the first column are in the top-\emph{k} set of genes selected by \texttt{CEMC} (under default tuning parameters as explained in Section~\ref{sec:topkspace}). The table is ordered in ascending order with regard to the sum of ranks of the objects in the three lists. \subsection{Venn-diagram and Venn-table} The produced Venn-diagram and Venn-table show overlaps of objects in the top-\emph{k} lists of all input lists in the analysis. The asterisk-tagged objects, in this example the genes, are those that belong to the final aggregate list as returned by the \texttt{TopKSpace} module. \section{Summary of breast cancer data results} As can be seen in the previous section, the estimated \verb"maxK" is 14, with 13 genes present in all three top-\emph{k} lists. Another top-ranked gene (BLOC1S1; rank 13) is only supported by Pusztai. Alltogether, the first 14 genes in the input lists are the most informative ones and should be considered as related to breast cancer. In the summary table (Figure \ref{topkgui_summary}), 14 genes are tagged YES, meaning that they belong to the aggregate list calculated by the CEMC method. Most consistent are the ESR1 and TBC1D9 genes, occupying the first two positions in all three lists, as displayed in the aggregation map (with a distance of 0 in both MDCC and TransBig compared to Pusztai) and in the consolidated aggregate list obtained from \texttt{CEMC}. The Venn-diagram and Venn-table for the same data provide insight into the overlap characteristics of the top-ranked genes. Platform differences can thus be easily identified. \bibliographystyle{plain} \begin{thebibliography}{} \bibitem{desmedt} Desmedt C., Piette F., Loi S., Wang Y. et al. (2007). Strong time dependence of the 76-gene prognostic signature for node-negative breast cancer patients in the TRANSBIG multicenter independent validation series. \emph{Clin Cancer Res}, \textbf{13}, 3207-3214. \bibitem{haschi} Hall, P. and Schimek, M. G. (2012). Moderate deviation-based inference for random degeneration in paired rank lists. \emph{J. Amer. Statist. Assoc.}, \textbf{107}, 661-672. \bibitem{linding} Lin, S. and Ding, J. (2009). Integration of ranked lists via Cross Entropy Monte Carlo with applications to mRNA and microRNA studies. \emph{Biometrics}, \textbf{65}, 9-18. \bibitem{lin2010} Lin, S. (2010a). Space oriented rank-based data integration. \emph{Statistical Applications in Genetics and Molecular Biology}, \textbf{9}, Article 20. \bibitem{lin} Lin, S. (2010b). Rank aggregation methods. \emph{Wiley Interdisciplinary Reviews: Computational Statistics}, \textbf{2}, 555–570. \bibitem{schimetal} Schimek, M. G., Budinská, E., Kugler, K. G., Švendová, V., Ding, J., Lin, S. (2015). TopKLists: a comprehensive R package for statistical inference, stochastic aggregation, and visualization of multiple omics ranked lists. \emph{Statistical Applications in Genetics and Molecular Biology}, \textbf{14(3)}: 311-316. \bibitem{schimybu} Schimek, M. G., My\v{s}i\v{c}kov\'{a}, A. and Budinsk\'{a}, E. (2012). An inference and integration approach for the consolidation of ranked lists. \emph{Communications in Statistics - Simulation and Computation}, \textbf{41:7}, 1152-1166. \bibitem{shi} Shi L., Campbell G., Jones W.D., Campagne F. et al. (2010). The MicroArray Quality Control (MAQC)-II study of common practices for the development and validation of microarray-based predictive models. \emph{Nat Biotechnol}, \textbf{28}, 827-838. \bibitem{tabchy} Tabchy A., Valero V., Vidaurre T., Lluch A. et al. (2010). Evaluation of a 30-gene paclitaxel, fluorouracil, doxorubicin, and cyclophosphamide chemotherapy response predictor in a multicenter randomized trial in breast cancer. \emph{Clin Cancer Res} \textbf{16}, 5351-5361. \end{thebibliography} \end{document}