In the era of big data, choosing which experiments to conduct to fill the gaps in the data at hand is becoming a challenge. The selected experiments should be targeted so that they yield the most information about the problem of interest. Conducting targeted experiments reduces redundancy in the data and thus saves time and money. The study of targeted experiments, called optimal experimental design (OED), can be traced back to the 1950s [1]. Active learning, a subfield of machine learning, studies a similar problem.

An OED problem boils down to selecting the datapoints that are most informative for answering the question of interest. Over the years, various criteria have been proposed to quantify the information content of an unobserved datapoint. In all cases, the methods solve either a minimization task (regularization error, variance) or a maximization task (uncertainty, representation). In *uncertainty sampling*, the goal is to explore the samples on which the model is least confident in its predictions. Quantifying the uncertainty is key, and for probabilistic models [4,5,6] entropy [2,3] is often the measure used. For non-probabilistic models, custom metrics suited to each case have been used. For example, a margin-based active learning strategy was proposed and successfully applied to text classification with Support Vector Machines [7,8], and the level of agreement between neighbors was used as a sampling metric in a k-nearest neighbors algorithm [9].
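As a minimal sketch of entropy-based uncertainty sampling, the snippet below ranks a pool of unlabeled samples by the Shannon entropy of their predicted class probabilities. The function names `entropy` and `most_uncertain` and the toy probabilities in `pool_probs` are illustrative assumptions, not taken from any of the cited works.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def most_uncertain(pool_probs, k=1):
    """Indices of the k pool samples with the highest predictive entropy."""
    ranked = sorted(
        range(len(pool_probs)),
        key=lambda i: entropy(pool_probs[i]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical class probabilities predicted for three unlabeled samples.
pool_probs = [
    [0.98, 0.01, 0.01],  # confident prediction, low entropy
    [0.40, 0.35, 0.25],  # nearly uniform prediction, high entropy
    [0.70, 0.20, 0.10],
]

query = most_uncertain(pool_probs, k=1)  # sample(s) to label next
print(query)
```

In this toy pool the second sample has the flattest predicted distribution, so it is the one selected for labeling.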

For a classification model that estimates a probability, the absolute difference between the prediction and 0.5 can serve as a measure of the uncertainty of that prediction [10]: the smaller the difference, the less confident the model. Maximizing representation is similar in spirit to selecting samples based on uncertainty; however, in this case the entire input space is taken into account, and the unlabeled sample that is most representative of all the unlabeled datapoints is chosen. Mutual information [11,12] between an instance and all other datapoints in the input space, and prediction uncertainty based on the unlabeled data [13], have been used for this task. An unsupervised learning approach has also been successfully employed to select representative samples while taking the distribution of the input into account [14].
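The contrast between the two criteria can be illustrated with a small sketch. The uncertainty score below is a rescaled distance from 0.5. The representativeness score is a simple stand-in (mean Gaussian-kernel similarity to the rest of the pool), assumed here for illustration only; it is not the mutual-information or clustering method of the cited works. The names `binary_uncertainty` and `representativeness` and the toy data are hypothetical.

```python
import math

def binary_uncertainty(p):
    """Score in [0, 1]: 1 when p == 0.5, 0 when p is exactly 0 or 1."""
    return 1.0 - 2.0 * abs(p - 0.5)

def representativeness(points, i):
    """Mean Gaussian-kernel similarity of point i to the other pool points.

    Assumption: a simple density proxy, not the measure used in the references.
    """
    others = [q for j, q in enumerate(points) if j != i]
    sims = [math.exp(-math.dist(points[i], q) ** 2) for q in others]
    return sum(sims) / len(sims)

# Hypothetical unlabeled pool: three clustered points and one outlier.
pool = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
# Hypothetical predicted probabilities of the positive class.
pred = [0.90, 0.55, 0.20, 0.50]

indices = range(len(pool))
least_confident = max(indices, key=lambda i: binary_uncertainty(pred[i]))
most_representative = max(indices, key=lambda i: representativeness(pool, i))
print(least_confident, most_representative)
```

With this toy data the two criteria disagree: the outlier is the sample the model is least confident about, while a point inside the cluster is the most representative of the pool. This is the kind of case in which taking the whole input space into account changes the selection.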

References:

[1]. D. V. Lindley, “On a measure of the information provided by an experiment,” The Annals of Mathematical Statistics, pp. 986–1005, 1956.

[2]. D. V. Lindley, “On a measure of the information provided by an experiment,” The Annals of Mathematical Statistics, pp. 986–1005, 1956.

[3]. M. C. Shewry and H. P. Wynn, “Maximum entropy sampling,” Journal of Applied Statistics, vol. 14, no. 2, pp. 165–170, 1987.

[4]. K. Chaloner and I. Verdinelli, “Bayesian experimental design: A review,” Statistical Science, pp. 273–304, 1995.

[5]. A. Krause, A. Singh, and C. Guestrin, “Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies,” Journal of Machine Learning Research, vol. 9, no. Feb, pp. 235–284, 2008.

[6]. C. Zhang and T. Chen, “An active learning framework for content-based information retrieval,” IEEE Transactions on Multimedia, vol. 4, no. 2, pp. 260–268, 2002.

[7]. S. Tong and D. Koller, “Support vector machine active learning with applications to text classification,” Journal of Machine Learning Research, vol. 2, no. Nov, pp. 45–66, 2001.

[8]. G. Schohn and D. Cohn, “Less is more: Active learning with support vector machines,” in Proceedings of the Seventeenth International Conference on Machine Learning, 2000.

[9]. M. Lindenbaum, S. Markovitch, and D. Rusakov, “Selective sampling for nearest neighbor classifiers,” Machine Learning, vol. 54, no. 2, pp. 125–152, 2004.

[10]. D. D. Lewis and W. A. Gale, “A sequential algorithm for training text classifiers,” in Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 3–12, Springer-Verlag New York, Inc., 1994.

[11]. Z. Xu, K. Yu, V. Tresp, X. Xu, and J. Wang, “Representative sampling for text classification using support vector machines,” Advances in Information Retrieval, pp. 11–11, 2003.

[12]. Y. Guo and R. Greiner, “Optimistic active-learning using mutual information,” in IJCAI, vol. 7, pp. 823–829, 2007.

[13]. S.-J. Huang, R. Jin, and Z.-H. Zhou, “Active learning by querying informative and representative examples,” in Advances in Neural Information Processing Systems, pp. 892–900, 2010.

[14]. H. T. Nguyen and A. Smeulders, “Active learning using pre-clustering,” in Proceedings of the Twenty-First International Conference on Machine Learning, p. 79, ACM, 2004.