Project

CS776 Evolutionary Computation Project

A description for my project will be here soon.

Related Work:

Petra Kudov´a "Clustering Genetic Algorithm"

Abstract: In this paper we present and study a clustering technique based on genetic algorithms – Clustering Genetic Algorithm. Performance of the algorithm is demonstrated on experiments. We have shown that it outperforms the k-means algorithm on some tasks. In addition, it is capable of optimising the number of clusters for tasks with well formed and separated clusters.
Marco Painho and Fernando Bação "Using Genetic Algorithms in Clustering Problems"

Abstract: In this paper a Genetic Algorithm (GA) approach to the clustering problem is proposed. The optimisation capabilities of Genetic Algorithms are well known and commonly used in a variety of scientific fields. Openshaw and Openshaw (1997) note that Genetic Algorithms are "an extremely powerful, widely applicable search technique that provides a global search for problems with many local suboptimal". Here the clustering task is tackled through a Genetic Algorithm, which attempts to minimize the within cluster variance. The Genetic Algorithm is tested against the traditional K-Means method, and an unsupervised neural network (Kohonen's self organising map). First, through the use of artificial generated data a particular environment is set up, then a real world example, involving the electoral results in Portugal, is used. Another important aspect is related with the possibility of using GA in geographical clustering problems. It seems obvious that the flexibility offered by GA should encourage researchers to develop new and improve ways of tackle the clustering tasks in Geography.
Luiz Antonio Nogueira Lorena and Jo˜ao Carlos Furtado "Constructive Genetic Algorithm for Clustering Problems"

Abstract: Genetic algorithms (GAs) have recently been accepted as powerful approaches to solving optimization problems. It is also well-accepted that building block construction (schemata formation and conservation) has a positive influence on GA behavior. Schemata are usually indirectly evaluated through a derived structure. We introduce a new approach called the Constructive Genetic Algorithm (CGA), which allows for schemata evaluation and the provision of other new features to the GA. Problems are modeled as bi-objective optimization problems that consider the evaluation of two fitness functions. This double fitness process, called fg-fitness, evaluates schemata and structures in a common basis. Evolution is conducted considering an adaptive rejection threshold that contemplates both objectives and attributes a rank to each individual in population. The population is dynamic in size and composed of schemata and structures. Recombination preserves good schemata, and mutation is applied to structures to get population diversification. The CGA is applied to two clustering problems in graphs. Representation of schemata and structures use a binary digit alphabet and are based on assignment (greedy) heuristics that provide a clearly distinguished representation for the problems. The clustering problems studied are the classical p-median and the capacitated p-median. Good results are shown for problem instances taken from the literature.
Nikhil Bansal, Avrim Blum, and Shuchi Chawla "Correlation Clustering"

Abstract: We consider the following clustering problem: we have a complete graph on n vertices (items), where each edge u v is labeled either or depending on whether u and v have been deemed to be similar or different. The goal is to produce a partition of the vertices (a clustering) that agrees as much as possible with the edge labels. That is, we want a clustering that maximizes the number of edges within clusters, plus the number of edges between clusters (equivalently, minimizes the number of disagreements: the number of edges inside clusters plus the number of edges between clusters). This formulation is motivated from a document clustering problem in which one has a pairwise similarity function f learned from past data, and the goal is to partition the current set of documents in a way that correlates with f as much as possible; it can also be viewed as a kind of “agnostic learning” problem. An interesting feature of this clustering formulation is that one does not need to specify the number of clusters k as a separate parameter, as in measures such as k-median or min-sum or min-max clustering. Instead, in our formulation, the optimal number of clusters could be any value between 1 and n, depending on the edge labels. We look at approximation algorithms for both minimizing disagreements and for maximizing agreements. For minimizing disagreements, we give a constant factor approximation. For maximizing agreements we givea PTAS, building on ideas of Goldreich, Goldwasser and Ron (1998) and de la Vega (1996). We also show how to extend some of these results to graphs with edge labels in [-1 , +1] , andgive some results for the case of random noise.
Moses Charikar, Venkatesan Guruswami, Anthony Wirth "Clustering with Qualitative Information"

Abstract:We consider the problem of clustering a collection of elements based on pairwise judgments of similarity and dissimilarity. Bansal, Blum and Chawla [1] cast the problem thus: given a graph G whose edges are labeled "+" (similar) or "-" (dissimilar), partition the vertices into clusters so that the number of pairs correctly (resp. incorrectly) classified with respect to the input labeling is maximized (resp. minimized). Complete graphs, where the classifier labels every edge, and general graphs, where some edges are not labeled, are both worth studying. We answer several questions left open in [1] and provide a sound overview of clustering with qualitative information. We give a factor 4 approximation for minimization on complete graphs, and a factor O(log n) approximation for general graphs. For the maximization version, a PTAS for complete graphs is shown in [1]; we give a factor 0.7664 approximation for general graphs, noting that a PTAS is unlikely by proving APX-hardness. We also prove the APX-hardness of minimization on complete graphs.
Erik D. Demaine and Nicole Immorlica "Correlation Clustering with Partial Information"

Abstract:We consider the following general correlation-clustering problem [1]: given a graph with real edge weights (both positive and negative), partition the vertices into clusters to minimize the total absolute weight of cut positive edges and uncut negative edges. Thus, large positive weights (representing strong correlations between endpoints) encourage those endpoints to belong to a common cluster; large negative weights encourage the endpoints to belong to different clusters; and weights with small absolute value represent little information. In contrast to most clustering problems, correlation clustering specifies neither the desired number of clusters nor a distance threshold for clustering; both of these parameters are effectively chosen to be the best possible by the problem definition. Correlation clustering was introduced by Bansal, Blum, and Chawla [1], motivated by both document clustering and agnostic learning. They proved NP-hardness and gave constant-factor approximation algorithms for the special case in which the graph is complete (full information) and every edge has weight +1 or -1. We give an O(log n)-approximation algorithm for the general case based on a linear-programming rounding and the “region-growing” technique. We also prove that this linear program has a gap of O(log n), and therefore our approximation is tight under this approach. We also give an O(r3)-approximation algorithm for Kr, r-minor-free graphs. On the other hand, we show that the problem is APX-hard, and any o(log n)-approximation would require improving the best approximation algorithms known for minimum multicut.
Hila Becker "A Survey of Correlation Clustering"

Abstract:The problem of partitioning a set of data points into clusters is found in many applications. Correlation clustering is a clustering technique motivated by the the problem of document clustering, in which given a large corpus of documents such as web pages, we wish to find their optimal partition into clusters. While most commonly used clustering algorithms such as k-means, k-clustering sum and k-center require prior knowledge of the number of clusters that we wish to divide the data into, for the case of classifying web documents, finding the number of clusters is not a trivial task. Correlation Clustering, introduced by Bansal, Blum and Chawla, provides a method for clustering a set of objects into the optimal number of clusters, without specifying that number in advance. In this paper we present two different approximation algorithms for the Correlation Clustering problem. We then discuss some open problems and give our intuition as to how to approach them.
Dotan Emanuel and Amos Fiat "Correlation Clustering – Minimizing Disagreements on Arbitrary Weighted Graphs"

Abstract:We solve several open problems concerning the correlation clustering problem introduced by Bansal, Blum and Chawla [1]. We give an equivalence argument between these problems and the multicut problem. This implies an O(log n) approximation algorithm for minimizing disagreements on weighted and unweighted graphs. The equivalence also implies that these problems are APX-hard and suggests that improving the upper bound to obtain a constant factor approximation is non trivial. We also briefly discuss some seemingly interesting applications of correlation clustering.
Erik D. Demaine, Dotan Emanuel, Amos Fiat, and Nicole Immorlica "Correlation Clustering in General Weighted Graphs"

Abstract:We consider the following general correlation-clustering problem [1] given a graph with real nonnegative edge weights and a <+>/<-> edge labelling, partition the vertices into clusters to minimize the total weight of cut <+> edges and uncut <-> edges. Thus, <+> edges with large weights (representing strong correlations between endpoints) encourage those endpoints to belong to a common cluster while <-> edges with large weights encourage the endpoints to belong to different clusters. In contrast to most clustering problems, correlation clustering specifies neither the desired number of clusters nor a distance threshold for clustering; both of these parameters are effectively chosen to be best possible by the problem definition.Correlation clustering was introduced by Bansal et al. [Correlation clustering, in: Proc. 43rd Annu. IEEE Syrup. on Foundations of Computer Science, Vancouver, Canada, November 2002, pp. 238-250], motivated by both document clustering and agnostic learning. They proved NP-hardness and gave constant-factor approximation algorithms for the special case in which the graph is complete (full information) and every edge has the same weight. We give an O(log n)-approximation algorithm for the general case based on a linear-programming rounding and the "region-growing" technique. We also prove that this linear program has a gap of O(log n), and therefore our approximation is tight under this approach. We also give an O(r3)-approximation algorithm for Kr, r-minor-free graphs. On the other hand, we show that the problem is equivalent to minimum multicut, and therefore APX-hard and difficult to approximate better than T(log n).

Last Updated on Oct 12, 2011.