Thesis (Selection of subject)Thesis (Selection of subject)(version: 392)
Thesis details
   Login via CAS
Separation of Overlapping Clusters in High Dimensional Spaces
Thesis title in Czech: Separation of Overlapping Clusters in High Dimensional
Spaces
Thesis title in English: Separation of Overlapping Clusters in High Dimensional Spaces
Academic year of topic announcement: 2006/2007
Thesis type: diploma thesis
Thesis language:
Department: Department of Software Engineering (32-KSI)
Supervisor: RNDr. Leo Galamboš, Ph.D.
Author: hidden - assigned and confirmed by the Study Dept.
Date of registration: 26.10.2006
Date of assignment: 26.10.2006
Guidelines
- Compute a histogram (or any other density estimator) for high dimensional data
- Intersect the histogram with high dimensional planes
- Use the intersection sets to describe the shape of the clusters and identify the intersection clusters
References
D.A. Keim and A. Hinneburg, "Clustering Techniques for Large Data Sets: From the Past to the Future," Proc. Int'l Conf. Knowledge Discovery and Data Mining, ACM Press, 1999, pp. 141-181.

A. Hinneburg and D.A. Keim, "Optimal Grid-Clustering: Towards Breaking the Curse of Dimensionality in High-Dimensional Clustering," Proc. 25th Int'l Conf. Very Large Databases, Morgan Kaufmann, 1999, pp. 506-517.

R.T. Ng and J. Han, "Efficient and Effective Clustering Methods for Spatial Data Mining," Proc. 20th Int'l Conf. Very Large Databases, Morgan Kaufmann, 1994, pp. 144-155.

M. Ester et al., "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise," Proc. 2nd Int'l Conf. Knowledge Discovery and Data Mining, AAAI Press, 1996, pp. 226-231.

X. Xu et al., "A Distribution-Based Clustering Algorithm for Mining in Large Spatial Databases," Proc. 14th Int'l Conf. Data Eng., IEEE CS Press, 1998, pp. 324-331.

E. Schikuta, "Grid-Clustering: An Efficient Hierarchical Clustering Method for Very Large Data Sets," Proc. 13th Int'l Conf. Pattern Recognition, IEEE CS Press, 1996, pp. 101-105.

T. Zhang, R. Ramakrishnan, and M. Livny, "Birch: An Efficient Data Clustering Method for Very Large Databases," Proc. ACM SIGMOD Int'l Conf. Management of Data, ACM Press, 1996, pp. 103-114.

W. Wang, J. Yang, and R.R. Muntz, "Sting: A Statistical Information Grid Approach to Spatial Data Mining," Proc. 23rd Int'l Conf. Very Large Databases, Morgan Kaufmann, 1997, pp. 186-195.

A. Hinneburg and D.A. Keim, "An Efficient Approach to Clustering in Large Multimedia Databases with Noise," Proc. 4th Int'l Conf. Knowledge Discovery and Data Mining, AAAI Press, 1998, pp. 58-65.

A. Hinneburg and D. Keim and M. Wawryniuk: Using Projections to Visually Cluster High-Dimensional Data. Computing in Science and Engineering, Vol 5, Num 2, pg 14-25, ISSN 1521-9615, 2003.

A. Hinneburg and D.A. Keim, "Clustering Methods for Large Databases: From the Past to the Future," Proc. ACM SIGMOD Int'l Conf. Management of Data, ACM Press, 1999, p. 509.

R. Aggrawal et al., "Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications," Proc. ACM SIGMOD Int'l Conf. Management of Data, ACM Press, 1998, pp. 94-105.

C.C. Aggarwal et al., "Fast Algorithms for Projected Clustering," Proc. ACM SIGMOD Int'l Conf. Management of Data, ACM Press, 1999, pp. 61-72.

C.C. Aggarwal and P.S. Yu, "Finding Generalized Projected Clusters in High Dimensional Spaces," Proc. ACM SIGMOD Int'l Conf. Management of Data, ACM Press, 2000, pp. 70-81.

C.M. Procopiuc et al., "A Monte Carlo Algorithm for Fast Projective Clustering," Proc. ACM SIGMOD Int'l Conf. Management of Data, ACM Press, 2002, pp. 418-427.

M.O. Ward, "XmdvTool: Integrating Multiple Methods for Visualizing Multivariate Data," Proc. Visualization '94, IEEE CS Press, 1994, pp. 326-336.

A. Buja, D.F. Swayne, and D. Cook, "Interactive High-Dimensional Data Visualization," J. Computational and Graphical Statistics, vol. 5, no. 1, 1996, pp. 78-99.

A. Inselberg and B. Dimsdale, "Parallel Coordinates: A Tool for Visualizing Multi-Dimensional Geometry," Proc. Visualization '90, IEEE CS Press, 1990, pp. 361-370.

D. Keim and M. Ward, "Visual Data Mining Techniques," to be published in Intelligent Data Analysis: An Introduction, M. Bertold and D. Hand, eds., Springer, 2003.

A. Hinneburg, M. Wawryniuk, and D.A. Keim, "HD-Eye: Visual Mining of High-Dimensional Data," IEEE Computer Graphics&Applications, vol. 19, no. 5, 1999, pp. 22-31.

B.W. Silverman, Density Estimation for Statistics and Data Analysis, Chapman&Hall, 1986.

M.P. Wand and M.C. Jones, Kernel Smoothing, Chapman&Hall, 1995.

A. Hinneburg and D. Keim, "A General Approach to Clustering in Large Databases with Noise," to be published in J. Knowledge and Information Systems, Springer, 2002.

F. Aurenhammer, "Voronoi Diagrams: A Survey of a Fundamental Geometric Data Structure," ACM Computing Surveys, vol. 23, no. 3, 1991, pp. 345-405.

R. Klein, Concrete and Abstract Voronoi Diagrams, Springer, 1989.

U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, "From Data Mining to Knowledge Discovery in Databases," AI Magazine, vol. 17, no. 3, 1996, pp. 37-54.
Preliminary scope of work
Current clustering techniques group database observations according to their distance or similarity. This works well for non-overlapping clusters, but yields poor results if the clusters overlap. MSc project proposes a new clustering technique that separates overlapping clusters.

We will use the density function (histograms) of the data to identify cores, i.e., areas of the same shape. Cores will be computed in
two steps: first we intersect the density function with high dimensional planes, and second, we use the intersection to determine and
use the shape of the clusters to separate them.
Preliminary scope of work in English
Current clustering techniques group database observations according to their distance or similarity. This works well for non-overlapping clusters, but yields poor results if the clusters overlap. MSc project proposes a new clustering technique that separates overlapping clusters.

We will use the density function (histograms) of the data to identify cores, i.e., areas of the same shape. Cores will be computed in
two steps: first we intersect the density function with high dimensional planes, and second, we use the intersection to determine and
use the shape of the clusters to separate them.
 
Charles University | Information system of Charles University | http://www.cuni.cz/UKEN-329.html