제주대학교 Repository

데이터 속성에 따른 초기 클러스터 결정 및 클러스터링

Alternative Title
The initial cluster decision and clustering method for the data attributes
Abstract
Clustering typically groups data into sets such that intra-cluster similarity is maximized while inter-cluster similarity is minimized.
Most previous clustering algorithms focus on numerical data, whose inherent geometric properties can be exploited naturally to define distance functions between data points. However, much of the data in real databases is categorical, and categorical attribute values cannot be naturally ordered the way numerical values can.
The K-means algorithm is well suited to this task because of its efficiency in clustering large data sets. However, because it works only on numeric values, its use in data mining is limited, since data sets in data mining often contain categorical values.
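For reference, the standard K-means loop the abstract refers to can be sketched as follows. This is a minimal illustrative sketch, not the thesis's implementation; the function name and parameters are my own, and a Forgy-style start (k random data points as centers) is assumed.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-means: assign each point to its nearest center,
    then recompute each center as the mean of its members."""
    rng = np.random.default_rng(seed)
    # Forgy-style start: pick k distinct data points as initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Empty clusters keep their old center.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):  # converged
            break
        centers = new_centers
    return labels, centers
```

The mean-based center update is exactly what restricts this loop to numeric data: a mean of categorical values is undefined, which motivates the K-modes and K-prototypes variants discussed below.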
Pena et al. empirically compared four initialization methods for the K-means algorithm: random, Forgy, MacQueen, and Kaufman. Although the algorithm is known for its robustness, it is widely reported in the literature that its performance depends on two key points: the initial clusters and the instance order.
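Two of the four initializations are simple enough to sketch; the MacQueen and Kaufman methods are more involved and omitted here. The function names are illustrative, not from the cited study.

```python
import numpy as np

def forgy_init(X, k, rng):
    # Forgy: choose k distinct data points as the initial centers.
    return X[rng.choice(len(X), size=k, replace=False)]

def random_partition_init(X, k, rng):
    # Random partition: assign every point to a random cluster and use
    # the cluster means as initial centers (an empty cluster falls back
    # to a random data point).
    labels = rng.integers(0, k, size=len(X))
    return np.array([
        X[labels == j].mean(axis=0) if np.any(labels == j)
        else X[rng.integers(len(X))]
        for j in range(k)
    ])
```

Forgy tends to spread initial centers across the data, while random partitioning places them all near the global mean; this difference is one reason the resulting clusterings can diverge.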
Optimal determination of the cluster size affects the clustering result. In the K-means algorithm, clustering performance varies greatly with the initial k, yet in most clustering processes the initial cluster size is determined by prior knowledge or subjective judgment.
Such a subjective determination may not be optimal. Due to the special properties of categorical attributes, clustering categorical data is more complicated than clustering numerical data.
The K-modes algorithm uses a simple matching dissimilarity measure for categorical objects, replaces the means of clusters with modes, and uses a frequency-based method to update the modes during clustering so as to minimize the clustering cost function.
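The two ingredients named here, simple matching dissimilarity and the frequency-based mode, can be sketched directly; this is a small illustrative sketch, with function names of my own choosing.

```python
from collections import Counter

def matching_dissimilarity(a, b):
    """Simple matching: count the attributes on which two
    categorical objects take different values."""
    return sum(x != y for x, y in zip(a, b))

def cluster_mode(objects):
    """Mode of a cluster: per attribute, the most frequent category
    among the cluster's members (the frequency-based update)."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*objects))
```

Replacing the mean with this per-attribute mode is what lets the K-means iteration scheme run on purely categorical data while still decreasing the clustering cost.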
The original K-means clustering algorithm is designed to work primarily on numeric data sets. This prevents the algorithm from being directly applied to categorical data clustering in many data mining applications. Moreover, as with most data clustering algorithms, it requires a pre-set or randomly selected set of initial points (modes) for the clusters, and differences in these initial points often lead to considerably different clustering results.
The K-prototypes algorithm integrates the K-means and K-modes processes to cluster data with mixed numeric and categorical values. The method dynamically updates the k prototypes in order to maximize the intra-cluster similarity of objects. When applied to purely numeric data, the algorithm is identical to K-means.
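The K-prototypes dissimilarity combines the numeric and categorical measures from the two parent algorithms; a common form, sketched below, adds a weight gamma to balance the categorical term against the numeric one (the function name and the particular weighting shown are illustrative).

```python
def prototype_distance(x_num, x_cat, p_num, p_cat, gamma=1.0):
    """K-prototypes-style dissimilarity between an object and a prototype:
    squared Euclidean distance on the numeric part plus gamma times the
    simple-matching mismatch count on the categorical part."""
    num = sum((a - b) ** 2 for a, b in zip(x_num, p_num))
    cat = sum(a != b for a, b in zip(x_cat, p_cat))
    return num + gamma * cat
```

With gamma = 0 the categorical part vanishes and the measure reduces to the K-means distance, matching the statement that the algorithm is identical to K-means on numeric data.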
Ahmad and Dey proposed a new cost function and distance measure based on the co-occurrence of values. Their measures also take into account the significance of each attribute for the clustering process. Their algorithm presents a modified description of the cluster center to overcome the numeric-only limitation of the K-means algorithm and to provide a better characterization of clusters.
In this paper, we seek methods that work well for data with mixed numeric and categorical features. We propose a novel divide-and-conquer technique to solve this problem.
A cluster ensemble combines several runs of different clustering algorithms into a common partition of the original data set, aiming to consolidate the results of a portfolio of individual clusterings.
First, the original mixed data set is divided into two sub-data sets: a pure numeric data set and a pure categorical data set. Next, existing well-established clustering algorithms designed for each type of data are employed to produce the corresponding clusters. Last, the clustering results on the numeric and categorical data sets are combined as a new categorical data set, on which a categorical data clustering algorithm is used to obtain the final clusters.
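The three-step framework can be sketched as a function that takes the two sub-data sets and the two base clusterers as inputs. This is a structural sketch under my own naming; the final step here simply groups identical label pairs, a trivial stand-in for running a real categorical clustering algorithm on the combined label data.

```python
def ensemble_mixed_clustering(numeric_part, categorical_part,
                              cluster_numeric, cluster_categorical):
    """Divide-and-conquer sketch: cluster each sub-data set separately,
    treat the pair of resulting labels for each object as a new
    categorical object, then cluster those pairs for the final partition."""
    num_labels = cluster_numeric(numeric_part)        # step 2a: numeric clusters
    cat_labels = cluster_categorical(categorical_part)  # step 2b: categorical clusters
    combined = list(zip(num_labels, cat_labels))      # step 3: combined categorical data
    # Stand-in final clustering: one cluster per distinct label pair.
    mapping, final = {}, []
    for pair in combined:
        final.append(mapping.setdefault(pair, len(mapping)))
    return final
```

Because the framework only consumes label vectors, any numeric clusterer (e.g. K-means) and any categorical clusterer (e.g. K-modes) can be plugged in unchanged, which is the integration property claimed below.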
The goal of this paper is to provide an algorithmic framework for the mixed-attribute clustering problem into which existing clustering algorithms can be easily integrated, so that the capabilities of different kinds of clustering algorithms and the characteristics of different types of data sets can be fully exploited. Comparisons with other clustering algorithms on real-life data sets demonstrate the effectiveness of the proposed approach.
Author(s)
강형창
Issued Date
2008
Awarded Date
2008. 2
Type
Dissertation
URI
http://dcoll.jejunu.ac.kr/jsp/common/DcLoOrgPer.jsp?sItemId=000000004250
Alternative Author(s)
Kang, Hyung Chang
Affiliation
Graduate School, Jeju National University
Department
Department of Computer Science and Statistics, Graduate School
Advisor
김철수
Table Of Contents
I. Research Background and Objectives = 1
II. Concepts of Clustering = 6
1. Background of Clustering = 6
2. General Process of Clustering = 8
3. Requirements for Clustering = 11
III. K-means-Based Clustering = 13
1. K-means-Based Algorithms for Numeric Data = 13
1) K-means Algorithm = 13
2) K-medoids Algorithm = 15
3) Initial Cluster Decision Methods for the K-means Algorithm = 19
2. K-means-Based Algorithms for Categorical Data = 24
1) ROCK Algorithm = 25
2) K-modes Algorithm = 27
3. K-means-Based Algorithms for Mixed Data = 30
1) K-prototypes Algorithm = 30
2) Algorithm Proposed by Ahmad and Dey = 32
4. Cluster Ensemble = 34
IV. Clustering by Data Attributes = 35
1. Numeric Data Clustering Problems = 35
1) Hierarchical Clustering Problem = 35
2) K-means Algorithm Problem = 35
2. Categorical Data Clustering Problems = 40
1) Domains and Attributes = 40
2) K-modes Algorithm Problem = 40
3. Mixed Data Clustering = 45
1) Mixed Data Clustering Algorithm = 46
2) Modified K-means Algorithm for Initial Cluster Decision = 46
3) K-priority Algorithm for Initial Cluster Decision = 47
4) Modified K-modes Algorithm for Initial Cluster Decision = 48
V. Experimental Results = 51
1. Experimental Environment = 51
2. Experimental Data = 51
3. Experimental Results = 52
1) Evaluation Method = 52
2) Mixed Data Clustering Results = 52
3) Numeric Data Clustering Results = 53
4) Categorical Data Clustering Results = 55
VI. Conclusion = 59
References = 62
Degree
Doctor
Publisher
Graduate School, Jeju National University
Citation
강형창. (2008). 데이터 속성에 따른 초기 클러스터 결정 및 클러스터링
Appears in Collections:
General Graduate School > Computer Science and Statistics
Access and License
  • Access: Open
Items in Repository are protected by copyright, with all rights reserved, unless otherwise indicated.