Optimization Of The K-Means Algorithm Performance In Determining Centroid Data Using The Agglomerative Hierarchical Cluster Algorithm

Introduction

In the clustering process, the K-Means algorithm starts by choosing an initial center point, known as a centroid, for each cluster. The selection of initial centroids greatly affects the results of the clustering process. Usually the centroids are selected randomly, which can lead to inaccurate data grouping. In this study, the centroids were instead selected by taking the highest data point from each cluster produced by the Agglomerative Hierarchical Clustering (AHC) algorithm. The study aims to compare the accuracy of conventional K-Means with that of K-Means initialized using the AHC method.

Background of the Study

The K-Means algorithm is one of the most widely used clustering methods in data analysis. It has advantages in speed and ease of implementation. However, one of its main weaknesses is its dependence on the selection of initial centroids. If the centroids are chosen carelessly, clustering results can vary significantly from run to run, which can lead to wrong conclusions. By using AHC to determine the initial centroids, clustering results are expected to be more stable and accurate.
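
The sensitivity to starting centroids can be illustrated on a tiny example. The sketch below is not taken from the study: the four-point dataset and the two starting positions are made up for illustration, and it assumes scikit-learn's `KMeans` with an explicit `init` array and `n_init=1` so that each run uses exactly the supplied start.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: four corners of a wide rectangle.
# The natural grouping is left pair vs. right pair.
X = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]])


def sse_for_init(initial_centroids):
    """Run K-Means once from the given starting centroids and return its SSE."""
    model = KMeans(n_clusters=2, init=np.asarray(initial_centroids), n_init=1)
    model.fit(X)
    return float(model.inertia_)  # inertia_ is the within-cluster SSE


# Start near the left and right groups.
good_sse = sse_for_init([[0.0, 0.5], [4.0, 0.5]])
# Start midway between the groups, one centroid low and one high.
bad_sse = sse_for_init([[2.0, 0.0], [2.0, 1.0]])

print(f"left/right start SSE: {good_sse:.2f}")
print(f"bottom/top start SSE: {bad_sse:.2f}")
```

On this data the second start should remain stuck in a bottom/top split with a much higher SSE than the first, which is the kind of run-to-run variation the paragraph above describes.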

Methodology

In this study, the K-Means algorithm was used to cluster the data, and the Agglomerative Hierarchical Clustering (AHC) algorithm was used to determine the initial centroids. The AHC algorithm was used to group the data into clusters, and the highest data point from each cluster was selected as the initial centroid for the K-Means algorithm. The K-Means algorithm was then run with the selected centroids, and the results were compared to the conventional K-Means algorithm.
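
The procedure above can be sketched roughly as follows. This is an illustration under stated assumptions, not the study's implementation: it uses scikit-learn's `AgglomerativeClustering` (default Ward linkage) and `KMeans`, the helper names `ahc_initial_centroids` and `compare_sse` are hypothetical, the data is synthetic, and "highest data point" is interpreted as the point with the largest sum of feature values because the source does not define it more precisely.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans


def ahc_initial_centroids(X, k):
    """Pick one starting centroid per AHC cluster.

    Assumption: "highest data point" is read as the point with the largest
    sum of feature values; the source does not define it precisely.
    """
    labels = AgglomerativeClustering(n_clusters=k).fit_predict(X)
    scores = X.sum(axis=1)
    centroids = []
    for cluster_id in range(k):
        members = np.flatnonzero(labels == cluster_id)
        centroids.append(X[members[np.argmax(scores[members])]])
    return np.array(centroids)


def compare_sse(X, k, seed=0):
    """Return (SSE with random initialization, SSE with AHC-derived initialization)."""
    conventional = KMeans(n_clusters=k, init="random", n_init=1, random_state=seed).fit(X)
    seeded = KMeans(n_clusters=k, init=ahc_initial_centroids(X, k), n_init=1).fit(X)
    return float(conventional.inertia_), float(seeded.inertia_)


# Synthetic toy data (not the study's dataset): three loose groups in 2-D.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 2)) for c in (0.0, 5.0, 10.0)])

random_sse, ahc_sse = compare_sse(X, k=3)
print(f"random init SSE: {random_sse:.2f}, AHC init SSE: {ahc_sse:.2f}")
```

Passing the centroids through `init` with `n_init=1` makes K-Means start exactly from the AHC-derived points, so any difference in SSE between the two runs comes from the initialization alone.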

Results

The Sum of Squared Errors (SSE) results showed that K-Means using centroids from AHC improved significantly on conventional K-Means. Specifically, the improvement in SSE was 4.8% for grouping with 2 clusters and 14.3% for grouping with 3 clusters. Because a lower SSE indicates tighter clusters, this suggests that more careful selection of centroids can produce better groupings and reduce the errors that occur during the clustering process.
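
For readers who want to reproduce this kind of comparison, the sketch below shows one way to compute SSE and a percentage improvement between two runs. The function names are hypothetical, and the sample points and the values passed to `relative_improvement` are made up for illustration; they are not the study's data.

```python
def sse(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    total = 0.0
    for point, label in zip(points, labels):
        total += sum((p - c) ** 2 for p, c in zip(point, centroids[label]))
    return total


def relative_improvement(baseline_sse, new_sse):
    """Percentage by which new_sse is lower than baseline_sse."""
    return 100.0 * (baseline_sse - new_sse) / baseline_sse


# Illustrative values only.
points = [(1.0, 1.0), (1.0, 3.0), (8.0, 8.0)]
labels = [0, 0, 1]
centroids = [(1.0, 2.0), (8.0, 8.0)]

print(sse(points, labels, centroids))
print(relative_improvement(200.0, 150.0))
```

A positive result from `relative_improvement` means the second run has a lower SSE than the baseline.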

Discussion

Integrating AHC into centroid selection for the K-Means algorithm offers a promising approach to increasing accuracy and efficiency in data analysis. This approach not only reduces the variability of clustering results but also increases the reliability of data analysis, which is important in applications such as image processing, market segmentation, and customer behavior analysis. Thus, applying this technique can provide added value for researchers and practitioners in data science and statistical analysis.

Conclusion

In conclusion, using AHC to determine the initial centroids for the K-Means algorithm has been shown to improve the accuracy and efficiency of the clustering process. The results of this study suggest that integrating AHC into centroid selection can provide more stable and accurate clustering results. Therefore, this approach can be used as an alternative to the conventional K-Means algorithm, especially in applications where accuracy and reliability are crucial.

Recommendations

Based on the results of this study, the following recommendations are made:

  • The use of AHC to determine the initial centroids for the K-Means algorithm should be considered as an alternative to the conventional K-Means algorithm.
  • The integration of AHC in the selection of centroids can provide a more stable and accurate clustering result.
  • The use of AHC can be applied in various applications such as image processing, market segmentation, and customer behavior analysis.

Limitations of the Study

This study has several limitations that should be considered:

  • The study only used a small dataset to test the approach.
  • The study only compared the results of the K-Means algorithm with and without the use of AHC.
  • The study did not consider other clustering algorithms that can be used to determine the initial centroids.

Future Research Directions

Based on the results of this study, the following future research directions are suggested:

  • The use of AHC to determine the initial centroids for other clustering algorithms should be explored.
  • AHC-based centroid selection should be evaluated in applications such as image processing, market segmentation, and customer behavior analysis.
  • The use of AHC can be applied in other fields such as medicine, finance, and social sciences.

Frequently Asked Questions

Q: What is the K-Means algorithm and how does it work?

A: The K-Means algorithm is a widely used clustering method in data analysis. It divides the data into K clusters based on the similarity of the data points. The algorithm starts by randomly selecting the initial centroids. It then repeatedly assigns each data point to the closest centroid and recomputes each centroid as the mean of its assigned points, until the assignments stop changing.
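
As a rough, dependency-free sketch of those steps (an illustration, not the study's code), the function below starts from randomly sampled points and alternates assignment and update until the labels stop changing. The function names, the `seed` parameter, and the sample points are assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iter=100, seed=0):
    """Plain K-Means iteration: random start, then assign/update until stable."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # conventional random initialization
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the centroids are stable
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels


# Illustrative points: two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = k_means(points, k=2)
print(centroids, labels)
```

Replacing the `rng.sample` line with centroids chosen some other way is the only change needed to try a different initialization strategy.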

Q: What is the Agglomerative Hierarchical Clustering (AHC) algorithm and how does it work?

A: The AHC algorithm is a hierarchical clustering method that merges the closest clusters at each level. It starts by treating each data point as a separate cluster and then iteratively merges the two closest clusters until only one cluster remains. To obtain a fixed number of clusters, the merging can be stopped once that number is reached.
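
A naive version of that merging loop might look like the sketch below. It assumes single linkage (the distance between two clusters is the distance between their closest members), since the source does not say which linkage the study used, and it stops at `k` clusters rather than merging down to one. The function names and sample points are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def agglomerative(points, k):
    """Merge the two closest clusters repeatedly until k clusters remain.

    Assumption: single linkage; the study's linkage criterion is not stated.
    Returns a list of clusters, each a list of indices into `points`.
    """
    clusters = [[i] for i in range(len(points))]  # every point starts alone
    while len(clusters) > k:
        best = None  # (distance, index of cluster a, index of cluster b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(
                    squared_distance(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])  # merge cluster b into cluster a
        del clusters[b]  # b > a, so index a is unaffected
    return clusters


# Illustrative points: two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
clusters = agglomerative(points, k=2)
print(clusters)
```

This brute-force search recomputes every pairwise distance on each pass, which is fine for a small dataset; library implementations use more efficient bookkeeping.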

Q: How does the AHC algorithm determine the initial centroids for the K-Means algorithm?

A: AHC first groups the data into clusters, and the highest data point from each cluster is then selected as an initial centroid. This approach is intended to make the initial centroids more representative of the data, which can lead to more accurate clustering results.

Q: What are the advantages of using the AHC algorithm to determine the initial centroids for the K-Means algorithm?

A: The use of AHC to determine the initial centroids for the K-Means algorithm has several advantages, including:

  • Improved accuracy and efficiency of the clustering process
  • Reduced variability of clustering results
  • Increased reliability of data analysis

Q: What are the limitations of the study?

A: The study has several limitations, including:

  • The study only used a small dataset to test the approach
  • The study only compared the results of the K-Means algorithm with and without the use of AHC
  • The study did not consider other clustering algorithms that can be used to determine the initial centroids

Q: What are the future research directions?

A: The following future research directions are suggested:

  • The use of AHC to determine the initial centroids for other clustering algorithms
  • Evaluating AHC-based centroid selection in applications such as image processing, market segmentation, and customer behavior analysis
  • The use of AHC can be applied in other fields such as medicine, finance, and social sciences

Q: How can the results of this study be applied in real-world scenarios?

A: The results of this study can be applied in various real-world scenarios, including:

  • Image processing: The use of AHC to determine the initial centroids can improve the accuracy and efficiency of image segmentation algorithms.
  • Market segmentation: The use of AHC to determine the initial centroids can improve the accuracy and efficiency of market segmentation algorithms.
  • Customer behavior analysis: The use of AHC to determine the initial centroids can improve the accuracy and efficiency of customer behavior analysis algorithms.

Q: What are the implications of this study for data analysts and researchers?

A: The results of this study have several implications for data analysts and researchers, including:

  • The use of AHC to determine the initial centroids can improve the accuracy and efficiency of clustering algorithms.
  • The integration of AHC in the selection of centroids can provide a more stable and accurate clustering result.
  • The use of AHC can be applied in various applications such as image processing, market segmentation, and customer behavior analysis.