Keywords
Clustering, Data mining, Vectorization, Big data
Document Type
Article
Abstract
The exponential growth of data stored in databases and data warehouses has intensified the demand for advanced analytical tools capable of efficiently extracting actionable insights. Data mining has emerged as a critical methodology for transforming vast datasets into knowledge. However, traditional techniques often face scalability challenges, particularly due to prolonged computational times, which hinder their effectiveness in large-scale applications. This research aims to improve the K-prototypes clustering algorithm to optimize computational efficiency and enhance clustering precision, enabling more effective knowledge discovery. Key modifications involve implementing vectorization techniques to accelerate processing, redesigning the initialization strategy for cluster centers, and refining the convergence conditions. The upgraded algorithm was evaluated using the widely referenced Adult Dataset through three experimental frameworks: a performance assessment of runtime and accuracy, an investigation into the ideal weighting balance for categorical variables, and an evaluation of convergence behavior. Empirical results indicated a drastic reduction in processing time from 20 minutes to 4 seconds and a 4% increase in clustering accuracy. Additionally, the study found that the optimized algorithm achieves convergence within merely four iterations. These enhancements collectively demonstrate substantial improvements in both operational efficiency and analytical effectiveness for mixed-data clustering tasks.
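The abstract does not give implementation details, so the following is only a minimal sketch of what a vectorized K-prototypes assignment step might look like. It assumes NumPy, integer-coded categorical features, and the standard K-prototypes cost (squared Euclidean distance on numeric features plus a weight times the number of categorical mismatches). The function name `assign_clusters`, its parameter names, and the weight `gamma` are illustrative assumptions, not the author's code.

```python
import numpy as np


def assign_clusters(X_num, X_cat, centers_num, centers_cat, gamma):
    """Assign each record to its nearest prototype without Python loops.

    Hypothetical helper (not from the article).
    X_num:       (n, p) float array of numeric features
    X_cat:       (n, q) integer-coded categorical features
    centers_num: (k, p) numeric part of the prototypes
    centers_cat: (k, q) categorical part of the prototypes
    gamma:       weight given to the categorical mismatch count
    """
    # Broadcast to (n, k, p) and sum squared differences -> (n, k)
    num_dist = ((X_num[:, None, :] - centers_num[None, :, :]) ** 2).sum(axis=2)
    # Broadcast to (n, k, q) and count mismatched categories -> (n, k)
    cat_dist = (X_cat[:, None, :] != centers_cat[None, :, :]).sum(axis=2)
    # Combined mixed-data cost of each record against each prototype
    cost = num_dist + gamma * cat_dist
    labels = cost.argmin(axis=1)
    total_cost = cost.min(axis=1).sum()
    return labels, total_cost


# Toy data (made up for illustration): two obvious groups.
X_num = np.array([[0.0], [0.1], [5.0], [5.2]])
X_cat = np.array([[0], [0], [1], [1]])
centers_num = np.array([[0.0], [5.0]])
centers_cat = np.array([[0], [1]])

labels, total_cost = assign_clusters(X_num, X_cat, centers_num, centers_cat, gamma=0.5)
print(labels, total_cost)
```

Computing all record-to-prototype costs in one broadcast operation replaces the nested per-record, per-cluster loops of a naive implementation, which is the kind of change that typically accounts for large runtime reductions. The weight `gamma` corresponds to the categorical weighting balance that the study investigates; its value here is arbitrary.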
Recommended Citation
Alkourdy, Faten Fagr (2025) "Optimizing K-Prototype or Scalable Mixed-Data Clustering: A Vectorized Approach to Accelerate Convergence and Accuracy in Big Data," Al-Farahidi Expert Systems Journal: Vol. 1: Iss. 1, Article 8.
Available at: https://fesj.uoalfarahidi.edu.iq/journal/vol1/iss1/8