A Comparative Study of the Effect of Similarity Measures on the K-Means Algorithm in Clustering Arabic Texts Based on Keywords
Keywords:
Text mining, Arabic text clustering, K-means, Euclidean similarity, Cosine similarity
Abstract
Text clustering is an important and effective task in text mining. It aims to divide a large set of texts into subsets called clusters, such that the objects within a cluster are highly similar to one another but dissimilar to objects in other clusters. In this work, we propose a method for clustering Arabic texts using the well-known K-means algorithm. The method includes text analysis as a preliminary step that prepares the texts for the clustering algorithm, which is applied to 100 Arabic texts drawn from four groups (sport, art, crime, and health). Rather than selecting cluster centers randomly, the method uses a database of keywords for each field to select them. Two similarity measures (Euclidean similarity and cosine similarity) are then used to calculate the distances between the centers and the texts when building the clusters. In addition, we evaluate the impact of the two measures on the results of K-means using the F-measure, comparing them across factors such as the number of clusters and the number of groups. Finally, we found that K-means with cosine similarity performs better than K-means with Euclidean similarity.
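To make the idea of keyword-seeded centers concrete, the following is a minimal sketch in Python and not the implementation used in this work. It assumes simple term-frequency vectors over already tokenized texts, and it uses English placeholder tokens in place of Arabic ones. All function names (`vectorize`, `keyword_seeded_kmeans`, and so on) are hypothetical. Initial centers are built from per-group keyword lists rather than chosen at random, and each text is assigned to its nearest center under either cosine similarity or Euclidean distance.

```python
import math
from collections import Counter


def vectorize(tokens, vocab):
    """Term-frequency vector of a token list over a fixed vocabulary (assumed weighting)."""
    counts = Counter(tokens)
    return [float(counts[w]) for w in vocab]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for empty vectors
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign(vectors, centers, measure):
    """Give each vector the index of its closest center under the chosen measure."""
    labels = []
    for v in vectors:
        if measure == "cosine":
            scores = [cosine_similarity(v, c) for c in centers]
            labels.append(max(range(len(centers)), key=scores.__getitem__))
        else:
            dists = [euclidean_distance(v, c) for c in centers]
            labels.append(min(range(len(centers)), key=dists.__getitem__))
    return labels


def update_centers(vectors, labels, centers):
    """Recompute each center as the mean of its members; keep the old center if empty."""
    new_centers = []
    for k, old in enumerate(centers):
        members = [v for v, lab in zip(vectors, labels) if lab == k]
        if members:
            new_centers.append([sum(col) / len(members) for col in zip(*members)])
        else:
            new_centers.append(old)
    return new_centers


def keyword_seeded_kmeans(docs, keyword_lists, measure="cosine", max_iter=20):
    """K-means whose initial centers come from keyword lists, one list per group."""
    vocab = sorted({w for d in docs for w in d} | {w for kws in keyword_lists for w in kws})
    vectors = [vectorize(d, vocab) for d in docs]
    centers = [vectorize(kws, vocab) for kws in keyword_lists]  # seeded, not random
    labels = None
    for _ in range(max_iter):
        new_labels = assign(vectors, centers, measure)
        if new_labels == labels:  # converged: assignments stopped changing
            break
        labels = new_labels
        centers = update_centers(vectors, labels, centers)
    return labels


# Toy usage with placeholder tokens (illustrative data, not the paper's corpus).
docs = [
    ["match", "goal", "team", "goal"],
    ["team", "player", "match"],
    ["doctor", "hospital", "medicine"],
    ["medicine", "patient", "doctor", "doctor"],
]
keywords = [
    ["match", "goal", "team", "player"],          # hypothetical "sport" keywords
    ["doctor", "hospital", "medicine", "patient"],  # hypothetical "health" keywords
]
for measure in ("cosine", "euclidean"):
    print(measure, keyword_seeded_kmeans(docs, keywords, measure))
```

On such a small, cleanly separated toy set both measures produce the same grouping. Differences between them would be expected to show up mainly when texts vary in length, since cosine similarity ignores vector magnitude while Euclidean distance does not.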