Hierarchical clustering of products using market-basket data


  • Ondrej Sokol, University of Economics, Prague




Keywords: product clustering, market basket data, hierarchical clustering, retail


The goal of this paper is to present a new method for clustering products based solely on market-basket data from a retail store. The approach computes a dissimilarity matrix in a special way and then applies Ward’s hierarchical clustering method to it. The underlying similarity measure is derived from the co-occurrence of products in the same basket. Two products are considered similar if they have similar co-occurring products and, at the same time, rarely appear together in the same basket. Hence, the method requires neither customer identification nor data from a fixed time frame, which is an advantage over commonly used methods. The method is reasonably fast even on a large dataset with tens of millions of rows. The results are promising and easy to interpret.
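
The idea in the abstract can be illustrated with a small sketch. The abstract does not give the exact dissimilarity formula, so the one below is an assumption made purely for illustration: cosine distance between the co-occurrence profiles of two products, plus a penalty proportional to how often the two products share a basket. The function names, the `penalty` parameter, and the toy baskets are all hypothetical. Ward's criterion is applied through the Lance-Williams update on squared dissimilarities, which is a heuristic when the dissimilarity is not Euclidean.

```python
from itertools import combinations
from math import sqrt

def cooccurrence(baskets):
    """Count, for every pair of products, the baskets that contain both."""
    products = sorted(set().union(*baskets))
    index = {p: k for k, p in enumerate(products)}
    counts = [[0] * len(products) for _ in products]
    for basket in baskets:
        for a, b in combinations(sorted(basket), 2):
            counts[index[a]][index[b]] += 1
            counts[index[b]][index[a]] += 1
    return products, counts

def dissimilarity(counts, penalty=1.0):
    """Assumed formula (not the paper's): profile cosine distance
    plus a penalty for appearing in the same basket."""
    n = len(counts)
    d = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        # Compare co-occurrence with all *other* products only.
        others = [k for k in range(n) if k not in (i, j)]
        u = [counts[i][k] for k in others]
        v = [counts[j][k] for k in others]
        norm = sqrt(sum(x * x for x in u)) * sqrt(sum(x * x for x in v))
        cosine = sum(x * y for x, y in zip(u, v)) / norm if norm else 0.0
        # Share of the rarer product's co-occurrences spent on this pair.
        together = counts[i][j] / max(1, min(sum(counts[i]), sum(counts[j])))
        d[i][j] = d[j][i] = (1.0 - cosine) + penalty * together
    return d

def ward_clusters(d, k):
    """Merge clusters with the Lance-Williams update for Ward's
    criterion until k clusters remain."""
    clusters = {i: [i] for i in range(len(d))}
    sq = {(i, j): d[i][j] ** 2 for i, j in combinations(range(len(d)), 2)}
    next_id = len(d)
    while len(clusters) > k:
        (a, b), d_ab = min(sq.items(), key=lambda item: item[1])
        na, nb = len(clusters[a]), len(clusters[b])
        merged = clusters.pop(a) + clusters.pop(b)
        for c in clusters:
            nc = len(clusters[c])
            d_ac = sq.pop((min(a, c), max(a, c)))
            d_bc = sq.pop((min(b, c), max(b, c)))
            sq[(c, next_id)] = (
                (na + nc) * d_ac + (nb + nc) * d_bc - nc * d_ab
            ) / (na + nb + nc)
        del sq[(a, b)]
        clusters[next_id] = merged
        next_id += 1
    return list(clusters.values())

def cluster_products(baskets, k):
    """Return k clusters of product names, sorted for stable output."""
    products, counts = cooccurrence(baskets)
    groups = ward_clusters(dissimilarity(counts), k)
    return sorted(sorted(products[i] for i in g) for g in groups)

# Toy data: no customer identifiers or timestamps, only basket contents.
baskets = [
    {"bread", "butter", "ham"}, {"rolls", "butter", "ham"},
    {"bread", "butter"}, {"rolls", "ham"},
    {"cola", "chips", "peanuts"}, {"lemonade", "chips", "peanuts"},
    {"cola", "chips"}, {"lemonade", "peanuts"},
]
print(cluster_products(baskets, 2))
# [['bread', 'butter', 'ham', 'rolls'], ['chips', 'cola', 'lemonade', 'peanuts']]
```

In this toy data, bread and rolls never share a basket but are both bought with butter and ham, so under the assumed formula they end up closer to each other than bread is to butter, which often sits in the same basket as bread. Cutting the tree at four clusters pairs bread with rolls and cola with lemonade.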




Borin, N., Farris, P. W., & Freeland, J. R. (1994). A Model for Determining Retail Product Category Assortment and Shelf Space Allocation. Decision Sciences, 25(3), 359–384. https://doi.org/10.1111/j.1540-5915.1994.tb00809.x

Cha, S. H. (2007). Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions. International Journal of Mathematical Models and Methods in Applied Sciences, 1(4), 300–307.

Gruca, T. S., & Klemz, B. R. (2003). Optimal new product positioning: A genetic algorithm approach. European Journal of Operational Research, 146(3), 621–633. https://doi.org/10.1016/S0377-2217(02)00349-1

Holý, V., Sokol, O., & Černý, M. (2017). Clustering Retail Products Based on Customer Behaviour. Applied Soft Computing. https://doi.org/10.1016/j.asoc.2017.02.004

Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: a review. ACM Computing Surveys, 31(3), 264–323.

Lallich, S., Teytaud, O., & Prudhomme, E. (2007). Association rule interestingness: Measure and statistical validation. In Quality Measures in Data Mining (pp. 251–275). Springer, Berlin, Heidelberg.

Leeflang, P. S. H., Parreño Selva, J., Van Dijk, A., & Wittink, D. R. (2008). Decomposing the sales promotion bump accounting for cross-category effects. International Journal of Research in Marketing, 25(3), 201–214. https://doi.org/10.1016/j.ijresmar.2008.03.003

Russell, G. J., & Petersen, A. (2000). Analysis of cross category dependence in market basket selection. Journal of Retailing, 76(3), 367–392. https://doi.org/10.1016/S0022-4359(00)00030-0

Srivastava, R. K., Leone, R. P., & Shocker, A. D. (1981). Market Structure Analysis: Hierarchical Clustering of Products Based on Substitution-in-Use. Journal of Marketing, 45(3), 38. https://doi.org/10.2307/1251540

Ward, J. H., Jr. (1963). Hierarchical Grouping to Optimize an Objective Function. Journal of the American Statistical Association, 58, 236–244.

Zhang, Y., (Roger) Jiao, J., & Ma, Y. (2007). Market segmentation for product family positioning based on fuzzy clustering. Journal of Engineering Design, 18(3), 227–241. https://doi.org/10.1080/09544820600752781




How to Cite

Sokol, O. (2020). Hierarchical clustering of products using market-basket data. International Conference on Advances in Business and Law (ICABL), 3(1), 88-93. https://doi.org/10.30585/icabl-cp.v3i1.488