Twister Tries: Approximate Hierarchical Agglomerative Clustering for Average Distance in Linear Time
Cochez, M., & Mou, H. (2015). Twister Tries: Approximate Hierarchical Agglomerative Clustering for Average Distance in Linear Time. In T. Sellis, S. B. Davidson, & Z. Ives (Eds.), SIGMOD '15 : Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (pp. 505-517). Association for Computing Machinery. https://doi.org/10.1145/2723372.2751521
Date
2015Copyright
© 2015 ACM. This is a final draft version of an article whose final and definitive form has been published by ACM. Published in this repository with the kind permission of the publisher.
Many commonly used data-mining techniques utilized across
research fields perform poorly when used for large data sets.
Sequential agglomerative hierarchical non-overlapping clustering
is one technique for which the algorithms’ scaling
properties prohibit clustering of a large amount of items.
Besides the unfavorable time complexity of O(n
2
), these
algorithms have a space complexity of O(n
2
), which can
be reduced to O(n) if the time complexity is allowed to
rise to O(n
2
log2 n). In this paper, we propose the use of
locality-sensitive hashing combined with a novel data structure
called twister tries to provide an approximate clustering
for average linkage. Our approach requires only linear space.
Furthermore, its time complexity is linear in the number of
items to be clustered, making it feasible to apply it on a
larger scale. We evaluate the approach both analytically
and by applying it to several data sets.
Publisher
Association for Computing MachineryParent publication ISBN
978-1-4503-2758-9Conference
ACM SIGMOD international conference on management of dataIs part of publication
SIGMOD '15 : Proceedings of the 2015 ACM SIGMOD International Conference on Management of DataPublication in research information system
https://converis.jyu.fi/converis/portal/detail/Publication/24741029
Metadata
Show full item recordCollections
License
Related items
Showing items with similar title or keywords.
-
Scalable Hierarchical Clustering : Twister Tries with a Posteriori Trie Elimination
Cochez, Michael; Neri, Ferrante (IEEE, 2015)Exact methods for Agglomerative Hierarchical Clustering (AHC) with average linkage do not scale well when the number of items to be clustered is large. The best known algorithms are characterized by quadratic complexity. ... -
Locality-sensitive hashing for massive string-based ontology matching
Cochez, Michael (IEEE, 2014)This paper reports initial research results related to the use of locality-sensitive hashing (LSH) for string-based matching of big ontologies. Two ways of transforming the matching problem into a LSH problem are proposed ... -
Facile fabrication of flower like self-assembled mesoporous hierarchical microarchitectures of In(OH)3 and In2O3: In(OH)3 micro flowers with electron beam sensitive thin petals
Prakasam, Balasubramaniam Arul; Lahtinen, Manu; Peuronen, Anssi; Muruganandham, Manickavachagam; Sillanpää, Mika (Elsevier S.A.; Chinese Society for Materials Scien, 2016)A template and capping-reagent free facile fabrication method for mesoporous hierarchical microarchitectures of flower-like In(OH)3 particles under benign hydrothermal conditions is reported. Calcination of In(OH)3 to In2O3 ... -
Improving Clustering and Cluster Validation with Missing Data Using Distance Estimation Methods
Niemelä, Marko; Kärkkäinen, Tommi (Springer, 2022)Missing data introduces a challenge in the field of unsupervised learning. In clustering, when the form and the number of clusters are to be determined, one needs to deal with the missing values both in the clustering ... -
A hierarchical cluster analysis to determine whether injured runners exhibit similar kinematic gait patterns
Jauhiainen, Susanne; Pohl, Andrew J.; Äyrämö, Sami; Kauppi, Jukka-Pekka; Ferber, Reed (Wiley-Blackwell, 2020)Previous studies have suggested that runners can be subgrouped based on homogeneous gait patterns, however, no previous study has assessed the presence of such subgroups in a population of individuals across a wide variety ...