Suffix tree clustering
Suffix tree clustering (STC) is a clustering method that uses a suffix tree to find phrases shared among documents. The tree indexes the word sequences (n-grams) that occur in the documents, and documents that share a phrase are grouped into the same cluster; longer shared phrases are stronger evidence that the documents belong together. New documents can be added incrementally, in time roughly linear in their length, without reprocessing the documents already inserted. STC can produce many clusters, and a document can belong to more than one of them. The clustering can be carried out either by splitting clusters (decompositional) or by merging them (agglomerative), depending on the data. A drawback is that time and memory costs grow with the number of documents and clusters, so very large data sets can be expensive to handle.
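The idea can be illustrated with a small Python sketch. It is not a reference implementation: it uses a naive word-level suffix trie instead of a compact, linear-time suffix tree, and the names Node, insert_document, base_clusters and merge_clusters, along with the parameters max_len, min_docs and threshold, are hypothetical choices made for this example. The merge rule (join two phrase clusters when they share more than half of each other's documents) is likewise an assumption made for illustration.

```python
from itertools import combinations


class Node:
    """One node of a naive word-level suffix trie (illustrative only)."""

    def __init__(self):
        self.children = {}  # word -> Node
        self.docs = set()   # ids of documents containing the phrase ending here


def insert_document(root, doc_id, text, max_len=4):
    """Add every suffix of the document, truncated to max_len words."""
    words = text.lower().split()
    for start in range(len(words)):
        node = root
        for word in words[start:start + max_len]:
            node = node.children.setdefault(word, Node())
            node.docs.add(doc_id)


def base_clusters(root, min_docs=2):
    """Return {phrase: document ids} for phrases shared by >= min_docs documents."""
    found = {}
    stack = [((), root)]
    while stack:
        phrase, node = stack.pop()
        if phrase and len(node.docs) >= min_docs:
            found[" ".join(phrase)] = frozenset(node.docs)
        for word, child in node.children.items():
            stack.append((phrase + (word,), child))
    return found


def merge_clusters(base, threshold=0.5):
    """Agglomerative step: join base clusters whose document sets overlap heavily."""
    phrases = list(base)
    parent = {p: p for p in phrases}

    def find(p):
        # Union-find lookup with path halving.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in combinations(phrases, 2):
        shared = len(base[a] & base[b])
        # Assumed rule: the overlap must exceed the threshold on both sides.
        if shared > threshold * len(base[a]) and shared > threshold * len(base[b]):
            parent[find(a)] = find(b)

    merged = {}
    for p in phrases:
        merged.setdefault(find(p), set()).update(base[p])
    return sorted(sorted(docs) for docs in merged.values())


root = Node()
docs = {
    0: "cat ate cheese",
    1: "mouse ate cheese too",
    2: "cat ate mouse too",
    3: "dog chased ball",
    4: "dog chased stick",
}
for doc_id, text in docs.items():  # documents are inserted one at a time
    insert_document(root, doc_id, text)

clusters = merge_clusters(base_clusters(root))
print(clusters)  # [[0, 1, 2], [3, 4]]
```

In this toy run the first three documents end up together because they share phrases such as "ate" and "ate cheese", while the two "dog chased" documents form a separate cluster. Inserting a sixth document would only add its own suffixes to the trie; the earlier documents would not need to be read again.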
This page was last edited on 3 February 2026, at 10:03 (CET).