Tf–idf

Content sourced from Wikipedia, licensed under CC BY-SA 3.0.

tf–idf stands for term frequency–inverse document frequency. It measures how important a word is to a document, taking into account how common the word is across a whole collection of documents.

Key ideas
- Bag of words: A document is thought of as a collection of words, ignoring word order.
- Term frequency (tf): How often a word appears in the document. More occurrences mean higher tf.
- Inverse document frequency (idf): How rare the word is across all documents. If a word appears in many documents, its idf is low; if it appears in few documents, its idf is high.
- tf–idf: Multiply tf by idf. A high score means the word is frequent in the document but uncommon in the whole collection (see the sketch after this list).
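
As a concrete illustration, here is a minimal Python sketch of this computation. It uses raw counts for tf and log(N / df) for idf, which is one common choice among many weighting variants; the function name and the tiny corpus are made up for the example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Score each word in each document by tf-idf.

    tf is the raw count of the word in the document; idf is
    log(N / df), where N is the number of documents and df is
    the number of documents containing the word.
    """
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]

    # df: in how many documents does each word appear?
    df = Counter()
    for words in tokenized:
        df.update(set(words))

    # tf-idf per document: count(word) * log(N / df(word))
    return [
        {word: count * math.log(n / df[word])
         for word, count in Counter(words).items()}
        for words in tokenized
    ]

docs = [
    "romeo loves juliet",
    "the play is a good play",
    "a good story about romeo",
]
for scores in tf_idf(docs):
    print(scores)
```

Note that a word appearing in every document gets idf = log(N / N) = 0, so it drops out of the score entirely.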

Why it’s useful
- Helps rank documents by relevance to a query in search engines.
- Used in text mining, information retrieval, and user modeling.
- Also used in search engine optimization to analyze which words are important on a page.

Intuition with an example
- A word like “Romeo” appears in only a few plays, so it has a high idf and is informative for identifying the play.
- A word like “good” appears in almost every play, so it has a low idf and carries little information about which play it is; the quick calculation after this list makes the contrast concrete.
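
To put rough numbers on this, assume a corpus of 37 plays (about the size of Shakespeare's canon, used here purely for illustration) and the log(N / df) idf from the sketch above:

```python
import math

n = 37                      # assumed number of plays in the corpus
print(math.log(n / 1))      # "Romeo" in 1 play:  idf ≈ 3.61 (very informative)
print(math.log(n / 36))     # "good" in 36 plays: idf ≈ 0.03 (nearly useless)
```

The rare word's idf is two orders of magnitude larger, so even a modest tf for “Romeo” outweighs a high tf for “good”.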

History and variations
- The idea of idf as “term specificity” was introduced by Karen Spärck Jones in 1972.
- Variants and extensions include:
  - TF-PDF: weights terms using differences in their frequency across domains.
  - TF-IDuF: computes idf from a user's personal document collection.
  - Delta TF-IDF: weights terms by how differently they occur across classes (for example, positive versus negative sentiment).

Notes
- tf–idf is a heuristic that works well in practice, but its theoretical justification has proven harder to pin down.
- It remains a simple and powerful tool for measuring term importance in text.

