Google Tech Talks
May 24, 2007
A surge of recent research in machine learning and statistics has developed new techniques for finding patterns of words in document collections using hierarchical probabilistic models. These models are called “topic models” because the word patterns often reflect the underlying topics that combine to form the documents; however, topic models also apply naturally to other data such as images and biological sequences.
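As a rough illustration of the idea, the generative process behind a standard topic model (such as latent Dirichlet allocation) can be sketched as follows. All vocabulary words, sizes, and hyperparameter values below are illustrative choices, not taken from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and sizes (illustrative only).
vocab = ["gene", "cell", "protein", "model", "data", "inference"]
n_topics, n_docs, doc_len = 2, 3, 20
alpha, eta = 0.5, 0.5  # assumed Dirichlet hyperparameters

# Each topic is a distribution over the vocabulary.
topics = rng.dirichlet(np.full(len(vocab), eta), size=n_topics)

docs = []
for _ in range(n_docs):
    # Each document mixes the shared topics in its own proportions.
    theta = rng.dirichlet(np.full(n_topics, alpha))
    words = []
    for _ in range(doc_len):
        z = rng.choice(n_topics, p=theta)        # choose a topic for this word
        w = rng.choice(len(vocab), p=topics[z])  # choose a word from that topic
        words.append(vocab[w])
    docs.append(words)
```

Fitting a topic model inverts this process: given only the documents, it infers the topics and per-document proportions.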
While previous topic models have assumed that the corpus is static, many document collections actually change over time: scientific articles, emails, and search queries reflect evolving content, and it is important to model the corresponding evolution of the underlying topics. For example, an article about biology in 1885 will exhibit significantly different word frequencies than one in 2005. After reviewing the basics of topic models, I will describe probabilistic models designed to capture the dynamics of topics as they evolve over time.
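One way to capture this kind of drift, in the spirit of the dynamic models described here, is to let a topic's parameters evolve over time slices via a Gaussian random walk in an unconstrained space, mapping back to a word distribution at each slice. This is a minimal sketch under assumed sizes and drift scale, not the exact model from the talk:

```python
import numpy as np

def softmax(x):
    """Map unconstrained natural parameters to a probability distribution."""
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(1)
vocab_size, n_slices, sigma = 8, 5, 0.5  # illustrative sizes and drift scale

# Natural parameters for one topic drift over time:
# beta_t ~ Normal(beta_{t-1}, sigma^2 * I)
beta = np.zeros(vocab_size)
word_dists = []
for t in range(n_slices):
    beta = beta + sigma * rng.standard_normal(vocab_size)
    word_dists.append(softmax(beta))  # the topic's word distribution at time t
```

Because each slice's parameters stay close to the previous slice's, the topic's vocabulary changes smoothly rather than being re-estimated independently at each time step.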
In addition to giving quantitative, predictive models of a corpus, topic models provide a qualitative window into the structure of a large document collection, allowing a user to explore a corpus in a topic-guided fashion. We demonstrate the capabilities of the dynamic topic model on the archives of the journal Science, founded in 1880 by Thomas Edison. Our models are built on noisy text from JSTOR, an online scholarly journal archive, produced by running an optical character recognition engine over the original bound journals.
(joint work with J. Lafferty)
Speaker: David Blei, Princeton University