docetl

Powering complex document processing pipelines



US Presidential Debate Analysis

Powered by DocETL: a declarative system for LLM-powered data processing pipelines. Define pipelines in YAML, optimize automatically, and seamlessly integrate LLM and non-LLM operations.

This pipeline analyzes themes in US presidential debates dating back to 1960, summarizing the evolution of the viewpoints of Democrats and Republicans for each theme. It cost us $0.29 to run (and $0.86 to optimize).

The combined debate transcripts span 738,094 words, making them difficult to analyze in a single prompt. For example, when given the entire dataset in a single prompt, Gemini-1.5-Pro-002 (released September 24, 2024) reports on the evolution of only five themes across all the documents.

Map Each Transcript to Themes (LLM): covers 339 distinct themes

Unnest Themes (Non-LLM)

Deduplicate and Merge Themes (LLM): reduces number of themes by 55%

Summarize Themes Over Time (LLM): generates 152 reports averaging 730 words each
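To make the data flow through the four steps above concrete, here is a minimal Python sketch. It is not DocETL code and does not use the DocETL API: the function names (`map_themes`, `unnest`, `resolve`, `reduce_by_theme`), the two toy transcripts, and the alias table are all hypothetical, and simple string matching stands in for each LLM call.

```python
from collections import defaultdict

# Toy input (made-up data): one record per debate transcript.
debates = [
    {"year": 1960, "text": "We discussed taxes and health care."},
    {"year": 2020, "text": "The candidates argued about tax policy and health care."},
]

def map_themes(debate):
    """Stand-in for the LLM map step: attach a list of themes to one transcript."""
    text = debate["text"].lower()
    themes = []
    if "taxes" in text:
        themes.append("Taxes")
    if "tax policy" in text:
        themes.append("Tax Policy")
    if "health care" in text:
        themes.append("Health Care")
    return {**debate, "themes": themes}

def unnest(rows, key):
    """Non-LLM step: emit one row per element of the list stored under `key`."""
    out = []
    for row in rows:
        base = {k: v for k, v in row.items() if k != key}
        for item in row[key]:
            out.append({**base, "theme": item})
    return out

def resolve(rows):
    """Stand-in for the LLM deduplicate-and-merge step: map aliases to one name."""
    canonical = {"Tax Policy": "Taxes"}  # hypothetical alias table
    return [{**row, "theme": canonical.get(row["theme"], row["theme"])} for row in rows]

def reduce_by_theme(rows):
    """Stand-in for the LLM summarize step: one 'report' per theme across years."""
    years_by_theme = defaultdict(list)
    for row in rows:
        years_by_theme[row["theme"]].append(row["year"])
    return {
        theme: f"{theme}: discussed in {sorted(years)}"
        for theme, years in years_by_theme.items()
    }

mapped = [map_themes(d) for d in debates]
unnested = unnest(mapped, "themes")
resolved = resolve(unnested)
reports = reduce_by_theme(resolved)

for report in reports.values():
    print(report)
```

In this toy run, the map step yields three distinct theme names, the merge step folds "Tax Policy" into "Taxes", and the final step produces one report per remaining theme. The real pipeline applies the same shape of computation with LLM prompts in place of the string matching.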

DocETL generates comprehensive reports for 152 distinct themes, each analyzing the evolution of Democratic and Republican viewpoints over time. You can explore the reports by selecting a theme from the dropdown menu.
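The figures quoted on this page can be cross-checked with a little arithmetic. The sketch below assumes that the 152 reports correspond one-to-one to the themes remaining after deduplication, which the page implies but does not state outright; all variable names are illustrative.

```python
# Figures quoted on this page.
themes_after_map = 339      # distinct themes found by the map step
final_reports = 152         # reports generated by the final step
avg_words_per_report = 730
input_words = 738_094
run_cost, optimize_cost = 0.29, 0.86

# Assumption: one report per theme that survives deduplication.
reduction_pct = (1 - final_reports / themes_after_map) * 100
total_report_words = final_reports * avg_words_per_report
output_to_input_ratio = total_report_words / input_words
total_cost = run_cost + optimize_cost

print(f"Theme reduction: {reduction_pct:.1f}%")
print(f"Approximate total report length: {total_report_words:,} words")
print(f"Report words per input word: {output_to_input_ratio:.2f}")
print(f"Total cost (run + optimize): ${total_cost:.2f}")
```

Under that assumption, going from 339 themes to 152 is consistent with the stated 55% reduction, and the reports together come to roughly 15% of the length of the original transcripts.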
