Stanford researcher releases SEFD dataset for machine-readable SEC filings
The 152 billion token dataset reconstructs decades of EDGAR filings in a format that actually preserves the financial details AI models need
If you’ve ever tried to extract useful data from SEC filings, you know the experience sits somewhere between reading hieroglyphics and assembling IKEA furniture without the manual. The documents are dense, inconsistently formatted, and designed for human lawyers, not machine learning models.
A team from Stanford’s Advanced Financial Technologies Lab just dropped something that could change that. The Stanford EDGAR Filings Dataset, or SEFD, is a massive reconstruction of US SEC EDGAR filings spanning from 1994 to the present, reformatted into a layout-faithful MultiMarkdown style that machines can actually parse without losing the financial meaning buried in the structure.
What makes this dataset different
The initial public snapshot contains 152 billion tokens covering filings from January 2022 to June 2025. The full dataset, when complete, is estimated to reach roughly 550 billion tokens drawn from approximately 18.5 million filings.
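To put those figures in proportion, here is a quick back-of-the-envelope calculation. It uses only the numbers quoted above; the full-dataset token and filing counts are the team's estimates, so the results are rough, and the constant names are made up for this illustration.

```python
# Figures quoted in the article. The full-dataset values are estimates.
SNAPSHOT_TOKENS = 152e9      # initial public snapshot
FULL_TOKENS_EST = 550e9      # projected size of the complete dataset
FULL_FILINGS_EST = 18.5e6    # projected number of filings

# Share of the projected corpus that the first snapshot covers.
snapshot_share = SNAPSHOT_TOKENS / FULL_TOKENS_EST

# Implied average filing length, assuming the estimates hold.
avg_tokens_per_filing = FULL_TOKENS_EST / FULL_FILINGS_EST

print(f"Snapshot share of projected total: {snapshot_share:.1%}")
print(f"Implied average tokens per filing: {avg_tokens_per_filing:,.0f}")
```

On those numbers, the snapshot is a bit over a quarter of the projected corpus, and the average filing works out to roughly 30,000 tokens.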
The project was led by Nick Bettencourt, affiliated with UCLA and collaborating with Stanford. It was announced on June 16, 2026.
Past extraction efforts routinely destroyed the structural and semantic components that make financial documents useful. Table hierarchies got flattened. Numeric signs disappeared. The subtle formatting that tells an analyst whether a number is a subtotal, a negative adjustment, or a footnote reference got stripped away.
SEFD’s MultiMarkdown approach preserves those elements. The team reports that structural accuracy exceeds 99% based on human evaluations. That matters because even small errors in financial data, such as a misplaced negative sign or a collapsed table hierarchy, can cascade into meaningfully wrong conclusions when processed by AI models.
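To see how easily a sign can vanish, consider the accounting convention of writing negatives in parentheses. The sketch below is purely illustrative and is not SEFD's pipeline: `parse_accounting_number` is a hypothetical helper, and the cell formats it handles are assumptions about what a filing table might contain.

```python
import re
from decimal import Decimal

# Accepts "1,234", "1,234.50", "1234", or "56.7" once signs and symbols are removed.
_NUMERIC = re.compile(r"^\d{1,3}(,\d{3})*(\.\d+)?$|^\d+(\.\d+)?$")

def parse_accounting_number(cell: str) -> Decimal:
    """Parse a cell such as '1,234', '(1,234)' or '$ (56.7)' into a signed Decimal."""
    text = cell.strip().replace("$", "").strip()
    # Parentheses mark a negative value in accounting notation.
    negative = text.startswith("(") and text.endswith(")")
    if negative:
        text = text[1:-1].strip()
    elif text.startswith("-"):
        negative = True
        text = text[1:].strip()
    if not _NUMERIC.match(text):
        raise ValueError(f"not a recognizable number: {cell!r}")
    value = Decimal(text.replace(",", ""))
    return -value if negative else value

# A naive cleanup that keeps only digits and dots silently drops the sign.
naive = Decimal(re.sub(r"[^\d.]", "", "(1,234)"))
careful = parse_accounting_number("(1,234)")
print(naive, careful)
```

The naive version turns a loss of 1,234 into a gain of 1,234, which is exactly the kind of silent error a structure-preserving format is meant to prevent.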
Another notable detail: the dataset has less than 0.1% overlap with Common Crawl-derived corpora. Most large language models are pretrained on massive internet scrapes, and Common Crawl is one of the biggest. Having almost zero overlap means SEFD offers genuinely novel training data that won’t just reinforce what models have already seen.
New benchmarks for financial AI
The dataset didn’t arrive alone. The team also introduced two benchmarks designed to test how well models can work with this kind of data.
EDGAR-Forecast is a numerical forecasting benchmark. It tests whether models can look at historical filing data and predict future financial metrics. EDGAR-OCR focuses on financial table transcription, essentially measuring how accurately a model can read and reproduce the structured tables that form the backbone of most SEC filings.
Why crypto investors should pay attention
An increasing number of publicly traded companies now hold Bitcoin on their balance sheets, issue crypto-related securities, or operate in the digital asset space. Their SEC filings contain disclosures about those activities. Better AI tools for analyzing those filings mean better tools for understanding what traditional finance companies are actually doing with crypto, how they’re accounting for it, and what risks they’re flagging to regulators.
The financial data industry is dominated by players like Bloomberg and Refinitiv that charge premium prices for structured data feeds. An open, high-quality dataset of SEC filings, projected to reach roughly 550 billion tokens once complete, could democratize access to the raw material that powers financial analysis.
The risk, as always with open datasets, is misuse, including treating the data as cleaner than it is. Structural accuracy above 99% is impressive, but a sub-1% error rate spread across roughly 18.5 million filings still represents a non-trivial number of potential inaccuracies. Anyone building production systems on SEFD will need robust validation layers, especially in domains like crypto where regulatory filings are already less standardized than traditional finance.
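What might one piece of such a validation layer look like? The sketch below is a minimal example under stated assumptions, not anything shipped with SEFD. `subtotal_is_consistent` is a hypothetical helper, the rounding tolerance is arbitrary, and the error-count line assumes the accuracy figure applies per filing, which the announcement does not specify.

```python
from decimal import Decimal

def subtotal_is_consistent(line_items, reported_total, tolerance=Decimal("0.5")):
    """Return True when the line items sum to the reported total within a rounding tolerance."""
    return abs(sum(line_items, Decimal("0")) - reported_total) <= tolerance

# A correctly extracted table: 1,200 - 350 + 75 = 925.
good_items = [Decimal("1200"), Decimal("-350"), Decimal("75")]
# The same table after an extraction error dropped the negative sign.
bad_items = [Decimal("1200"), Decimal("350"), Decimal("75")]

print(subtotal_is_consistent(good_items, Decimal("925")))
print(subtotal_is_consistent(bad_items, Decimal("925")))

# Rough upper bound on affected filings IF the error rate were 1% per filing.
# This unit is an assumption; the article only says accuracy exceeds 99%.
ESTIMATED_FILINGS = 18_500_000
max_affected = ESTIMATED_FILINGS // 100
print(f"{max_affected:,}")
```

A check like this catches only errors that break an arithmetic relationship in the table. Errors in standalone figures would need other cross-checks, such as comparing against the original filing.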