Improved Delivery Coverage
TL;DR: Folder names stay the same and represent when the file was uploaded (UTC). Each folder should contain only signals that are new since the previous folder, with no duplicates. We now ship stragglers that were previously lost to processing delays. Every signal already has a unique signal_id, so customers who deduplicate on it will automatically get more data with no code changes.
What's changing
Better coverage, more signals.
Our pipeline now tracks each entity individually to ensure signals aren't lost between delivery cycles. Previously, processing delays could cause signals to fall through the cracks permanently. Now they ship in the next available delivery.
This was a silent data loss problem.
Signals that hit processing delays (LLM retries, late third-party data, API failures) would simply never appear. With this change, those signals land in the next batch. Since signal_id is already globally unique, customers who deduplicate on it benefit automatically.
We considered changing folder naming but decided against it. The current format works, and signal-level date fields (e.g. data.filing_date on SEC filings, data.posted_date on LinkedIn posts) handle time-series filtering better than any folder name could. See individual signal schema pages for the relevant date field per signal type.
How to use deliveries
- Pull the latest folder. Everything inside is new since the previous folder.
- Deduplicate on `signal_id`. Globally unique, never repeated across deliveries.
- For time-series filtering, use the signal's own date field, not the folder name.
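The steps above can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: the JSONL layout (one signal object per line, with a top-level `signal_id` key) is taken from this announcement, while the `dedupe_signals` helper name and the example folder path are hypothetical.

```python
import json

def dedupe_signals(jsonl_text, seen_ids):
    """Yield signals whose signal_id has not been seen in earlier deliveries.

    signal_id is globally unique and never repeated across deliveries,
    so a simple set membership check is sufficient.
    """
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        signal = json.loads(line)
        if signal["signal_id"] not in seen_ids:
            seen_ids.add(signal["signal_id"])
            yield signal

# Usage sketch (example folder name follows the YYYY-MM-DD-HH-MM-SS format):
#   from pathlib import Path
#   latest = Path("deliveries/2024-01-15-06-00-00/output.jsonl")
#   new_signals = list(dedupe_signals(latest.read_text(), seen_ids))
```

Persisting `seen_ids` between delivery cycles (a database table or key-value store) is what makes the dedup safe across restarts; the in-memory set here is only for illustration.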
What stays the same
- Delivery schedule, frequency, and buckets
- File format (`output.jsonl` + `output.parquet`)
- Authentication and service accounts
- Signal schema and field names
- Folder name format (`YYYY-MM-DD-HH-MM-SS`)
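Because the folder name format is unchanged, any existing logic that parses it keeps working. A minimal sketch, assuming (per the TL;DR) that folder names encode the UTC upload time; the helper name is hypothetical:

```python
from datetime import datetime, timezone

def parse_folder_timestamp(folder_name):
    """Parse a delivery folder name like '2024-01-15-06-00-00' as a UTC datetime."""
    naive = datetime.strptime(folder_name, "%Y-%m-%d-%H-%M-%S")
    return naive.replace(tzinfo=timezone.utc)
```

Note this timestamp is only useful for ordering deliveries and finding the latest folder; for filtering the signals themselves, use the signal-level date fields described above.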
