markowitz.data_providers.sp500_universe

Survivorship-bias-aware S&P 500 membership builder.

A truly bias-free constituent history would require a paid index-rebalance feed (S&P Dow Jones Indices, CRSP). Instead, we approximate it by intersecting a recent snapshot of the index members with the Polygon grouped-daily snapshot for each as-of date: a ticker counts as a member if and only if it appears in CURRENT_SP500 and has a real trading bar on the requested day.

This is materially better than the naive "today's-list on yesterday's-date" approach because:

  • Symbols that had not yet IPO'd by the as-of date drop out (no grouped-daily row), which prevents look-ahead leakage from the modern constituent list.
  • Every returned ticker is guaranteed to have same-day OHLCV available, which is the dominant correctness concern in walk-forward backtests.

Known limitations

  • Companies that were once in the index but have since been delisted or acquired (Lehman, EMC, Sprint, ...) are missing. That is the pure "survivor" blind spot and biases backtests upward on average.
  • Modern names that joined the index only after they had already been trading publicly (e.g. mid-2010s tech IPOs) are over-included between their listing date and their real index entry date.
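
The intersection rule and both limitations can be illustrated with a small, self-contained sketch. The names CURRENT_LIST, TRADED_ON and approximate_members are hypothetical stand-ins, and the ticker sets are illustrative rather than real snapshot data; this is not the module's implementation.

```python
from datetime import date

# Hypothetical stand-in for the static list of current index members.
CURRENT_LIST = ["AAPL", "MSFT", "UBER"]

# Hypothetical stand-in for grouped-daily: tickers with a trading bar per day.
TRADED_ON = {
    date(2008, 6, 30): {"AAPL", "MSFT", "LEH"},   # UBER not yet listed
    date(2020, 6, 30): {"AAPL", "MSFT", "UBER"},  # LEH long gone
}


def approximate_members(current, traded):
    """Keep tickers that are in the current list AND traded on that day."""
    return sorted(t for t in current if t in traded)


members_2008 = approximate_members(CURRENT_LIST, TRADED_ON[date(2008, 6, 30)])
members_2020 = approximate_members(CURRENT_LIST, TRADED_ON[date(2020, 6, 30)])

print(members_2008)  # ['AAPL', 'MSFT']
print(members_2020)  # ['AAPL', 'MSFT', 'UBER']
```

In this toy data UBER drops out of the 2008 result because it has no bar yet, which is the look-ahead protection. LEH is never recovered because it is absent from the current list, which is the survivor blind spot. UBER appears in the 2020 result purely because it traded that day and is on the current list, which is the over-inclusion case.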

When no Polygon provider is supplied, the builder emits a warning and returns the static CURRENT_SP500 list as-is. That path is explicitly survivorship-biased and should only be used for offline demos.

Caching

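Membership lists are cached in memory, keyed by the as-of date. Fallback and empty results are cached as well, so a failed fetch is not retried for the same date within one builder instance. Nothing is written to a database; long-running services that need persistence should layer their own store on top.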
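
One possible way to layer a store on top is a small file-backed wrapper around any as-of lookup. The sketch below is only an illustration under assumptions: PersistentMembershipStore, its JSON file layout and the fake_lookup function are all hypothetical and not part of the package. In real use the lookup argument would presumably be a builder's get_membership_as_of method.

```python
import json
import tempfile
from datetime import date
from pathlib import Path


class PersistentMembershipStore:
    """Hypothetical JSON-file cache in front of an as-of membership lookup."""

    def __init__(self, lookup, path):
        self._lookup = lookup  # callable: date -> list[str]
        self._path = Path(path)
        self._data = {}
        if self._path.exists():
            self._data = json.loads(self._path.read_text())

    def get(self, day):
        key = day.isoformat()  # JSON object keys must be strings
        if key not in self._data:
            self._data[key] = list(self._lookup(day))
            self._path.write_text(json.dumps(self._data, sort_keys=True))
        return list(self._data[key])  # copy so callers cannot mutate the store


calls = []


def fake_lookup(day):
    """Stand-in for the real lookup; records each call."""
    calls.append(day)
    return ["AAPL", "MSFT"]


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "membership.json"
    first = PersistentMembershipStore(fake_lookup, path)
    before_restart = first.get(date(2020, 6, 30))
    second = PersistentMembershipStore(fake_lookup, path)  # simulates a restart
    after_restart = second.get(date(2020, 6, 30))

print(len(calls))  # 1: the second store was served from the file
```

Because fallback results are cached like any other, a store of this kind would also persist a survivorship-biased static list if the first lookup for a date failed. A production version may want to skip writing such entries.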

SP500UniverseBuilder(provider: PolygonProvider | YFinanceProvider | None = None)

Builds and caches point-in-time S&P 500 membership snapshots.

Parameters:

Name: provider
Type: PolygonProvider | YFinanceProvider | None
Description: Either a PolygonProvider (preferred; enables point-in-time membership via grouped-daily) or a YFinanceProvider (fallback; emits a warning and returns the survivorship-biased static list).
Default: None
Source code in src/markowitz/data_providers/sp500_universe.py
def __init__(
    self,
    provider: PolygonProvider | YFinanceProvider | None = None,
) -> None:
    self._provider = provider
    self._cache: dict[date, list[str]] = {}
    self._warned_fallback = False

get_membership_as_of(date_: date) -> list[str]

Return the approximated S&P 500 membership on date_.

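When the configured provider exposes a working get_grouped_daily (Polygon path), the result is the intersection of CURRENT_SP500 with the symbols that actually traded on date_. When there is no provider or the grouped-daily fetch raises (for example on the yfinance fallback), the static list is returned and a warning is emitted on the first such call. If the fetch succeeds but comes back empty for date_, a warning is logged and an empty list is returned instead. Results are cached per date, and each call returns a fresh copy of the list.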

Source code in src/markowitz/data_providers/sp500_universe.py
def get_membership_as_of(self, date_: date) -> list[str]:
    """Return the approximated S&P 500 membership on ``date_``.

    When the configured provider exposes a working ``get_grouped_daily``
    (Polygon path), the result is the intersection of :data:`CURRENT_SP500`
    with the symbols that actually traded on ``date_``. When there is no
    provider or the fetch raises (e.g. yfinance fallback), the static list
    is returned and a warning is emitted on the first such call. When
    grouped-daily is empty, an empty list is returned with a warning.
    """
    cached = self._cache.get(date_)
    if cached is not None:
        return list(cached)

    if self._provider is None:
        members = self._fallback_members()
    else:
        try:
            grouped = self._provider.get_grouped_daily(date_)
        except Exception as exc:
            logger.warning(
                "grouped-daily fetch failed for %s (%s); using static list",
                date_,
                exc,
            )
            members = self._fallback_members()
        else:
            if grouped.empty:
                logger.warning(
                    "grouped-daily empty for %s; returning empty membership",
                    date_,
                )
                members = []
            else:
                active = set(grouped.index)
                members = sorted(t for t in CURRENT_SP500 if t in active)

    self._cache[date_] = list(members)
    return list(members)
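
The four outcomes of the method above can be exercised without network access. The sketch below restates the branch logic as a free function. STATIC_LIST, membership_as_of and the fetch_traded callable are hypothetical stand-ins, with a plain collection of tickers replacing the grouped-daily frame; it is not the module's code.

```python
import logging
from datetime import date

logger = logging.getLogger(__name__)

# Hypothetical stand-in for the static list of current index members.
STATIC_LIST = ["AAPL", "MSFT", "UBER"]


def membership_as_of(fetch_traded, day):
    """Restate the branches: fetch_traded(day) yields tickers with a bar."""
    if fetch_traded is None:
        return list(STATIC_LIST)  # no provider: biased static list
    try:
        traded = set(fetch_traded(day))
    except Exception as exc:
        logger.warning("fetch failed for %s (%s); using static list", day, exc)
        return list(STATIC_LIST)  # fetch error: same fallback
    if not traded:
        return []  # nothing traded, presumably a non-trading day
    return sorted(t for t in STATIC_LIST if t in traded)


def failing_fetch(day):
    """Mimics a provider that lacks a grouped-daily endpoint."""
    raise AttributeError("provider has no get_grouped_daily")


saturday = date(2020, 6, 27)
tuesday = date(2020, 6, 30)
results = {
    "no_provider": membership_as_of(None, tuesday),
    "fetch_error": membership_as_of(failing_fetch, tuesday),
    "non_trading_day": membership_as_of(lambda d: [], saturday),
    "trading_day": membership_as_of(lambda d: ["AAPL", "MSFT", "IBM"], tuesday),
}
for name, members in results.items():
    print(name, members)
```

An empty list and the full static list mean very different things to a backtest, so callers may want to treat an empty result as "skip this date" rather than as "no constituents".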

get_membership_window(start: date, end: date, freq: str = 'ME') -> dict[date, list[str]]

Build membership at each rebalance date in [start, end].

freq follows pandas offset aliases; the default ME means month-end, matching the cadence used by most monthly walk-forward backtests. When [start, end] contains no anchor for freq, the window degenerates to the de-duplicated pair {start, end}, so callers always get at least one anchor back (two when start != end).

Source code in src/markowitz/data_providers/sp500_universe.py
def get_membership_window(
    self,
    start: date,
    end: date,
    freq: str = "ME",
) -> dict[date, list[str]]:
    """Build membership at each rebalance date in ``[start, end]``.

    ``freq`` follows pandas offset aliases; default ``ME`` = month-end,
    matching the cadence used by most monthly walk-forward backtests.
    When ``[start, end]`` contains no ``freq`` anchor the window degenerates
    to the unique values of ``{start, end}``, so callers always get at
    least one anchor back (two when ``start != end``).
    """
    if start > end:
        raise ValueError(f"start ({start}) must be <= end ({end})")
    anchors = pd.date_range(start=start, end=end, freq=freq)
    if len(anchors) == 0:
        anchors = pd.DatetimeIndex(
            [pd.Timestamp(start), pd.Timestamp(end)]
        ).unique()
    result: dict[date, list[str]] = {}
    for ts in anchors:
        d = ts.date() if hasattr(ts, "date") else ts
        result[d] = self.get_membership_as_of(d)
    return result
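
The anchor selection can be previewed without pandas. The sketch below is a stdlib-only approximation of what pd.date_range(start, end, freq="ME") followed by the degenerate-window fallback would yield. The function month_end_anchors is a hypothetical helper for illustration, not part of the package. Note that the ME alias itself requires pandas 2.2 or newer.

```python
import calendar
from datetime import date


def month_end_anchors(start, end):
    """Month-end dates within [start, end], else the de-duplicated endpoints."""
    if start > end:
        raise ValueError(f"start ({start}) must be <= end ({end})")
    anchors = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        last_day = calendar.monthrange(year, month)[1]
        candidate = date(year, month, last_day)
        if start <= candidate <= end:
            anchors.append(candidate)
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    if not anchors:
        # Degenerate window: start == end collapses to a single anchor.
        anchors = list(dict.fromkeys([start, end]))
    return anchors


quarter = month_end_anchors(date(2024, 1, 15), date(2024, 4, 10))
short = month_end_anchors(date(2024, 3, 4), date(2024, 3, 15))
single = month_end_anchors(date(2024, 3, 4), date(2024, 3, 4))

print(quarter)  # month-ends for January, February and March 2024
print(short)    # both endpoints, since no month-end falls inside
print(single)   # one anchor only
```

Calendar month-ends can land on non-trading days: 2024-03-31 in the first result is a Sunday, so a lookup for that anchor would presumably take the empty grouped-daily branch. A business-month-end alias such as BME avoids weekends, though not exchange holidays.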