markowitz.data
Data layer for markowitz-optimizer.
Public surface area (re-exported here for ergonomic imports):
- Loaders: load_prices, compute_returns, align_universe, summary_stats
- Risk-free: risk_free_rate, de_annualize
- Factors: load_french_factors
- Calendars: trading_sessions, reindex_to_sessions
- Cache: cache_path, read_cache, write_cache
- Providers: PriceProvider, YFinanceProvider, CachedProvider, FallbackProvider
- Exceptions: DataLayerError and all subclasses
CacheCorruptError
Bases: DataLayerError
Raised when a cache file exists but cannot be deserialized.
Note: markowitz.data.cache.read_cache deliberately does not
raise this; it warns and returns None so callers can transparently
fall back to a fresh fetch. The exception exists for explicit callers
that want hard failures.
CacheCorruptionWarning
Bases: UserWarning
Emitted when a cache file exists but cannot be deserialized.
CachedProvider(inner: PriceProvider, root: str | os.PathLike[str])
CalendarMismatchError
Bases: DataLayerError
Raised when an index does not align with the expected trading calendar.
DataIntegrityError
Bases: DataLayerError
Raised when fetched data violates structural invariants.
Examples include duplicate timestamps, non-monotonic indices, unexpected NaNs in critical columns, or timezone inconsistencies.
DataLayerError
Bases: Exception
Base class for all errors raised by the markowitz.data package.
EmptyDataError
Bases: DataLayerError
Raised when a provider returns no rows for the requested window.
FallbackProvider(providers: list[PriceProvider])
Try each provider in order, advancing to the next one when a provider is rate-limited or unavailable.
Source code in src/markowitz/data/providers.py
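The fallback behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the package's code: the exception classes are local stand-ins for the ones documented on this page, `fetch_with_fallback`, `Throttled`, and `Static` are hypothetical names, and raising `ProviderUnavailableError` once every provider has failed is an assumption.

```python
# Local stand-ins for the exception classes documented on this page.
class DataLayerError(Exception):
    pass


class RateLimitError(DataLayerError):
    pass


class ProviderUnavailableError(DataLayerError):
    pass


def fetch_with_fallback(providers, ticker, start, end):
    """Return the first successful fetch, skipping throttled/unavailable providers."""
    last_error = None
    for provider in providers:
        try:
            return provider.fetch(ticker, start, end)
        except (RateLimitError, ProviderUnavailableError) as exc:
            last_error = exc  # remember the failure and try the next provider
    # Assumption: when every provider fails, surface one "unavailable" error.
    raise ProviderUnavailableError("all providers failed") from last_error


class Throttled:
    """Hypothetical provider that is always rate-limited."""

    def fetch(self, ticker, start, end):
        raise RateLimitError("slow down")


class Static:
    """Hypothetical provider returning made-up rows instead of a DataFrame."""

    def fetch(self, ticker, start, end):
        return [(start, 100.0), (end, 101.0)]


rows = fetch_with_fallback([Throttled(), Static()], "SPY", "2024-01-02", "2024-01-03")
```

Other exception types are left to propagate in this sketch, so a genuine data problem is not masked by silently moving on to the next source.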
InsufficientDataError
Bases: DataLayerError
Raised when fewer observations are available than required.
PriceProvider
Bases: Protocol
Protocol every price source must satisfy.
fetch(ticker: str, start: DateLike, end: DateLike, *, frequency: str = '1d') -> pd.DataFrame
Return a frame indexed by tz-naive midnight timestamps, with a single close column.
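Because PriceProvider is a Protocol, a class satisfies it structurally, without inheriting from it. The sketch below shows the idea with a stand-in protocol: `PriceProviderLike` and `InMemoryProvider` are hypothetical names, the return type is loosened to `Any` so the example runs without pandas, and the prices are made up.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class PriceProviderLike(Protocol):
    """Stand-in mirroring the documented fetch signature (return type loosened)."""

    def fetch(self, ticker: str, start: Any, end: Any, *, frequency: str = "1d") -> Any:
        ...


class InMemoryProvider:
    """Hypothetical provider serving made-up rows keyed by ISO date strings."""

    def __init__(self, rows: dict) -> None:
        self._rows = rows

    def fetch(self, ticker, start, end, *, frequency="1d"):
        # ISO-8601 date strings compare correctly as plain strings.
        return [(d, px) for d, px in self._rows.get(ticker, []) if start <= d <= end]


provider = InMemoryProvider(
    {"SPY": [("2024-01-02", 100.0), ("2024-01-03", 101.0), ("2024-01-04", 99.5)]}
)
# runtime_checkable only verifies that a `fetch` attribute exists, not its signature.
conforms = isinstance(provider, PriceProviderLike)
window = provider.fetch("SPY", "2024-01-03", "2024-01-04")
```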
ProviderUnavailableError
Bases: DataLayerError
Raised when an upstream data provider is unreachable or misconfigured.
This includes optional dependencies that are not installed, missing credentials, transient network failures, and HTTP 5xx responses.
RateLimitError
Bases: DataLayerError
Raised when an upstream provider signals a rate-limit condition.
YFinanceProvider()
Provider backed by yfinance.download.
Source code in src/markowitz/data/providers.py
align_universe(returns: pd.DataFrame, *, min_obs_per_ticker: int = 252, common_window: bool = True) -> pd.DataFrame
Drop tickers with sparse history; optionally restrict to common window.
Source code in src/markowitz/data/loaders.py
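One plausible reading of the two steps, sketched with plain dictionaries instead of a DataFrame. The helper `align_sketch` is hypothetical, and interpreting "common window" as the span from the latest first observation to the earliest last observation is an assumption, not something this page confirms.

```python
def align_sketch(returns: dict, min_obs: int, common_window: bool = True) -> dict:
    """Drop sparse tickers, then optionally clip all tickers to a shared date span."""
    # Keep only actual observations (None marks a missing return).
    kept = {t: {d: r for d, r in s.items() if r is not None} for t, s in returns.items()}
    # Step 1: drop tickers with fewer than min_obs observations.
    kept = {t: s for t, s in kept.items() if len(s) >= min_obs}
    if common_window and kept:
        # Step 2 (assumed meaning): latest start and earliest end across survivors.
        start = max(min(s) for s in kept.values())
        end = min(max(s) for s in kept.values())
        kept = {t: {d: r for d, r in s.items() if start <= d <= end} for t, s in kept.items()}
    return kept


# Made-up returns keyed by ISO date strings.
returns = {
    "AAA": {"2024-01-02": 0.01, "2024-01-03": -0.02, "2024-01-04": 0.00,
            "2024-01-05": 0.01, "2024-01-08": 0.02},
    "BBB": {"2024-01-02": None, "2024-01-03": None, "2024-01-04": 0.03,
            "2024-01-05": -0.01, "2024-01-08": 0.00, "2024-01-09": 0.01},
    "CCC": {"2024-01-02": 0.05},
}
aligned = align_sketch(returns, min_obs=3)
```

Here CCC is dropped for having a single observation, and AAA and BBB are both clipped to 2024-01-04 through 2024-01-08.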
annualize_rate(per_period: pd.Series, frequency: str) -> pd.Series
Inverse of de_annualize. Provided for round-trip testing.
cache_path(root: str | os.PathLike[str], ticker: str, source: str, frequency: str) -> Path
Return the canonical cache file path for a (ticker, source, frequency) triple.
The path is <root>/<source>/<frequency>/<ticker>.parquet with
each component sanitized. The parent directories are not created
by this function.
Source code in src/markowitz/data/cache.py
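The layout above can be sketched with pathlib. The sanitization rule shown here (replace anything outside letters, digits, dot, dash, and underscore with an underscore) is purely an assumption for illustration; the package's actual rule is not documented on this page, and `sketch_cache_path` is a hypothetical name.

```python
import re
from pathlib import Path


def _sanitize(component: str) -> str:
    # Hypothetical rule: anything outside [A-Za-z0-9._-] becomes "_".
    return re.sub(r"[^A-Za-z0-9._-]", "_", component)


def sketch_cache_path(root, ticker: str, source: str, frequency: str) -> Path:
    """Build <root>/<source>/<frequency>/<ticker>.parquet without touching the disk."""
    return Path(root) / _sanitize(source) / _sanitize(frequency) / f"{_sanitize(ticker)}.parquet"


# A ticker containing a path separator stays within a single file name.
path = sketch_cache_path("cache", "BRK/B", "yfinance", "1d")
```

Like the documented function, this only computes a path; creating the parent directories is left to whoever writes the file.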
compute_returns(prices: pd.DataFrame, *, method: str = 'simple', dropna: bool = True) -> pd.DataFrame
Convert a price panel into period-over-period returns.
method="simple" is the Markowitz default. method="log" is
provided for portfolio diagnostics and for users who prefer
continuously compounded returns.
Source code in src/markowitz/data/loaders.py
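The difference between the two methods is easiest to see on a tiny made-up price series. This sketch uses plain lists rather than a DataFrame.

```python
import math

prices = [100.0, 102.0, 99.96]  # made-up prices for illustration

# Simple returns: p_t / p_{t-1} - 1
simple = [p1 / p0 - 1.0 for p0, p1 in zip(prices, prices[1:])]

# Log returns: ln(p_t / p_{t-1}), equivalently ln(1 + simple return)
log = [math.log(p1 / p0) for p0, p1 in zip(prices, prices[1:])]

# Log returns add up over time to the log of the total price ratio.
total_log = sum(log)
```

Simple returns aggregate linearly across assets (a portfolio's return is the weighted sum of its constituents' simple returns), which is what mean-variance optimization relies on; log returns instead add across time.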
constant_rate(value: float, index: pd.DatetimeIndex) -> pd.Series
Construct a constant-rate series — useful for tests and quick demos.
Source code in src/markowitz/data/risk_free.py
de_annualize(annual_pct: pd.Series, frequency: str) -> pd.Series
Convert an annual rate (already in decimal form, e.g. 0.05) to a per-period rate.
Uses geometric de-annualization:
    r_period = (1 + r_annual) ** (1 / periods_per_year) - 1
so a round trip through annualize_rate is exact up to floating-point rounding.
Source code in src/markowitz/data/risk_free.py
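A scalar sketch of the formula and its inverse. The periods-per-year table is an assumption (252 trading days is a common convention, but the package's mapping is not shown here), and the two function names are hypothetical scalar stand-ins for the Series-based functions.

```python
import math

# Assumed mapping; the package's own table is not documented on this page.
PERIODS_PER_YEAR = {"daily": 252, "weekly": 52, "monthly": 12}


def de_annualize_scalar(r_annual: float, frequency: str) -> float:
    """Geometric de-annualization: (1 + r_annual) ** (1 / n) - 1."""
    n = PERIODS_PER_YEAR[frequency]
    return (1.0 + r_annual) ** (1.0 / n) - 1.0


def annualize_scalar(r_period: float, frequency: str) -> float:
    """Inverse operation: (1 + r_period) ** n - 1."""
    n = PERIODS_PER_YEAR[frequency]
    return (1.0 + r_period) ** n - 1.0


r_daily = de_annualize_scalar(0.05, "daily")
round_trip = annualize_scalar(r_daily, "daily")
```

The geometric per-period rate is slightly smaller than the arithmetic 0.05 / 252, because compounding makes up the difference over the year.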
load_french_factors(dataset: str, *, frequency: str = 'monthly', cache_root: str | os.PathLike[str] | None = None) -> pd.DataFrame
Download (or read from cache) a Ken French factor table.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dataset | str | Dataset identifier, e.g. | required |
| frequency | str | | 'monthly' |
| cache_root | str \| PathLike[str] \| None | Optional directory used to cache the raw ZIP payload. | None |
Source code in src/markowitz/data/french.py
load_prices(tickers: str | Iterable[str], start: DateLike, end: DateLike, *, provider: PriceProvider | None = None, frequency: str = '1d', min_history_sessions: int = 252) -> pd.DataFrame
Fetch adjusted close prices for tickers as a wide DataFrame.
Each column is a ticker; the index is a tz-naive, monotonic,
unique pandas.DatetimeIndex aligned across tickers via an
outer join. Tickers without at least min_history_sessions
non-NaN observations raise InsufficientDataError.
Source code in src/markowitz/data/loaders.py
read_cache(root: str | os.PathLike[str], ticker: str, source: str, frequency: str) -> pd.DataFrame | None
Read a cached dataframe, returning None on miss or corruption.
This function never raises; corruption is signalled via a
CacheCorruptionWarning and a None return so callers can
transparently fall back to re-fetching.
Source code in src/markowitz/data/cache.py
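The warn-and-return-None pattern can be sketched as below. To stay dependency-free the example reads JSON rather than Parquet; `read_json_cache` is a hypothetical helper and the warning class is a local stand-in for the documented one.

```python
import json
import tempfile
import warnings
from pathlib import Path


class CacheCorruptionWarning(UserWarning):
    """Local stand-in for the package's warning class."""


def read_json_cache(path: Path):
    """Return the cached payload, or None on a miss or an unreadable file."""
    if not path.exists():
        return None  # plain cache miss: no warning
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError) as exc:  # JSONDecodeError subclasses ValueError
        warnings.warn(f"corrupt cache file {path}: {exc}", CacheCorruptionWarning, stacklevel=2)
        return None


with tempfile.TemporaryDirectory() as root:
    bad = Path(root) / "SPY.json"
    bad.write_text("{not valid json", encoding="utf-8")
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        miss = read_json_cache(Path(root) / "absent.json")
        corrupt = read_json_cache(bad)
```

A caller sees None in both cases and can re-fetch; only the corrupt file leaves a warning behind for logs or tests to pick up.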
reindex_to_sessions(df: pd.DataFrame, *, exchange: str = 'XNYS', on_missing: str = 'raise', max_gap_sessions: int = 1) -> pd.DataFrame
Align df to the canonical session calendar.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Input frame indexed by tz-naive midnight timestamps. | required |
| exchange | str | Exchange MIC code understood by | 'XNYS' |
| on_missing | str | | 'raise' |
| max_gap_sessions | int | Maximum number of consecutive missing sessions tolerated when | 1 |
Source code in src/markowitz/data/calendars.py
risk_free_rate(start: DateLike, end: DateLike, *, series_id: str = 'DGS1MO', frequency: str = 'daily', api_key: str | None = None) -> pd.Series
Fetch a FRED risk-free series and return it at frequency.
The FRED yield series are quoted in annualized percent (e.g. 5.32
means 5.32% per year). This function converts to decimal,
forward-fills any missing days inside the window, and resamples /
de-annualizes to the requested frequency.
Source code in src/markowitz/data/risk_free.py
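The conversion steps described above (percent to decimal, forward-fill, de-annualize) might look like this on a plain list. The helper name is hypothetical, the quotes are made up, and both the 252-period year and the use of geometric de-annualization here are assumptions carried over from de_annualize.

```python
def percent_quotes_to_daily(quotes, periods_per_year: int = 252):
    """Convert annualized-percent quotes (None = missing) to per-day decimal rates."""
    out = []
    last = None
    for quote in quotes:
        if quote is not None:
            last = quote  # newest observed quote, carried forward over gaps
        if last is None:
            out.append(None)  # nothing observed yet, so nothing to forward-fill
            continue
        annual = last / 100.0  # 5.32 (percent) -> 0.0532 (decimal)
        out.append((1.0 + annual) ** (1.0 / periods_per_year) - 1.0)
    return out


# Middle day is missing and inherits the 5.32% quote.
daily = percent_quotes_to_daily([5.32, None, 5.30])
```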
summary_stats(returns: pd.DataFrame, *, frequency: str = 'daily', annualize: bool = True) -> pd.DataFrame
Per-asset summary statistics used as covariance / mean targets.
Returned columns: mu_hat, sigma_hat, sharpe, skew,
kurtosis, n_obs, min, max.
When annualize=True (default), mu_hat is scaled by the
periods-per-year factor and sigma_hat by its square root.
Source code in src/markowitz/data/loaders.py
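The annualization rule can be checked by hand on a short made-up return series. This sketch assumes 252 periods per year for daily data and a sample standard deviation; neither choice is confirmed by this page.

```python
import math
import statistics

PERIODS_PER_YEAR = 252  # assumed factor for daily data

daily_returns = [0.001, -0.002, 0.003, 0.0005, -0.001]  # made-up values

mu_daily = statistics.fmean(daily_returns)
sigma_daily = statistics.stdev(daily_returns)  # sample std; an assumption

# The mean scales linearly with the number of periods...
mu_hat = mu_daily * PERIODS_PER_YEAR
# ...while volatility scales with its square root, since variance scales linearly.
sigma_hat = sigma_daily * math.sqrt(PERIODS_PER_YEAR)
```

The square-root rule follows from treating per-period returns as uncorrelated, so that variances add across periods.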
trading_sessions(start: str | _dt.date | pd.Timestamp, end: str | _dt.date | pd.Timestamp, *, exchange: str = 'XNYS') -> pd.DatetimeIndex
Return tz-naive midnight DatetimeIndex of trading sessions.
Uses pandas_market_calendars when installed. Falls back to a
NYSE weekday-minus-fixed-holidays approximation otherwise.
Source code in src/markowitz/data/calendars.py
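A weekday-minus-fixed-holidays approximation can be sketched with the standard library alone. The helper name and the three-date holiday set are illustrative assumptions; the package's actual holiday list is not shown on this page.

```python
import datetime as dt


def approx_sessions(start: dt.date, end: dt.date, holidays: frozenset) -> list:
    """Return weekdays in [start, end] that are not in the given holiday set."""
    sessions = []
    day = start
    while day <= end:
        if day.weekday() < 5 and day not in holidays:  # Monday=0 ... Friday=4
            sessions.append(day)
        day += dt.timedelta(days=1)
    return sessions


# Illustrative fixed-date holidays only; not the package's list.
FIXED_HOLIDAYS = frozenset(
    {dt.date(2024, 1, 1), dt.date(2024, 7, 4), dt.date(2024, 12, 25)}
)

sessions = approx_sessions(dt.date(2024, 7, 1), dt.date(2024, 7, 8), FIXED_HOLIDAYS)
```

An approximation like this misses floating holidays and early closes, which is why a real exchange calendar is preferable when it is available.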
write_cache(df: pd.DataFrame, root: str | os.PathLike[str], ticker: str, source: str, frequency: str) -> Path
Atomically write df to the cache and return the final path.
The dataframe is serialized via PyArrow Parquet. The write is
performed to a temporary file in the same directory as the target,
then renamed with os.replace, which is atomic on POSIX and
on Windows (Python 3.3+) when source and destination are on the
same volume.
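The temp-file-then-rename pattern can be sketched as follows. Raw bytes stand in for the Parquet serialization, `atomic_write_bytes` is a hypothetical helper, and creating missing parent directories here is an assumption about write_cache rather than documented behaviour.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_bytes(payload: bytes, target: Path) -> Path:
    """Write payload to a sibling temp file, then atomically rename it onto target."""
    target.parent.mkdir(parents=True, exist_ok=True)  # assumption, see lead-in
    # Same directory as the target, so the rename never crosses volumes.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, prefix=target.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(payload)
            fh.flush()
            os.fsync(fh.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_name, target)  # readers see the old file or the new one
    except BaseException:
        # Clean up the temp file if anything failed before the rename completed.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
    return target


with tempfile.TemporaryDirectory() as root:
    out = atomic_write_bytes(b"demo", Path(root) / "yfinance" / "1d" / "SPY.parquet")
    written = out.read_bytes()
    leftovers = [p.name for p in out.parent.iterdir() if p.name.endswith(".tmp")]
```

Writing into the target's own directory is what keeps the rename on one volume, which is the condition the atomicity guarantee depends on.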