40 Exercises on Stock Data Processing with Polars

Starting Here...¶

Why Polars, Not Pandas?

If you’re coming from a Pandas background, you might wonder why this book uses Polars instead. Here are the key reasons:

1. Speed and Performance

Polars is significantly faster than Pandas, especially on large datasets. It’s built from scratch in Rust with performance in mind, leveraging:

Parallel execution: Automatically uses all CPU cores
Lazy evaluation: Optimizes the entire query before execution
Zero-copy operations: Minimizes memory overhead
SIMD vectorization: Faster numerical operations

In practical terms, Polars is often 5-10x faster than Pandas for typical data wrangling tasks, and the performance gap widens with larger datasets.

2. Modern and Intuitive API

Polars offers a clean, expressive API with method chaining that feels natural:

(
    data
    .filter(c.price > 100)
    .with_columns(return_pct=c.close / c.open - 1)
    .group_by(c.date)
    .agg(avg_return=c.return_pct.mean())
)

3. Better Memory Management

Unlike Pandas, Polars:

Uses Apache Arrow memory format for efficient data sharing
Handles out-of-memory datasets with lazy evaluation
Has no index confusion (a common pain point in Pandas)

4. Strong Type System

Polars has a strict type system that catches errors early and makes your code more robust. No more silent type conversions or unexpected NaN propagation.

5. Production-Ready

While Polars is relatively newer, it’s production-ready and actively maintained with a vibrant community. Many companies have already migrated from Pandas to Polars for production workloads.

When to Still Use Pandas?

That said, Pandas still has its place:

When you need compatibility with older codebases
When using libraries that only support Pandas DataFrames
For quick one-off analyses where performance doesn’t matter

For business research and data-intensive tasks, Polars is the superior choice in 2026.

About the Code Comments

For instruction purpose, I tried to provide as many comments as possible, which may render the code unnecessarily long and scary. For example (Question 6):

(
    data
    # filter stocks
    .filter((c.retx >= 0.05) | (c.retx <= -0.05))
    # create `direction` to indicate price change direction
    .with_columns(
        direction=pl.when(c.retx >= 0.05)
        .then('up5%+')
        .otherwise('down5%-')
    )
    # count the stocks
    .group_by(c.date, c.direction)
    .agg(num_stk=c.permno.n_unique())
    .head(5)
)

However, if you take out the comments and rearrange the code, it’ll be much more friendly:

(
    data
    .filter((c.retx >= 0.05) | (c.retx <= -0.05))
    .with_columns(direction=pl.when(c.retx >= 0.05).then('up5%+').otherwise('down5%-'))
    .group_by(c.date, c.direction)
    .agg(num_stk=c.permno.n_unique())
    .head(5)
)

Column Reference Methods in Polars

Polars provides several ways to reference columns, all of which are valid:

Method 1: String reference

pl.col('var')

Method 2: Dot notation

pl.col.var

Method 3: Shortcut alias (used in this book)

from polars import col as c
c.var

Why we use c.var in this book:

While all three methods work identically, we use the shortcut c.var throughout this book because:

Fewer keystrokes: c.var is shorter than pl.col('var') or pl.col.var
Cleaner code: Makes complex expressions more readable
Consistent style: All examples use the same pattern

Important: Make sure you have from polars import col as c in your imports (see the “Load Packages” section) for c.var to work!

Load the Packages¶

We will only use polars in this chapter.

import polars as pl

from polars import col as c

Inspect the Data¶

We select the price and trading data of all US common stocks from the CRSP database between 2023-01-01 and 2023-06-30. Each row represents a stock’s daily observation. There’re in total 4221 unique stocks with about 465k rows.

# read data
data = pl.read_ipc("data/crsp.feather", memory_map=False, use_pyarrow=True)
data = data.with_columns(c.permno.cast(int))

# print the first 3 rows
data.head(3)

This is “panel data,” meaning it contains multiple stocks (cross-sectional) with each stock having multiple observations ordered by date (time-series).
Each combination of permno (stock ID) and date determines a unique observation, i.e., they’re the key.
The dataset is first sorted by permno, and then date
Column definition:
- permno: The unique ID provided by CRSP.
- date: Date
- ticker: Ticker. Note that the ticker of a stock can change, and hence it’s not recommended to use ticker as stock ID.
- industry: The industry classification determined using the first two digits of NAICS industry classification code. See the official document for more details.
- is_sp500: True if the stock is in the S&P 500 list by mid-2023; False if not.
- open: Daily open
- high: Daily high
- low: Daily low
- cap: Daily market capital (in $1000)
- dlyfacprc: Daily factor to adjust for price. (We won’t use it in the practice.)
- vol: Trading volume (number of shares traded)
- prcvol: Dollar trading volume, calculated as the number of shares traded multiplied by the price per share.
- retx: Daily return (excluding dividend).

Find out all stocks whose ticker contains `"COM"`¶

Why This Matters:

String pattern matching on ticker symbols is a fundamental skill in financial data analysis that goes beyond simple filtering. In practice, analysts often need to identify groups of companies based on naming conventions—for example, ETFs often use systematic ticker patterns (SPY, QQQ), preferred shares may contain “PR” in their symbols, or ADRs might follow specific naming schemes. Additionally, when working with corporate actions like mergers, spin-offs, or ticker changes, you may need to track down securities using partial name matches.

(
    data
    .filter(c.ticker.str.contains('COM'))
    .select([c.permno, c.ticker])
    .unique()
)

We use Polars’ string methods to find out all tickers that contain "COM". But that’s only half the way. Without deduplicating, we will end up with numerous duplicate observations whose permno and ticker are the same but only differ in date.

Execution explained:

First, we use .filter(c.ticker.str.contains('COM')) to filter out the rows where the ticker contains "COM". The .str.contains() method is used for string pattern matching in Polars.
Then, we select only two columns, permno and ticker, using .select([c.permno, c.ticker]). By this point, we have many duplicates because the same stock appears on multiple dates.
Lastly, we use .unique() to remove the duplicate rows, keeping only distinct combinations of permno and ticker.

How many stocks on each day have positive/negative returns?¶

Since we want to count the stocks on each day, We need to first group all stocks by date and then by the return sign (positive vs. negative). Next, we need to generate two counts: one for the up (positive return) and the other for the down (negative). We leverage the sign of retx (return) to determine the direction of price change.

# step 1: create a new column "direction" which indicates the direction of the price change
# `sign` returns 1 for positive values, 0 for 0, and -1 for negative values
data_q2 = data.with_columns(direction=c.retx.sign())

# step 2: count the stocks with +/- returns on each day
(
    data_q2
    .group_by(c.date, c.direction)
    .agg(num=c.permno.n_unique())
    .sort(c.date)
    .head(5)
)

Output Explained:

On each day we have three observations (rows), one for each direction (1, 0, and -1) of the price change. For example, on 2023-01-03, there’re 1844 stocks with positive return.

Execution Explained:

In Step 1, we create a new column direction to indicate whether we’re counting the up/down/no-move stocks for each day. Its value is 1, 0, or -1, depending on the sign of retx.
In Step 2, I group all observations by date (since we want statistics for each day) and then direction. .n_unique() is used to count the unique stock IDs within each group.

However, we don’t need to split the code explicitly. One great thing about Polars is that it can chain multiple computation steps together using method chaining. In the following code, I create the direction column inline within the grouping operation.

# chain the two steps together (same results)
(
    data
    .with_columns(direction=c.retx.sign())
    .filter(c.direction.is_not_null())
    .group_by(c.date, c.direction)
    .agg(num=c.permno.n_unique())
    .sort(c.date)
    .head(5)
)

How many stocks on each day have positive/negative returns (only select stocks with market capital larger than $100M)?¶

This question is very similar to the previous, the only change is to use cap to filter stocks by market capital. Note that cap is in 1000 dollars.

# we only add the condition "cap>=1e5" to the first step
# the rest is the same

(
    data
    .filter(c.cap >= 1e5)
    .with_columns(direction=c.retx.sign())
    .group_by(c.date, c.direction)
    .agg(num=c.permno.n_unique())
    .sort(c.date)
    .head(5)
)

How many stocks are there in each industry on each day?¶

(
    data
    .group_by(c.date, c.industry)
    .agg(num=c.permno.n_unique())
    # sort the result by `date` and `num`
    .sort(c.date, c.num)
    .head(3)
)

Execution Explained:

We first group by date and industry, then we count the unique stock IDs within each group using .n_unique()

On each day, is the industry with the most stocks also generates the largest trading volume?¶

Why This Matters:

Stock count and trading volume measure fundamentally different aspects of industry composition and market activity. An industry with many constituent stocks doesn’t necessarily dominate trading volume if those stocks are small-cap or illiquid names. This distinction is crucial for understanding market concentration versus market fragmentation. For instance, the technology sector might have fewer stocks but massive trading volume due to mega-caps like Apple and Microsoft, while industrials might have dozens of mid-sized companies with modest individual liquidity. This analysis helps portfolio managers assess whether an industry’s activity is concentrated in a few leaders (high volume, few stocks) or distributed across many participants (many stocks, dispersed volume).

(
    data
    # Step 1: Compute total trading volume and stock count for each industry on each day
    .group_by(c.date, c.industry)
    .agg(
        prcvol=c.prcvol.sum(),        # Total dollar trading volume for each industry
        num_stk=c.permno.n_unique()    # Number of unique stocks in each industry
    )
    .sort(c.date)
    # Step 2: Create indicator columns to identify industries with max stock count and max volume
    .with_columns(
        is_max_num_stk=(c.num_stk == c.num_stk.max().over(c.date)),  # True if this industry has the most stocks
        is_max_prcvol=(c.prcvol == c.prcvol.max().over(c.date))      # True if this industry has the most volume
    )
    # Step 3: Filter to keep only industries with the maximum number of stocks on each day
    .filter(c.is_max_num_stk)
    .head(5)
)

Execution Explained:

Step 1: We group by date and industry to aggregate stock-level data to the industry level. For each (date, industry) pair, we calculate:
- prcvol: Total dollar trading volume (sum of prcvol across all stocks in that industry)
- num_stk: Number of unique stocks (using permno.n_unique())
After sorting by date, we have one row per industry per day showing how many stocks it contains and its total trading volume.
Step 2: We use .over(c.date) to create two boolean indicator columns within each date group:
- is_max_num_stk: True if this industry has the maximum number of stocks on that day (comparing num_stk to the max across all industries on the same date)
- is_max_prcvol: True if this industry has the maximum trading volume on that day
These indicators help us identify which industry dominates in stock count versus trading volume.
Step 3: We filter to keep only rows where is_max_num_stk == True, which selects the industry with the most stocks on each day. The is_max_prcvol column then reveals whether this stock-heavy industry also generates the most trading volume.

The output shows that “Manufacturing” consistently has the most stocks and also generates the largest trading volume, suggesting both high stock count and high market activity concentration in this sector.

How many stocks on each day have a gain of more than 5% and a loss more than 5%?¶

Similar to the previous, the key is to group the stocks by date and the direction of change.

(
    data
    # filter stocks
    .filter((c.retx >= 0.05) | (c.retx <= -0.05))
    # create `direction` to indicate price change direction
    .with_columns(
        direction=pl.when(c.retx >= 0.05)
        .then(pl.lit('up5%+'))
        .otherwise(pl.lit('down5%-'))
    )
    # count the stocks
    .group_by(c.date, c.direction)
    .agg(num_stk=c.permno.n_unique())
    .sort(c.date)
    .head(5)
)

Execution Explained:

.filter((c.retx >= 0.05) | (c.retx <= -0.05)): Only select observations with the required retx range. This will reduce the data size for the next step, and hence speed up the execution.
.with_columns(direction=pl.when(c.retx >= 0.05).then('up5%+').otherwise('down5%-')): We use .when() and .then() to indicate the direction: if retx >= 0.05 is met, then direction is “up5%+”, otherwise, the value is “down5%-”.

What are the daily total trading volume of the top 10 stocks with the largest daily increase and the top 10 stocks with the largest daily decrease?¶

Why This Matters:

Trading volume on extreme movers reveals important liquidity and sentiment dynamics. When stocks experience large daily swings, analyzing their trading volume helps distinguish between genuine conviction-driven moves versus thin, low-liquidity moves. High volume on big gainers suggests broad-based buying interest (bullish confirmation), while high volume on big losers indicates capitulation or selling pressure. Conversely, extreme moves on low volume might be noise or technical artifacts. This analysis is crucial for momentum traders who need to identify which extreme moves are backed by sufficient liquidity to support entry/exit, and for risk managers assessing whether market stress (evidenced by high volume on losers) is concentrated or widespread.

(
    data
    # Step 1: Sort by date and return to rank stocks by performance within each day
    .sort(c.date, c.retx)
    # Step 2: Group by date and compute total trading volume for extreme movers
    .group_by(c.date, maintain_order=True)
    .agg(
        # Bottom 10: stocks with the largest daily decrease (most negative returns)
        bottom10=c.prcvol.head(10).sum(),
        # Top 10: stocks with the largest daily increase (most positive returns)
        top10=c.prcvol.tail(10).sum()
    )
    .head(5)
)

Execution Explained:

Step 1: We sort the data by date (ascending) and then by retx (ascending). This dual-level sorting is crucial: within each date, stocks are ordered from worst performers (most negative returns) to best performers (most positive returns). This ordering enables us to easily identify extreme movers using .head() and .tail() operations.
Step 2: We group by date with maintain_order=True to preserve the sorting from Step 1. Within each date group, we calculate:
- bottom10: Total trading volume of the 10 worst performers
  - c.prcvol.head(10) selects the first 10 trading volume values in the sorted group
  - Since we sorted by retx ascending, these correspond to the 10 stocks with the most negative returns
  - .sum() aggregates their trading volumes to get total volume on big losers
- top10: Total trading volume of the 10 best performers
  - c.prcvol.tail(10) selects the last 10 trading volume values in the sorted group
  - These correspond to the 10 stocks with the most positive returns
  - .sum() aggregates their trading volumes to get total volume on big gainers

The output reveals daily patterns in liquidity concentration among extreme movers. For example, on 2023-01-04, top gainers saw $1.24B in volume versus$ 415M on losers, suggesting more conviction behind upward moves than downward ones.

How many stocks increase by 5% at the time of open and close?¶

Why This Matters:

Distinguishing between overnight gaps and intraday moves reveals crucial information about how news is incorporated into prices. Overnight returns (from previous close to open) capture reactions to after-hours news releases—earnings announcements, macro events, or geopolitical developments that occur when markets are closed. Intraday returns (from open to close) reflect information that emerges during trading hours or the market’s digestion of overnight news. Counting 5%+ moves in each category helps identify whether significant price action is driven by scheduled announcements (typically overnight reactions) versus unexpected intraday shocks or momentum. For traders, this matters because overnight gaps present different risk/reward profiles than intraday trends: gaps can’t be traded into, have no defined stop-loss levels from the previous day’s range, and may signal strong conviction if they don’t fade during the session.

(
    data
    # Step 1: Sort by stock and date to enable time-series operations
    .sort(c.permno, c.date)
    # Step 2: Calculate previous day's close price for each stock
    .with_columns(
        pre_close=c.close.shift(1).over(c.permno)  # Shift close price by 1 day within each stock
    )
    # Step 3: Calculate overnight return (previous close to open)
    .with_columns(
        ret_open=c.open / c.pre_close - 1  # Overnight gap return
    )
    # Step 4: Count stocks with 5%+ gains for both overnight and intraday
    .group_by(c.date)
    .agg(
        num_open=(c.ret_open >= 0.05).sum(),   # Count stocks with 5%+ overnight gap
        num_close=(c.retx >= 0.05).sum()       # Count stocks with 5%+ total daily return
    )
    .sort(c.date)
    .head(5)
)

Execution Explained:

Step 1: We sort by permno and date to ensure data is chronologically ordered within each stock. This ordering is essential for the .shift() operation in Step 2 to correctly identify the previous day’s price for each stock.
Step 2: We create pre_close using .shift(1).over(c.permno), which retrieves the previous trading day’s close price for each stock. The .over(c.permno) ensures the shift happens independently within each stock’s time series (so stock A’s previous close doesn’t accidentally reference stock B’s price). The parameter 1 means we shift back exactly one period.
Step 3: We calculate the overnight return ret_open as (open / pre_close) - 1. This captures the percentage change from yesterday’s close to today’s open—representing how the stock gapped overnight before trading began. This isolates the market’s reaction to after-hours news.
Step 4: We group by date and count stocks meeting our 5% threshold using two metrics:
- num_open: Counts stocks with ret_open >= 0.05 (5%+ overnight gap)
  - c.ret_open >= 0.05 creates a boolean series (True/False for each stock)
  - .sum() counts the True values (Polars treats True as 1 and False as 0)
- num_close: Counts stocks with retx >= 0.05 (5%+ total daily return from previous close to today’s close)
  - Note that retx includes both overnight gap and intraday movement

The output shows a striking pattern: on 2023-01-03, zero stocks gapped up 5% overnight, but 250 stocks ended the day up 5%+, indicating strong intraday buying. On 2023-01-04, 56 stocks gapped up overnight while 496 finished up 5%+, showing both overnight momentum and continued intraday strength.

What’s the correlation coefficient between trading volume and return?¶

Why This Matters:

The relationship between trading volume and returns is fundamental to understanding market microstructure and information flow. A strong positive correlation would suggest that high volume accompanies upward price moves (buying pressure dominates), while a negative correlation would indicate high volume on downward moves (selling pressure). However, volume can be high on both up and down days—what matters is whether abnormal volume (deviations from typical levels) predicts returns. We examine two relationships: (1) raw volume vs. returns, which captures whether absolute activity levels correlate with price direction, and (2) volume changes vs. returns, which isolates whether surprises in liquidity demand signal informed trading.

(
    data
    # Step 1: Prepare volume metrics and sort for time-series operations
    .sort(c.permno, c.date)
    .with_columns(
        vol=c.prcvol,                                          # Raw trading volume
        vol_change=c.prcvol - c.prcvol.shift(1).over(c.permno) # Change in volume from previous day
    )
    .select([c.permno, c.vol, c.vol_change, c.retx, c.cap])
    .drop_nulls()  # Remove rows with missing values (first day of each stock)
    # Step 2: Calculate correlation for each individual stock
    .group_by(c.permno)
    .agg(
        corr1=pl.corr('vol', 'retx'),          # Correlation: raw volume vs. return
        corr2=pl.corr('vol_change', 'retx'),   # Correlation: volume change vs. return
        weight=c.cap.mean()                     # Average market cap as weight
    )
    .drop_nulls()
    .drop_nans()
    # Step 3: Compute market-cap weighted average of correlations
    .select(
        corr1=(c.corr1 * c.weight).sum() / c.weight.sum(),  # Weighted avg of corr1
        corr2=(c.corr2 * c.weight).sum() / c.weight.sum()   # Weighted avg of corr2
    )
)

Execution Explained:

Step 1: We prepare the data and create two volume metrics:
- First, we sort by permno and date to ensure chronological ordering within each stock, which is essential for the .shift() operation
- vol = c.prcvol: Raw dollar trading volume (no transformation)
- vol_change = c.prcvol - c.prcvol.shift(1).over(c.permno): Day-over-day change in trading volume
  - .shift(1) gets yesterday’s volume for each stock
  - .over(c.permno) ensures the shift happens independently within each stock’s time series
We select only the columns needed for correlation analysis (permno, vol, vol_change, retx, cap) and use .drop_nulls() to remove rows with missing values (which occur on the first trading day of each stock due to the shift operation).
Step 2: We calculate correlations at the individual stock level by grouping by permno:
- corr1 = pl.corr('vol', 'retx'): Correlation between raw volume and returns for each stock
  - Tests whether high absolute volume levels predict return direction
- corr2 = pl.corr('vol_change', 'retx'): Correlation between volume changes and returns for each stock
  - Tests whether abnormal volume (deviations from typical levels) predicts returns
- weight = c.cap.mean(): Average market capitalization for each stock, used as weights in Step 3
We drop nulls and NaNs to handle stocks with insufficient data for correlation calculation.
Step 3: We compute the market-cap weighted average of correlations across all stocks:
$\text{weighted corr} = \frac{\sum(\text{individual corr} \times \text{market cap})}{\sum(\text{market cap})}$
This weighting scheme ensures large-cap stocks (which dominate market activity) contribute more to the final correlation than small-caps. For example, Apple’s volume-return relationship matters more than a micro-cap’s.

The output shows both correlations are very weak (~0.05), suggesting trading volume has minimal linear relationship with stock returns. This implies that neither high absolute volume nor volume surges reliably predict price direction—volume is as likely to accompany gains as losses.

What’s the correlation between industry-level trading volume changes and return dispersion within industries?¶

Why This Matters:

When industry-wide trading volume spikes (e.g., due to regulatory news, earnings season, or macro shocks), do stocks within that industry move together or diverge? A strong positive correlation would suggest that increased trading activity accompanies investor disagreement - some stocks get bid up while others get sold off, creating return dispersion. This is valuable for:

For instance, during an interest rate hike announcement, banking sector volume might surge, but retail banks could rise 5% while investment banks fall 3%, creating high return dispersion despite sector-wide attention.

(
    data
    # Step 1: Group by date and industry to compute aggregates
    .group_by(c.date, c.industry)
    .agg(
        # Compute total trading volume for each industry on each day
        amount_ind=c.prcvol.sum(),
        # Compute return dispersion (std) for each industry on each day
        ret_std=c.retx.std()
    )
    # Step 2: Sort by industry and date (important for shift to work correctly!)
    .sort(c.industry, c.date)
    # Step 3: Calculate the percentage change in industry-level trading volume
    .with_columns(
        # shift(1) gets the previous day's volume for each industry
        amount_ind_change=(c.amount_ind / c.amount_ind.shift(1).over(c.industry) - 1)
    )
    .drop_nulls()
    # Step 4: Group by industry to compute correlation for each industry separately
    .group_by(c.industry)
    .agg(
        # Compute the correlation between volume change and return dispersion
        corr=pl.corr('amount_ind_change', 'ret_std')
    )
    .sort(c.corr, descending=True)
)

Execution Explained:

Step 1: We use .group_by(['date', 'industry']) to aggregate data to the (date, industry) level. This gives us one row per industry per day with:
- amount_ind: total dollar trading volume for that industry on that day
- ret_std: standard deviation of returns within that industry on that day (measuring return dispersion)
Step 2: We sort by ['industry', 'date'] to ensure the data is properly ordered for time-series operations. This is crucial for .shift() to work correctly.
Step 3: We calculate amount_ind_change using .shift(1).over('industry'), which gets yesterday’s volume for each industry. This gives us the percentage change in trading volume day-over-day.
Step 4: Finally, we group by industry to compute the correlation coefficient between amount_ind_change and ret_std for each industry across all time periods.

The result shows how strongly trading volume changes are associated with return dispersion within each industry.

What is the industry distribution of S&P 500 constituent stocks?¶

(
    data
    # Step 1: Filter to only include S&P 500 constituent stocks
    .filter(c.is_sp500 )
    # Step 2: Group by date and industry to count stocks in each industry
    .group_by(c.date, c.industry)
    .agg(
        # Count the number of unique stocks in each industry
        num_stocks=c.permno.n_unique()
    )
    # Step 3: Sort by date and number of stocks (descending)
    .sort(c.date, c.num_stocks, descending=[False, True])
    .head(10)
)

Execution Explained:

Step 1: We use .filter(c.is_sp500 == True) to select only the stocks that are S&P 500 constituents. The is_sp500 indicator is a boolean column that marks whether a stock is in the S&P 500 on a given day.
Step 2: We group by ['date', 'industry'] to aggregate stocks to the (date, industry) level. For each combination, we count the number of unique stocks using c.permno.n_unique(). This tells us how many stocks from each industry are in the S&P 500 on each day.
Step 3: We sort the results first by date (ascending) and then by num_stocks (descending) to see which industries have the most representation on each day. The .head(10) displays only the first 10 rows.

The output shows the industry composition of the S&P 500 over time. You can see which industries dominate the index and how this distribution might change.

If you want to see the industry distribution for a specific date (e.g., the most recent date), you can modify the code:

# Get the most recent date
latest_date = data.select(c.date.max()).item()

# Show industry distribution for the latest date
(
    data
    .filter((c.is_sp500 == True) & (c.date == latest_date))
    .group_by(c.industry)
    .agg(num_stocks=c.permno.n_unique())
    .sort(c.num_stocks, descending=True
)

This will give you a snapshot of how many stocks from each industry are in the S&P 500 on that particular day.

What’s the correlation matrix of daily returns across different industries?¶

Why This Matters:

The correlation matrix reveals Diversification Benefits: Industries with low or negative correlations provide better diversification - when one industry underperforms, another may offset losses. For example, technology and utilities often have lower correlations, making them good complements in a portfolio.

# Step 1: Calculate the industry-level weighted average return for each day
ret_ind = (
    data
    .group_by(c.industry, c.date)
    .agg(
        # Compute weighted average return using market cap as weights
        ret_ind=((c.retx * c.cap).sum() / c.cap.sum())
    )
    .sort(c.industry, c.date)
)

# Step 2: Pivot the data so each column represents one industry's returns
ret_ind_pivot = (
    ret_ind
    .pivot(
        index='date',      # Each row is a date
        on='industry',     # Each column is an industry
        values='ret_ind'   # Cell values are returns
    )
)

# Step 3: Compute the correlation matrix across all industries
corr_matrix = ret_ind_pivot.select(pl.all().exclude('date')).corr()

# Step 4: Add industry names as the first column for row labels
corr_matrix.insert_column(0, pl.Series('industry', corr_matrix.columns))

Execution Explained:

Step 1: We calculate industry-level returns by grouping by ['industry', 'date'] and computing the weighted average return for each industry on each day. We use market capitalization (cap) as weights: (retx * cap).sum() / cap.sum(). This gives more weight to larger companies within each industry, which better represents the industry’s overall performance.
Step 2: We use .pivot() to reshape the data from “long” format to “wide” format:
- Before pivot: Each row is an (industry, date) pair with columns [industry, date, ret_ind]
- After pivot: Each row is a date, and each column is an industry containing that industry’s return on that date
This transformation is necessary because correlation functions expect each column to represent a different variable (industry).
Step 3: We compute the correlation matrix using .corr(). We exclude the date column using .exclude('date') since we only want correlations between industries. The result is an N×N matrix where N is the number of industries, showing the pairwise correlation between each pair of industries.
Step 4: We add industry names as row labels by inserting a new column at position 0. We use .insert_column(0, pl.Series('industry', corr_matrix.columns)) to create an industry column containing the same names as the column headers. This makes the matrix much easier to read - now each row has a label showing which industry it represents.

The diagonal of the matrix is always 1.0 (an industry is perfectly correlated with itself). Off-diagonal values close to 1.0 indicate industries that tend to move together, while values close to 0 indicate independent movements.

Alternative: Chained Version

For a more compact implementation, you can chain the steps, though you’ll need to handle the row labels separately:

# Compute the correlation matrix
corr_matrix = (
    data
    # Step 1: Calculate industry-level weighted average return
    .group_by(c.industry, c.date)
    .agg(ret_ind=((c.retx * c.cap).sum() / c.cap.sum()))
    .sort(c.industry, c.date)
    # Step 2: Pivot to wide format
    .pivot(index='date', on='industry', values='ret_ind')
    # Step 3: Compute correlation matrix
    .select(pl.all().exclude('date'))
    .corr()
)

# Step 4: Add row labels
corr_matrix.insert_column(0, pl.Series('industry', corr_matrix.columns))

Which stocks have the largest weight in the S&P 500 index each day?¶

Why This Matters:

The S&P 500 is a market-cap-weighted index, meaning larger companies have disproportionate influence on index performance. Understanding which stocks dominate the index is crucial.

(
    data
    # Step 1: Filter to only S&P 500 constituent stocks
    .filter(c.is_sp500 == True)
    # Step 2: Calculate each stock's weight within the S&P 500 on each day
    # Weight = stock's market cap / total S&P 500 market cap
    .with_columns(
        sp500_weight=c.cap / c.cap.sum().over(c.date)
    )
    # Step 3: Sort by date (ascending) and weight (descending)
    .sort(c.date, c.sp500_weight, descending=[False, True])
    # Step 4: For each day, select the top 10 stocks by weight
    .group_by(c.date, maintain_order=True)
    .agg(
        # Get ticker, permno, and weight for the top 10 stocks
        ticker=c.ticker.head(10),
        permno=c.permno.head(10),
        sp500_weight=c.sp500_weight.head(10)
    )
    # Step 5: Explode to convert lists back to individual rows
    .explode(['ticker', 'permno', 'sp500_weight'])
    .head(20)  # Show first 20 rows (top 10 for first 2 days)
)

Execution Explained:

Step 1: We filter the data to include only S&P 500 constituent stocks using .filter(c.is_sp500 == True). This ensures we’re working only with stocks that are actually in the index.
Step 2: We calculate each stock’s weight within the S&P 500 using the formula:
$\text{weight}_i = \frac{\text{market cap}_i}{\sum_{j \in \text{S\&P 500}} \text{market cap}_j}$
We use .over('date') to compute the total S&P 500 market cap for each day separately. This gives us the proportion of the index that each stock represents.
Step 3: We sort by date (ascending, so chronological order) and sp500_weight (descending, so largest weights first). This prepares the data for selecting the top 10 on each day.
Step 4: We group by date with maintain_order=True to preserve the sorting from Step 3. Then we use .head(10) to select the top 10 stocks for each day. We extract:
- ticker: the stock ticker symbol
- permno: the permanent stock identifier
- sp500_weight: the calculated weight
At this point, each date has a single row with three list columns containing 10 values each.
Step 5: We use .explode() to convert the lists back into individual rows. Each of the 10 stocks gets its own row, making the output easier to read. We show the first 20 rows (top 10 stocks for the first 2 days).

The output reveals which companies dominate the index and how this composition changes over time.

Interpreting the Weights:

The sp500_weight column shows the percentage of the index represented by each stock. For example:

A weight of 0.07 means the stock comprises 7% of the S&P 500
If this stock rises 10%, it contributes +0.7% to the index return
The top 10 stocks’ weights typically sum to 25-30% of the total index

You can also compute the total weight of the top 10 for each day:

(
    data
    .filter(c.is_sp500 == True)
    .with_columns(sp500_weight=c.cap / c.cap.sum().over(c.date))
    .sort(c.date, c.sp500_weight, descending=[False, True])
    .group_by(c.date, maintain_order=True
    .agg(
        top10_total_weight=c.sp500_weight.head(10).sum()
    )
    .head(10)
)

This shows how concentrated the index has become over time.

Which stock has the highest trading volume in each industry on each day?¶

Why This Matters:

Identifying the most heavily traded stock in each industry reveals market leadership and investor attention. High trading volume typically signals strong investor interest - whether bullish or bearish - and often indicates stocks that are price discoverers for their sector. These volume leaders tend to have the most liquid options markets, tightest bid-ask spreads, and serve as bellwethers for sector sentiment. For traders, these are the stocks where large positions can be established or unwound with minimal market impact. For analysts, tracking which stocks dominate intra-industry volume can reveal shifts in competitive dynamics (e.g., Tesla surpassing GM in auto trading volume signaled a fundamental shift in investor focus from traditional to electric vehicles).

(
    data
    # Step 1: Find the maximum trading volume for each (industry, date) combination
    # This creates a new column with the max value broadcast to all rows in the group
    .with_columns(
        max_volume=c.prcvol.max().over(c.industry, c.date)
    )
    # Step 2: Filter to keep only the stock(s) with the maximum volume in each group
    .filter(c.prcvol == c.max_volume)
    # Step 3: Select relevant columns for the output
    .select([
        c.date,
        c.industry,
        c.ticker,
        c.permno,
        c.prcvol.alias('trading_volume')
    ])
    # Step 4: Sort by date and industry for readability
    .sort(c.date, c.industry)
    .head(3)  # Show first 3 rows
)

Execution Explained:

Step 1: We use .over(['industry', 'date']) to compute the maximum trading volume (prcvol) within each (industry, date) group. This creates a new column max_volume where each row gets the max value for its group. For example, all stocks in the “BANKS” industry on 2012-01-04 will have the same max_volume value - the highest trading volume among all bank stocks on that day.
Step 2: We filter the data to keep only rows where prcvol == max_volume. This selects the stock(s) with the highest trading volume in each industry on each day. Note that if multiple stocks tie for the highest volume (rare but possible), all will be included.
Step 3: We select the relevant columns for a clean output: date, industry, ticker (stock symbol), permno (permanent identifier), and trading_volume (the actual volume value, renamed for clarity).
Step 4: We sort by date and industry to make the results easy to scan chronologically and alphabetically by sector. The .head(20) limits the output to the first 20 rows for display purposes.

The result shows which stock was most actively traded in each industry on each day, revealing where investor attention was focused within each sector.

Analyzing a Specific Industry:

If you want to track the volume leader for a specific industry over time (e.g., technology), you can add a filter:

(
    data
    # Focus on a specific industry
    .filter(c.industry == 'SOFTWARE')
    # Step 1: Find max volume for each date in this industry
    .with_columns(max_volume=c.prcvol.max().over(c.date))
    # Step 2: Keep only the volume leader
    .filter(c.prcvol == c.max_volume)
    # Step 3: Select columns
    .select([c.date, c.ticker, c.permno, c.prcvol.alias('trading_volume')])
    .sort(c.date)
    .head(10)
)

This shows how volume leadership evolves within a single industry, which can reveal changing investor preferences or competitive dynamics over time.

What is the ratio of top 10% to bottom 10% trading volume within each industry on each day?¶

Why This Matters:

The 90th-to-10th percentile volume ratio within an industry measures dispersion in investor attention across stocks in the same sector, providing a more robust metric than max/min which can be distorted by extreme outliers. A high ratio (e.g., 50x) indicates significant concentration where highly-traded stocks receive far more attention than the typical lower-volume stocks, suggesting winners-take-all dynamics or a “halo effect” around market leaders. Conversely, a low ratio (e.g., 3x) indicates more balanced attention across the industry. Using percentiles instead of absolute extremes gives a more stable measure that’s less sensitive to occasional zero-volume days or unusual spikes from corporate events.

(
    data
    # Step 1: Group by date and industry to compute statistics within each group
    .group_by(c.date, c.industry)
    .agg(
        # Step 2: Calculate the 90th percentile (top 10%) and 10th percentile (bottom 10%)
        # Then compute their ratio for a robust measure of concentration
        p90=c.prcvol.quantile(0.90),
        p10=c.prcvol.quantile(0.10),
    )
    # Step 3: Calculate the ratio of 90th to 10th percentile
    .with_columns(
        volume_ratio=(c.p90 / c.p10)
    )
    # Step 4: Sort by date and industry for readability
    .sort(c.date, c.industry)
    # Step 5: Filter out infinite and null values
    .filter(c.volume_ratio.is_finite())
    .head(3)  # Show first 3 rows
)

Execution Explained:

Step 1: We group by ['date', 'industry'] to compute metrics for each industry on each day. This allows us to compare stocks within the same industry on the same day.
Step 2: Within each group, we calculate two percentiles using .quantile():
- p90: The 90th percentile - 90% of stocks have volume below this value (representing “high volume” stocks)
- p10: The 10th percentile - 10% of stocks have volume below this value (representing “low volume” stocks)
These percentiles are more robust than max/min because they’re not affected by extreme outliers.
Step 3: We calculate volume_ratio = p90 / p10. For example, a ratio of 20 means that stocks in the top 10% by volume trade about 20 times as much as stocks in the bottom 10%.
Step 4: We sort by date and industry to organize the results chronologically and alphabetically by sector.
Step 5: We filter for finite values using .is_finite(), which removes both infinite and null values. This ensures we only show valid ratios.

High ratios indicate concentrated liquidity where highly-traded stocks receive far more attention, while low ratios suggest more balanced attention across the industry. The percentile-based approach provides a more stable metric than using absolute max/min values.

Why Percentiles Are More Robust:

Using the 90th/10th percentile ratio instead of max/min provides several advantages:

Outlier Resistance: A single stock with zero volume (market halt, delisting) would make max/min infinite, but percentiles remain stable
Corporate Event Immunity: Unusual spikes from earnings, M&A announcements, or short squeezes don’t distort the metric
Consistent Interpretation: The ratio represents a consistent segment of the distribution (top 10% vs bottom 10%), making comparisons across industries and time more meaningful
Better Trend Detection: Percentile-based metrics show gradual changes in concentration patterns that max/min ratios might obscure with noise.

What is the average return of high-volume stocks (above 80th percentile) within each industry on each day?¶

Why This Matters:

Trading volume often signals informed trading or strong conviction. Stocks with unusually high trading volume within their industry may be experiencing newsflow, analyst upgrades, or institutional accumulation/distribution. By comparing the returns of high-volume stocks (top 20% by volume) to the broader industry, we can test whether elevated trading activity predicts abnormal returns. This helps identify whether high volume is a bullish signal (smart money accumulating) or bearish (dumb money buying tops).

(
    data
    # Step 1: Calculate the 80th percentile of trading volume for each (date, industry) group
    .with_columns(
        # Get the 80th percentile volume within each industry on each day
        volume_p80=c.prcvol.quantile(0.8).over(c.date, c.industry)
    )
    # Step 2: Filter to keep only stocks with volume above the 80th percentile
    # These are the "high-volume" stocks within their industry
    .filter(c.prcvol > c.volume_p80)
    # Step 3: Group by date and industry to compute average return
    .group_by(c.date, c.industry)
    .agg(
        # Calculate the mean return of high-volume stocks
        # We use retx which is the daily return already calculated in the dataset
        avg_return=c.retx.mean()
    )
    # Step 4: Sort by date and industry for readability
    .sort(c.date, c.industry)
    .head(3)  # Show first 3 rows
)

Execution Explained:

Step 1: We use .with_columns() to add a new column volume_p80 that contains the 80th percentile of trading volume (prcvol) for each (date, industry) combination. The .over(['date', 'industry']) ensures the percentile is calculated separately for each industry on each day. This value is broadcast to all rows in the group.
Step 2: We filter to keep only stocks where prcvol > volume_p80, meaning we select only the top 20% of stocks by trading volume within each industry on each day. These are the “high-volume” stocks that are attracting more attention than their industry peers.
Step 3: We group by ['date', 'industry'] and calculate the average return (retx.mean()) of these high-volume stocks. This gives us one average return value for each industry on each day, representing the performance of the most-traded stocks in that sector.
Step 4: We sort by date and industry for easy reading and display the first 3 rows.

The result shows whether stocks with elevated trading volume (top 20%) in each industry tend to have positive or negative returns, which can reveal volume-return dynamics specific to each sector.

Comparing High-Volume vs Low-Volume Returns:

To test whether high-volume stocks outperform or underperform, you can compare them to low-volume stocks:

(
    data
    # Step 1: Calculate volume percentiles
    .with_columns(
        volume_p80=c.prcvol.quantile(0.8).over(c.date, c.industry),
        volume_p20=c.prcvol.quantile(0.2).over(c.date, c.industry),
        volume_group=pl.when(c.prcvol >= c.prcvol.quantile(0.8).over(c.date, c.industry))
            .then(pl.lit('high_volume'))
            .when(c.prcvol <= c.prcvol.quantile(0.2).over(c.date, c.industry))
            .then(pl.lit('low_volume'))
            .otherwise(pl.lit('medium'))
    )
    # Step 2: Filter to high and low volume groups
    .filter(c.volume_group.is_in(['high_volume', 'low_volume']))
    # Step 3: Calculate average returns by group
    .group_by(c.date, c.industry, c.volume_group)
    .agg(avg_return=c.retx.mean())
    # Step 4: Pivot to compare side-by-side
    .pivot(index=['date', 'industry'], on='volume_group', values='avg_return')
    .head(5)
)

This shows whether high-volume stocks (top 20%) systematically outperform or underperform low-volume stocks (bottom 20%) within each industry, revealing whether trading activity is a signal of informed or uninformed trading.

What is the correlation between average returns of high-volume stocks (top 10%) and low-volume stocks (bottom 10%) across days?¶

Why This Matters:

This question tests the “liquidity premium” hypothesis and market segmentation. If high-volume stocks and low-volume stocks move together (high positive correlation), it suggests broad market-wide forces dominate (systematic risk). If they move independently or inversely (low or negative correlation), it suggests that high-volume stocks (typically large-caps with institutional interest) and low-volume stocks (typically small-caps or neglected names) respond to different factors - perhaps smart money rotates between them, or they represent different risk exposures.

# Step 1: Calculate the 90th and 10th percentile of trading volume for each day
# and classify stocks into high-volume (top 10%) or low-volume (bottom 10%) groups
data_with_volume_groups = (
    data
    .with_columns(
        # Get the 90th percentile (top 10% threshold) for each day
        volume_p90=c.prcvol.quantile(0.9).over(c.date),
        # Get the 10th percentile (bottom 10% threshold) for each day
        volume_p10=c.prcvol.quantile(0.1).over(c.date)
    )
    # Create a volume group indicator
    .with_columns(
        volume_group=pl.when(c.prcvol >= c.volume_p90)
            .then(pl.lit('high_volume'))
            .when(c.prcvol <= c.volume_p10)
            .then(pl.lit('low_volume'))
            .otherwise(pl.lit('medium'))
    )
)

# Step 2: Calculate the average return for each volume group on each day
daily_returns_by_volume = (
    data_with_volume_groups
    # Filter to only high-volume and low-volume groups
    .filter(c.volume_group.is_in(['high_volume', 'low_volume']))
    # Group by date and volume group
    .group_by(c.date, c.volume_group)
    .agg(
        # Calculate average return for each group
        avg_return=c.retx.mean()
    )
    # Pivot to get one column for each volume group
    .pivot(index='date', on='volume_group', values='avg_return')
    .sort(c.date)
)

# Show first 3 days
daily_returns_by_volume.head(3)

Output Explained:

The table shows the average daily return for two groups of stocks:

high_volume: The average return of the top 10% most-traded stocks each day
low_volume: The average return of the bottom 10% least-traded stocks each day

Each row represents one trading day, allowing us to compare how these two groups performed on the same day.

# Step 3: Calculate the correlation between high-volume and low-volume returns
correlation = daily_returns_by_volume.select(
    pl.corr('high_volume', 'low_volume').alias('correlation')
)

correlation

Execution Explained:

Step 1: We identify the top 10% and bottom 10% of stocks by trading volume on each day using .quantile(0.9) and .quantile(0.1). The .over(c.date) ensures these percentiles are calculated separately for each day. We then classify each stock into 'high_volume', 'low_volume', or 'medium' groups using conditional logic.
Step 2: We filter to keep only the high-volume and low-volume groups, then group by ['date', 'volume_group'] to calculate the average return (retx.mean()) for each group on each day. The .pivot() operation reshapes the data so that each volume group becomes a separate column, making it easy to compare them side-by-side.
Step 3: We calculate the Pearson correlation coefficient between the high_volume and low_volume columns across all days. This single number tells us how strongly the returns of these two groups move together over time.

Chained Version:

For those who prefer a more compact approach, here’s the same analysis in a single chain:

(
    data
    # Step 1: Calculate volume percentiles and classify stocks
    .with_columns(
        volume_p90=c.prcvol.quantile(0.9).over(c.date),
        volume_p10=c.prcvol.quantile(0.1).over(c.date)
    )
    .with_columns(
        volume_group=pl.when(c.prcvol >= c.volume_p90)
            .then(pl.lit('high_volume'))
            .when(c.prcvol <= c.volume_p10)
            .then(pl.lit('low_volume'))
            .otherwise(pl.lit('medium'))
    )
    # Step 2: Calculate average returns by volume group
    .filter(c.volume_group.is_in(['high_volume', 'low_volume']))
    .group_by(c.date, c.volume_group)
    .agg(avg_return=c.retx.mean())
    .pivot(index='date', on='volume_group', values='avg_return')
    # Step 3: Calculate correlation
    .select(pl.corr('high_volume', 'low_volume').alias('correlation'))
)

Which industries have above-market trading volume on each day?¶

Why This Matters:

Industries with above-average trading volume signal heightened investor attention and potential sector rotation. This can indicate newsflow (earnings, regulatory changes, macroeconomic events), momentum shifts, or institutional rebalancing. Identifying “hot sectors” helps traders spot where capital is flowing.

(
    data
    # Step 1: Calculate two averages for comparison
    .with_columns(
        # Market-wide average: mean trading volume across ALL stocks on each day
        volume_market_avg=c.prcvol.mean().over(c.date),
        # Industry average: mean trading volume within each industry on each day
        volume_industry_avg=c.prcvol.mean().over(c.date, c.industry)
    )
    # Step 2: Filter to keep only industries where the industry average exceeds market average
    # This identifies "hot" industries with above-average trading activity
    .filter(c.volume_industry_avg > c.volume_market_avg)
    # Step 3: Select only the relevant columns
    .select(c.date, c.industry, c.volume_industry_avg, c.volume_market_avg)
    # Step 4: Remove duplicate (date, industry) combinations
    # since multiple stocks in the same industry will have the same values
    .unique()
    # Step 5: Sort by date and industry for readability
    .sort(c.date, c.industry)
    .head(3)  # Show first 3 rows
)

Execution Explained:

Step 1: We calculate two different averages using .over():
- volume_market_avg: The average trading volume (prcvol) across all stocks on each day. This value is the same for every stock on the same day.
- volume_industry_avg: The average trading volume within each industry on each day. This value is the same for all stocks in the same industry on the same day, but differs across industries.
Step 2: We filter to keep only rows where volume_industry_avg > volume_market_avg. This means we’re selecting industries that are trading more actively (per stock) than the overall market average.
Step 3-4: Since every stock in an industry will have the same volume_industry_avg value, we use .select() to keep only date and industry columns, then .unique() to remove duplicates and get one row per (date, industry) combination.
Step 5: We sort by date and industry to make the output easy to read, showing which industries were “hot” (above-average volume) on each day.

Extended Analysis: Volume Ratio

To quantify how much more active each industry is, we can calculate the ratio of industry volume to market volume:

(
    data
    # Step 1: Calculate market and industry averages
    .with_columns(
        volume_market_avg=c.prcvol.mean().over(c.date),
        volume_industry_avg=c.prcvol.mean().over(c.date, c.industry)
    )
    # Step 2: Calculate the ratio to show how much more active each industry is
    .with_columns(
        volume_ratio=(c.volume_industry_avg / c.volume_market_avg)
    )
    # Step 3: Keep only above-average industries
    .filter(c.volume_ratio > 1.0)
    # Step 4: Get unique (date, industry) pairs with their ratio
    .select([c.date, c.industry, c.volume_ratio])
    .unique()
    # Step 5: Sort by date and ratio (highest first) to see the "hottest" industries
    .sort(c.date, c.volume_ratio, descending=[False, True])
    .head(10)  # Show top 10
)

This extended version calculates volume_ratio = industry_avg / market_avg. A ratio of:

2.0 means the industry trades at 2x the market average volume per stock
1.5 means 50% more active than average
1.0 means exactly at market average
< 1.0 means below-average activity

By sorting by volume_ratio in descending order, we identify not just which industries are above average, but which are the most above average - the true market leaders in terms of trading activity.

What is the excess return (alpha) of each stock relative to the market?¶

Why This Matters:

Excess return (or “alpha”) measures a stock’s performance beyond what would be expected given its exposure to the market. This is the core concept in the Capital Asset Pricing Model (CAPM) and is fundamental to active portfolio management. By regressing each stock’s return against the market return, we decompose returns into two components: (1) systematic return explained by market movements (beta × market return), and (2) idiosyncratic return not explained by the market (alpha). Positive alpha suggests a stock outperformed the market after adjusting for risk, while negative alpha suggests underperformance. Alpha is what active managers are paid to generate - returns that can’t be achieved simply by holding the market index.

import polars_ds as pds

(
    data
    # Step 1: Calculate stock returns and market-cap weights for each stock on each day
    .with_columns(
        # Stock return: we use the existing retx column
        stock_return=c.retx,
        # Weight: each stock's market cap divided by total market cap on that day
        stock_weight=(c.cap / c.cap.sum().over(c.date))
    )
    # Step 2: Calculate the value-weighted market return for each day
    .with_columns(
        # Market return = weighted average of all stock returns
        market_return=(c.stock_return * c.stock_weight).sum().over(c.date)
    )
    # Step 3: Filter to keep only stocks with sufficient observations
    # We need at least 30 observations for meaningful regression results
    .with_columns(
        num_obs=c.permno.count().over(c.permno)
    )
    .filter(c.num_obs >= 30)
    # Step 4: Run linear regression for each stock: stock_return ~ market_return
    # This regresses each stock's returns against the market return across all dates
    .select(
        c.permno,
        c.ticker,
        c.date,
        c.stock_return,
        c.market_return,
        # Use polars_ds to run linear regression for each stock (grouped by permno)
        pds.lin_reg(
            c.market_return,              # Independent variable (X)
            target=c.stock_return,        # Dependent variable (Y)
            add_bias=True,                # Include intercept (alpha)
            return_pred=True              # Return predictions and residuals
        )
        .over(c.permno)                   # Run separate regression for each stock
        .alias("regression_results")      # Name the results column
    )
    # Step 5: Unnest the regression results to extract coefficients and residuals
    .unnest("regression_results")
    # Step 6: Select the relevant columns
    # The "resid" column contains the excess returns (alpha)
    .select(
        c.permno,
        c.ticker,
        c.date,
        c.resid.alias("excess_return")    # Residual = alpha (excess return)
    )
    .head(3)  # Show first 3 rows
)

Execution Explained:

Step 1: We calculate each stock’s return (using the pre-calculated retx column) and its weight in the market. The weight is the stock’s market capitalization divided by the total market cap on that day. This is broadcast to all stocks on each day using .over('date').
Step 2: We calculate the value-weighted market return by taking the sum of (stock_return × stock_weight) for all stocks on each day. This represents the overall market performance, weighted by market cap (similar to the S&P 500 index calculation).
Step 3: We use pds.lin_reg() to run a linear regression for each stock. The regression equation is:
```
stock_return = alpha + beta × market_return + error
```
- The beta coefficient captures the stock’s sensitivity to market movements (systematic risk)
- The alpha (intercept) captures the average excess return not explained by the market
- The residuals (errors) are the day-by-day excess returns (actual return - predicted return)
The .over(c.permno) ensures we run a separate regression for each stock across all its historical dates.
Step 4: The unnest("regression_results") extracts the structured output from lin_reg(), which includes columns like pred (predicted returns), resid (residuals), and potentially coefficients.
Step 5: We select the resid column and rename it to excess_return. These residuals represent the portion of each stock’s return that cannot be explained by market movements - this is the stock’s alpha or idiosyncratic return on each day.

Interpreting the Results:

Positive excess_return: The stock outperformed what would be expected given the market’s movement that day
Negative excess_return: The stock underperformed relative to its expected return based on the market
Values close to 0: The stock moved in line with the market, after accounting for its beta

Extended Analysis: Extracting Beta and Alpha Coefficients

The residuals give us day-by-day excess returns, but we can also extract the regression coefficients (beta and alpha) for each stock:

# Extract beta and alpha coefficients for each stock
stock_betas = (
    data
    # Step 1: Calculate returns and weights
    .with_columns(
        stock_return=c.retx,
        stock_weight=(c.cap / c.cap.sum().over(c.date))
    )
    # Step 2: Calculate market return
    .with_columns(
        market_return=(c.stock_return * c.stock_weight).sum().over(c.date)
    )
    # Step 3: Filter to keep only stocks with sufficient observations
    # We need at least 30 observations for meaningful regression results
    .with_columns(
        num_obs=c.permno.count().over(c.permno)
    )
    .filter(c.num_obs >= 30)
    # Step 4: Run regression and extract coefficients
    .group_by(c.permno)
    .agg(
        ticker=c.ticker.first(),
        num_obs=c.num_obs.first(),
        # Run regression to get coefficients (returns a list: [intercept, slope])
        coef_list=pds.lin_reg(
            c.market_return,
            target=c.stock_return,
            add_bias=True,
            return_pred=False  # We only need coefficients, not predictions
        ).first()
    )
    # Step 5: Extract beta (slope) and alpha (intercept) from the coefficient list
    .with_columns(
        alpha=c.coef_list.list.get(0),      # Alpha is the intercept (first element)
        beta=c.coef_list.list.get(1)        # Beta is the slope coefficient (second element)
    )
    .select(
        c.permno,
        c.ticker,
        c.num_obs,
        c.beta,
        c.alpha
    )
)

stock_betas.head(10)

Understanding Beta and Alpha:

Beta: Measures the stock’s sensitivity to market movements
- Beta = 1.0: Stock moves in line with the market (e.g., if market goes up 1%, stock goes up 1%)
- Beta > 1.0: Stock is more volatile than the market (amplifies market moves)
- Beta < 1.0: Stock is less volatile than the market (defensive stock)
- Beta < 0: Stock moves opposite to the market (rare, e.g., inverse ETFs or gold stocks)
Alpha: Measures the stock’s average excess return (annualized)
- Alpha > 0: Stock consistently outperforms the market after adjusting for risk
- Alpha < 0: Stock consistently underperforms
- Alpha ≈ 0: Stock performs as expected given its beta

Important Note on Data Requirements:

The code filters out stocks with fewer than 30 observations because linear regression requires sufficient data to produce reliable estimates. With too few observations, the regression is either impossible (if observations < features) or produces unstable coefficients with large standard errors. The 30-observation threshold is a practical minimum - in real applications, you’d typically want 60+ observations (e.g., 5 years of monthly data or 3 months of daily data).

Applications:

Portfolio construction: Combine high-alpha stocks to beat the market
Risk management: Use beta to understand portfolio sensitivity to market crashes
Performance attribution: Separate skill (alpha) from luck (beta exposure)

What is the excess return (alpha) of each stock relative to its industry?¶

Why This Matters:

While market-adjusted returns (previous question) show stock performance relative to the overall market, industry-adjusted returns reveal whether a stock is outperforming or underperforming its industry peers. This is crucial because stocks within the same industry tend to move together due to common factors (regulatory changes, commodity prices, demand shifts). By regressing each stock’s return against its industry return, we isolate stock-specific performance from sector-wide movements. This helps identify the best and worst performers within each sector, which is essential for sector-neutral strategies, stock picking within industries, and understanding whether a stock’s gains are due to industry tailwinds or genuine company outperformance.

import polars_ds as pds

(
    data
    # Step 1: Calculate stock returns and industry-level market-cap weights
    .with_columns(
        # Stock return: we use the existing retx column
        stock_return=c.retx,
        # Weight: each stock's market cap divided by total industry market cap on that day
        stock_weight_in_industry=(c.cap / c.cap.sum().over(c.date, c.industry))
    )
    # Step 2: Calculate the value-weighted industry return for each (date, industry)
    .with_columns(
        # Industry return = weighted average of all stock returns within that industry
        industry_return=(c.stock_return * c.stock_weight_in_industry).sum().over(c.date, c.industry)
    )
    # Step 3: Filter to keep only stocks with sufficient observations
    # We need at least 30 observations for meaningful regression results
    .with_columns(
        num_obs=c.permno.count().over(c.permno)
    )
    .filter(c.num_obs >= 30)
    # Step 4: Run linear regression for each stock: stock_return ~ industry_return
    # This regresses each stock's returns against its industry return across all dates
    .select(
        c.permno,
        c.ticker,
        c.date,
        c.industry,
        c.stock_return,
        c.industry_return,
        # Use polars_ds to run linear regression for each stock (grouped by permno)
        pds.lin_reg(
            c.industry_return,            # Independent variable (X)
            target=c.stock_return,        # Dependent variable (Y)
            add_bias=True,                # Include intercept (alpha)
            return_pred=True              # Return predictions and residuals
        )
        .over(c.permno)                   # Run separate regression for each stock
        .alias("regression_results")      # Name the results column
    )
    # Step 5: Unnest the regression results to extract coefficients and residuals
    .unnest("regression_results")
    # Step 6: Select the relevant columns
    # The "resid" column contains the industry-adjusted excess returns (alpha)
    .select(
        c.permno,
        c.ticker,
        c.date,
        c.industry,
        c.resid.alias("excess_return_vs_industry")    # Residual = industry-adjusted alpha
    )
    .head(3)  # Show first 3 rows
)

Execution Explained:

Step 1: We calculate each stock’s return (using the pre-calculated retx column) and its weight within its industry. The weight is the stock’s market cap divided by the total industry market cap on that day (not the overall market cap). This ensures we’re comparing the stock to its industry peers.
Step 2: We calculate the value-weighted industry return by taking the sum of (stock_return × stock_weight_in_industry) for all stocks in each industry on each day. This represents the performance of the industry as a whole, weighted by market cap.
Step 3: We filter to keep only stocks with at least 30 observations to ensure stable regression results.
Step 4: We use pds.lin_reg() to run a linear regression for each stock. The regression equation is:
```
stock_return = alpha + beta × industry_return + error
```
- The beta coefficient measures the stock’s sensitivity to its industry (industry beta)
- The alpha (intercept) captures the average excess return relative to the industry
- The residuals (errors) are the day-by-day industry-adjusted returns
The .over(c.permno) ensures we run a separate regression for each stock across all its historical dates.
Step 5-6: We extract the residuals, which represent the portion of each stock’s return that cannot be explained by industry movements - this is the stock’s industry-adjusted alpha.

Interpreting the Results:

Positive excess_return_vs_industry: The stock outperformed its industry peers that day
Negative excess_return_vs_industry: The stock underperformed relative to its industry
Values close to 0: The stock moved in line with its industry

Market-Adjusted vs. Industry-Adjusted Returns:

Understanding the difference between these two measures is crucial for investment analysis:

Metric	What it measures	Use case
Market-adjusted return	Stock performance vs. overall market	Identify stocks that beat/lag the broad market
Industry-adjusted return	Stock performance vs. industry peers	Identify best/worst performers within each sector

Key Insight: A stock can have:

Positive market-adjusted return but negative industry-adjusted return → Stock beat the market, but underperformed its industry (the industry itself is booming)
Negative market-adjusted return but positive industry-adjusted return → Stock lagged the market, but outperformed its struggling industry (best in a bad sector)

Applications:

Sector-neutral strategies: Build portfolios that are neutral to sector exposures but long the best stocks and short the worst within each sector
Stock selection: Identify companies with genuine competitive advantages, not just riding industry tailwinds
Performance attribution: Separate sector allocation effects from stock selection effects in portfolio performance

What is the correlation between market-adjusted and industry-adjusted excess returns for each stock?¶

Why This Matters:

The correlation between market-adjusted and industry-adjusted excess returns reveals important information about a stock’s return structure. If these two measures are highly correlated, it means that when a stock outperforms the market, it also tends to outperform its industry (and vice versa) - suggesting the stock has consistent stock-specific factors driving its performance across different benchmarks. For example, a stock might beat its industry but lag the overall market when its industry is struggling. This correlation helps investors understand whether a stock’s alpha is robust across different reference points and informs how to construct hedging strategies (should you hedge market risk, industry risk, or both?).

import polars_ds as pds

# Step 1: Calculate market-adjusted excess returns for each stock
data_mkt = (
    data
    # Calculate stock returns and market-cap weights
    .with_columns(
        stock_return=c.retx,
        stock_weight=(c.cap / c.cap.sum().over(c.date))
    )
    # Calculate value-weighted market return
    .with_columns(
        market_return=(c.stock_return * c.stock_weight).sum().over(c.date)
    )
    # Filter stocks with sufficient observations
    .with_columns(num_obs=c.permno.count().over(c.permno))
    .filter(c.num_obs >= 30)
    # Run regression to get market-adjusted residuals
    .select(
        c.permno,
        c.ticker,
        c.date,
        c.stock_return,
        c.market_return,
        pds.lin_reg(
            c.market_return,
            target=c.stock_return,
            add_bias=True,
            return_pred=True
        )
        .over(c.permno)
        .alias("regression_results")
    )
    .unnest("regression_results")
    .select(
        c.permno,
        c.ticker,
        c.date,
        c.resid.alias("market_adjusted_alpha")
    )
)

# Step 2: Calculate industry-adjusted excess returns for each stock
data_ind = (
    data
    # Calculate stock returns and industry-level market-cap weights
    .with_columns(
        stock_return=c.retx,
        stock_weight_in_industry=(c.cap / c.cap.sum().over(c.date, c.industry))
    )
    # Calculate value-weighted industry return
    .with_columns(
        industry_return=(c.stock_return * c.stock_weight_in_industry).sum().over(c.date, c.industry)
    )
    # Filter stocks with sufficient observations
    .with_columns(num_obs=c.permno.count().over(c.permno))
    .filter(c.num_obs >= 30)
    # Run regression to get industry-adjusted residuals
    .select(
        c.permno,
        c.ticker,
        c.date,
        c.industry,
        c.stock_return,
        c.industry_return,
        pds.lin_reg(
            c.industry_return,
            target=c.stock_return,
            add_bias=True,
            return_pred=True
        )
        .over(c.permno)
        .alias("regression_results")
    )
    .unnest("regression_results")
    .select(
        c.permno,
        c.ticker,
        c.date,
        c.industry,
        c.resid.alias("industry_adjusted_alpha")
    )
)

# Step 3: Join the two datasets and calculate correlation for each stock
correlation_results = (
    data_mkt.join(
        data_ind,
        on=['permno', 'ticker', 'date'],
        how='inner'
    )
    # Calculate correlation between market-adjusted and industry-adjusted alphas
    .group_by(c.permno, c.ticker)
    .agg(
        industry=c.industry.first(),
        correlation=pl.corr(c.market_adjusted_alpha, c.industry_adjusted_alpha)
    )
    .sort(c.correlation)
    .drop_nans()
    .head(5)  # Show first 5 stocks
)

correlation_results

Execution Explained:

This analysis involves three major steps:

Step 1: Calculate Market-Adjusted Excess Returns

We compute each stock’s return and its weight in the overall market
Calculate the value-weighted market return for each day
Filter stocks with at least 30 observations
Run linear regression: stock_return ~ market_return for each stock
Extract the residuals, which represent market-adjusted alphas (excess returns relative to the market)

Step 2: Calculate Industry-Adjusted Excess Returns

We compute each stock’s return and its weight within its industry
Calculate the value-weighted industry return for each day
Filter stocks with at least 30 observations
Run linear regression: stock_return ~ industry_return for each stock
Extract the residuals, which represent industry-adjusted alphas (excess returns relative to the industry)

Step 3: Calculate Correlation

Join the two datasets on permno, ticker, and date to align the daily alphas
For each stock, calculate the correlation between its market-adjusted alphas and industry-adjusted alphas across all days
This correlation tells us how consistently the two measures move together for each stock

Interpreting the Correlation:

High positive correlation (close to 1): The stock’s market-adjusted and industry-adjusted alphas move together. When it beats the market, it also beats its industry. This suggests robust stock-specific factors driving performance.
Low correlation (close to 0): The two measures are independent. A stock might outperform the market but underperform its industry (or vice versa), depending on industry vs. market dynamics.
Negative correlation: Rare, but would suggest that when a stock beats the market, it tends to underperform its industry, possibly indicating the stock benefits more from broad market rallies than industry-specific trends.

Practical Applications:

Understanding this correlation helps with:

Hedging Strategy Design:
- High correlation → Consider hedging at the industry level (industry ETFs) rather than market level, as the stock’s performance is more industry-specific
- Low correlation → May need to hedge both market and industry risks independently
Portfolio Construction:
- Stocks with low correlation provide more diversification benefits when combined in a portfolio, as their alphas respond differently to market vs. industry factors
- Stocks with high correlation offer more consistent alpha generation across different market regimes
Risk Decomposition:
- The correlation helps decompose a stock’s idiosyncratic risk into:
  - Market-specific idiosyncratic risk: Captured by market-adjusted alpha
  - Industry-specific idiosyncratic risk: Captured by industry-adjusted alpha
  - The correlation reveals how these two components interact

How many stocks on each day have had 3 consecutive days of gains or losses?¶

Why This Matters:

Consecutive winning or losing streaks are a key indicator of momentum and reversal patterns in markets. Stocks with 3+ consecutive up days may be experiencing momentum (trend continuation) driven by positive news, earnings beats, or investor sentiment, while stocks with 3+ consecutive down days might signal sell-offs or negative sentiment. By tracking how many stocks are on streaks each day, we can gauge overall market sentiment and identify potential turning points. High numbers of stocks on winning streaks suggest bullish momentum, while high numbers on losing streaks indicate bearish pressure. This metric is valuable for momentum traders, contrarian strategies, and understanding market regime shifts.

(
    data
    # Step 1: Create binary indicators for up/down days based on returns
    .sort(c.permno, c.date)  # Ensure proper ordering for each stock
    .with_columns(
        # up = 1 if positive return, 0 otherwise
        up=pl.when(c.retx > 0).then(1).otherwise(0),
        # down = 1 if negative return, 0 otherwise
        down=pl.when(c.retx < 0).then(1).otherwise(0)
    )
    # Step 2: Calculate rolling 3-day sum for each stock
    # This gives us the count of up/down days in the last 3 days
    .with_columns(
        # Sum of up days in the last 3 trading days (including today)
        up_streak_3=c.up.rolling_sum(window_size=3).over(c.permno),
        # Sum of down days in the last 3 trading days (including today)
        down_streak_3=c.down.rolling_sum(window_size=3).over(c.permno)
    )
    # Step 3: For each date, count how many stocks have exactly 3 consecutive up/down days
    .group_by(c.date)
    .agg(
        # Count stocks where all 3 days were up (up_streak_3 == 3)
        num_3day_gainers=(c.up_streak_3 == 3).sum(),
        # Count stocks where all 3 days were down (down_streak_3 == 3)
        num_3day_losers=(c.down_streak_3 == 3).sum()
    )
    # Step 4: Sort by date and display results
    .sort(c.date)
    [10:13]
)

Execution Explained:

Step 1: We create binary indicators (up and down) for each stock-day observation. If retx > 0, the stock had a positive return that day (up = 1), otherwise up = 0. Similarly for down when retx < 0. We sort by permno and date to ensure proper time-series ordering for each stock.
Step 2: We use .rolling_sum(window_size=3) to calculate the sum of up and down indicators over the last 3 trading days (including the current day) for each stock. The .over(c.permno) ensures the rolling window is calculated separately for each stock. If up_streak_3 == 3, it means all 3 of the last 3 days were up days for that stock.
Step 3: We group by date and count how many stocks have up_streak_3 == 3 (3 consecutive gains) and down_streak_3 == 3 (3 consecutive losses) on each day. The .sum() operation counts the number of True values in each boolean expression.
Step 4: We sort by date and display the first 3 days.

Interpreting the Results:

num_3day_gainers: The number of stocks that have risen for 3 consecutive trading days as of this date
num_3day_losers: The number of stocks that have fallen for 3 consecutive trading days as of this date

Note: Early dates may show low or zero counts because stocks need at least 3 days of history to qualify for a 3-day streak.

Extended Analysis: Momentum Breadth Indicator

Market breadth measures how many stocks are participating in a market move. It answers the question: “Is the market rally (or decline) broad-based, or driven by just a few large stocks?” Strong breadth means many stocks are moving in the same direction, indicating healthy, sustainable trends. Weak breadth (e.g., indices up but most stocks down) suggests the move is fragile and may reverse. Breadth is considered a leading indicator - divergences between price and breadth often precede trend changes.

We can create a “momentum breadth” indicator by comparing the number of gainers vs. losers:

(
    data
    # Step 1-2: Same as before
    .sort(c.permno, c.date)
    .with_columns(
        up=pl.when(c.retx > 0).then(1).otherwise(0),
        down=pl.when(c.retx < 0).then(1).otherwise(0)
    )
    .with_columns(
        up_streak_3=c.up.rolling_sum(window_size=3).over(c.permno),
        down_streak_3=c.down.rolling_sum(window_size=3).over(c.permno)
    )
    # Step 3: Calculate breadth metrics
    .group_by(c.date)
    .agg(
        num_3day_gainers=(c.up_streak_3 == 3).sum().cast(int),
        num_3day_losers=(c.down_streak_3 == 3).sum().cast(int)
    )
    # Step 4: Create breadth indicators
    .with_columns(
        # Ratio of gainers to losers (momentum strength)
        gainer_loser_ratio=(c.num_3day_gainers / c.num_3day_losers),
        # Net breadth (difference)
        net_breadth=(c.num_3day_gainers - c.num_3day_losers),
        # Breadth percentage (% of total that are gaining)
        breadth_pct=(c.num_3day_gainers / (c.num_3day_gainers + c.num_3day_losers))
    )
    .sort(c.date)
    [10:15]
)

Interpreting Breadth Indicators:

gainer_loser_ratio:
- > 1: More stocks on winning streaks than losing streaks (bullish momentum)
- < 1: More stocks on losing streaks (bearish momentum)
- >> 1 (e.g., 5): Very strong bullish momentum - much more participation in uptrends
net_breadth:
- Positive: More gainers than losers
- Negative: More losers than gainers
- Useful for spotting divergences (e.g., market indices up but breadth negative)
breadth_pct:
- > 0.5: Majority of streaking stocks are gaining (bullish)
- < 0.5: Majority of streaking stocks are losing (bearish)
- Close to 1.0: Almost all streaking stocks are gaining (very bullish)

Trading Applications:

Momentum strategies: Enter long when breadth is strong (high ratio), exit when it weakens
Contrarian strategies: Look for reversals when breadth reaches extremes (very high or very low ratios)
Market regime detection: Persistent high breadth suggests trending markets; low breadth suggests choppy, directionless markets

How many stocks on each day have outperformed the market for 3 consecutive days?¶

Why This Matters:

While previous question tracked absolute gains/losses, this question identifies stocks with relative outperformance - those consistently beating the market over multiple days. This is a more sophisticated momentum indicator because it isolates stock-specific strength from overall market movements. Stocks that outperform the market for 3+ consecutive days may have company-specific catalysts (earnings surprises, analyst upgrades, product launches) driving returns beyond what’s explained by market-wide movements. This metric helps identify true alpha generators and is particularly valuable for stock pickers and relative-value strategies. High counts suggest many stocks are generating idiosyncratic returns, while low counts indicate a more homogeneous, market-driven environment.

(
    data
    # Step 1: Calculate stock returns and market-cap weights
    .sort(c.permno, c.date)  # Ensure proper ordering for each stock
    .with_columns(
        # Stock return: use the existing retx column
        stock_return=c.retx,
        # Market-cap weight: each stock's weight in the overall market
        stock_weight=(c.cap / c.cap.sum().over(c.date))
    )
    # Step 2: Calculate the value-weighted market return for each day
    .with_columns(
        # Market return = weighted average of all stock returns
        market_return=(c.stock_return * c.stock_weight).sum().over(c.date)
    )
    # Step 3: Create binary indicator for stocks that beat the market
    .with_columns(
        # beat_market = 1 if stock return > market return, 0 otherwise
        beat_market=pl.when(c.stock_return > c.market_return).then(1).otherwise(0)
    )
    # Step 4: Calculate rolling 3-day sum of beating the market
    .with_columns(
        # Count how many of the last 3 days the stock beat the market
        beat_market_streak_3=c.beat_market.rolling_sum(window_size=3).over(c.permno)
    )
    # Step 5: For each date, count stocks that beat the market all 3 days
    .group_by(c.date)
    .agg(
        # Count stocks where beat_market_streak_3 == 3 (outperformed all 3 days)
        num_3day_outperformers=(c.beat_market_streak_3 == 3).sum()
    )
    # Step 6: Sort by date and display results
    .sort(c.date)
    [10:13]
)

Execution Explained:

Step 1: We calculate each stock’s return (using retx) and its weight in the overall market based on market capitalization. We sort by permno and date to ensure proper time-series ordering for each stock.
Step 2: We calculate the value-weighted market return for each day. This is the benchmark against which we’ll compare each stock’s performance.
Step 3: We create a binary indicator beat_market that equals 1 if a stock’s return exceeds the market return on that day, and 0 otherwise.
Step 4: We use .rolling_sum(window_size=3) to count how many of the last 3 trading days each stock beat the market. If beat_market_streak_3 == 3, it means the stock outperformed the market on all 3 of the last 3 days.
Step 5: We group by date and count how many stocks have beat_market_streak_3 == 3 on each day. The .sum() operation counts the number of True values.
Step 6: We sort by date and display the first 3 days.

Interpreting the Results:

num_3day_outperformers: The number of stocks that have beaten the market for 3 consecutive trading days as of this date
High values suggest widespread stock-specific strength and dispersion in returns (good for stock pickers)
Low values suggest market returns are more homogeneous (harder to generate alpha through stock selection)
This differs from question 38’s absolute momentum by focusing on relative performance

Comparing Absolute vs. Relative Momentum:

Understanding the difference between absolute and relative momentum is crucial:

Metric	What it measures	Example
Absolute momentum (previous question)	Stocks with 3 consecutive positive returns	Stock up 0.5%, 0.3%, 0.2% = qualifies
Relative momentum (this question)	Stocks outperforming market for 3 days	Stock up 0.5% when market up 0.8% = does NOT qualify

Key Insights:

In bull markets, absolute momentum counts will be high, but relative momentum counts reveal which stocks are leading the rally
In bear markets, you can have low absolute momentum (few stocks rising) but still find relative outperformers (stocks falling less than the market)
Divergences are informative:
- High absolute momentum + Low relative momentum → Strong market rally, but few stocks beating it (narrow leadership)
- Low absolute momentum + High relative momentum → Weak market, but many stocks showing relative strength (defensive rotation)

This metric is essential for long-short strategies where you want to be long stocks with strong relative momentum regardless of absolute market direction.

How many stocks on each day have outperformed the market in at least 4 out of the last 5 trading days?¶

A Polars Challenge:

This question is more of a technical challenge to test your Polars skills than a standard financial analysis. It requires combining rolling window calculations, conditional logic, and aggregations - demonstrating how Polars handles complex multi-step transformations efficiently. The key challenge is tracking a “4 out of 5” condition using rolling sums and then aggregating across all stocks for each day.

(
    data
    # Step 1: Calculate stock returns and market-cap weights
    .sort(c.permno, c.date)  # Ensure proper ordering for each stock
    .with_columns(
        # Stock return: use the existing retx column
        stock_return=c.retx,
        # Market-cap weight: each stock's weight in the overall market
        stock_weight=(c.cap / c.cap.sum().over(c.date))
    )
    # Step 2: Calculate the value-weighted market return for each day
    .with_columns(
        # Market return = weighted average of all stock returns
        market_return=(c.stock_return * c.stock_weight).sum().over(c.date)
    )
    # Step 3: Create binary indicator for stocks that beat the market
    .with_columns(
        # beat_market = 1 if stock return > market return, 0 otherwise
        beat_market=pl.when(c.stock_return > c.market_return).then(1).otherwise(0)
    )
    # Step 4: Calculate rolling 5-day sum of beating the market
    .with_columns(
        # Count how many of the last 5 days the stock beat the market
        beat_market_count_5=c.beat_market.rolling_sum(window_size=5).over(c.permno)
    )
    # Step 5: For each date, count stocks that beat the market at least 4 out of 5 days
    .group_by(c.date)
    .agg(
        # Count stocks where beat_market_count_5 >= 4 (at least 4 out of 5 days)
        num_4of5_outperformers=(c.beat_market_count_5 >= 4).sum()
    )
    # Step 6: Sort by date and display results
    .sort(c.date)
    [10:13]
)

Execution Explained:

Step 1-2: We calculate each stock’s return and the market-cap-weighted market return, just like in the previous question.
Step 3: We create a binary indicator beat_market that equals 1 when a stock outperforms the market, 0 otherwise.
Step 4: Here’s the key challenge - we use .rolling_sum(window_size=5) to count how many of the last 5 trading days each stock beat the market. This gives us a value between 0 and 5 for each stock-day.
Step 5: We filter to stocks where beat_market_count_5 >= 4, meaning the stock beat the market on at least 4 out of the last 5 days. Then we count how many such stocks exist on each date.
Step 6: Sort and display.

Interpreting the Results:

num_4of5_outperformers: The number of stocks that have beaten the market in at least 4 out of the last 5 trading days
This is a looser condition than the 3/3 requirement in question 39, so counts will typically be higher
Useful for identifying stocks with consistent but not perfect relative strength
The “4 out of 5” threshold allows for one off day while still capturing strong relative momentum

Which stocks in each month have returns exceeding twice the market’s monthly return?¶

Why This Matters:

Identifying stocks with outsized monthly returns relative to the market reveals potential “home run” opportunities and helps understand the distribution of extreme performance. Stocks that deliver returns exceeding 2x the market return in a single month often have catalysts like major acquisitions, FDA approvals, earnings blowouts, or short squeezes. While most stocks cluster around the market average, these outliers can significantly boost portfolio returns. Understanding their frequency and characteristics helps in constructing portfolios with positive skewness (asymmetric upside) and informs position sizing strategies for high-conviction trades.

(
    data
    # Step 1: Create year-month identifier and calculate daily market-cap weights
    .with_columns(
        # Extract year-month from date (e.g., 2012-01-04 -> "2012-01")
        year_month=c.date.cast(str).str.slice(0, 7),
        # Each stock's weight in the market on each day
        stock_weight=(c.cap / c.cap.sum().over(c.date))
    )
    # Step 2: Calculate daily market index and stock monthly returns
    .with_columns(
        # Create a value-weighted market index for each day
        market_index=(c.close * c.stock_weight).sum().over(c.date),
        # Calculate each stock's return for the entire month
        # (last close / first close - 1) within each month
        stock_monthly_return=(c.close.last() / c.close.first() - 1).over(c.permno, c.year_month)
    )
    # Step 3: Calculate market's monthly return
    .with_columns(
        # Market monthly return = (last market index / first market index - 1)
        # We use the same grouping structure for consistency
        market_monthly_return=(
            c.market_index.last().over(c.permno, c.year_month) / 
            c.market_index.first().over(c.permno, c.year_month) - 1
        )
    )
    # Step 4: Select relevant columns and remove duplicates
    .select(
        c.permno,
        c.ticker,
        c.year_month,
        c.stock_monthly_return,
        c.market_monthly_return
    )
    .unique()
    # Step 5: Filter for stocks with returns > 2x market return
    .filter(
        # Stock return must exceed twice the market return
        c.stock_monthly_return > 2 * c.market_monthly_return
    )
    # Step 6: Sort and display results
    .sort(c.year_month, c.permno)
    .head(3)  # Show first 3 results
)

Execution Explained:

Step 1: We create a year_month identifier by extracting the first 6 characters from the date (e.g., “2012-01-04” becomes “2012-01”). We also calculate daily market-cap weights.
Step 2: We create a value-weighted market index for each day (sum of close prices weighted by market cap). For each stock within each month, we calculate the monthly return as (last_close - first_close) / first_close.
Step 3: We calculate the market’s monthly return in the same way: (last_market_index - first_market_index) / first_market_index for each month.
Step 4: We select the relevant columns and use .unique() to remove duplicates (since each stock-month combination will have the same monthly return repeated across all days in that month).
Step 5: We filter to keep only stocks where stock_monthly_return > 2 * market_monthly_return. This identifies extreme outperformers.
Step 6: Sort by month and stock, then display the first 3 results.

Practice Exercise: Which stocks in each month have returns exceeding twice their industry’s monthly return?¶

Your Task:

This is similar to previous questions, but now compare each stock’s monthly return to its industry’s monthly return (instead of the market’s return). Your goal is to:

Calculate each stock’s monthly return
Calculate each industry’s monthly return (value-weighted by market cap)
Filter stocks where stock_monthly_return > 2 × industry_monthly_return
Return the results sorted by month and stock

Hints:

You’ll need to group by ['date', 'industry'] when calculating industry weights
The logic is very similar to question 41, but with industry-level instead of market-level aggregation
Don’t forget to use .head(3) at the end to show only the first 3 results

(
    data
    # Step 1: Create year-month identifier and calculate industry-level market-cap weights
    .with_columns(
        # Extract year-month from date (e.g., 20120104 -> "201201")
        year_month=c.date.cast(str).str.slice(0, 6),
        # Each stock's weight within its industry on each day
        stock_weight_in_industry=(c.cap / c.cap.sum().over(c.date, c.industry))
    )
    # Step 2: Calculate daily industry index and stock monthly returns
    .with_columns(
        # Create a value-weighted industry index for each industry on each day
        industry_index=(c.close * c.stock_weight_in_industry).sum().over(c.date, c.industry),
        # Calculate each stock's return for the entire month
        stock_monthly_return=(c.close.last() / c.close.first() - 1).over(c.permno, c.year_month)
    )
    # Step 3: Calculate each industry's monthly return
    .with_columns(
        # Industry monthly return = (last industry index / first industry index - 1)
        industry_monthly_return=(
            c.industry_index.last().over(c.permno, c.year_month) / 
            c.industry_index.first().over(c.permno, c.year_month) - 1
        )
    )
    # Step 4: Select relevant columns and remove duplicates
    .select(
        c.permno,
        c.ticker,
        c.industry,
        c.year_month,
        c.stock_monthly_return,
        c.industry_monthly_return
    )
    .unique()
    # Step 5: Filter for stocks with returns > 2x industry return
    .filter(
        # Stock return must exceed twice the industry return
        c.stock_monthly_return > 2 * c.industry_monthly_return
    )
    # Step 6: Sort and display results
    .sort(c.year_month, c.permno)
    .head(3)  # Show first 3 results
)

Which industry had the largest month-over-month decline in total trading volume for each month?¶

Why This Matters:

Tracking month-over-month changes in industry trading volume reveals sector-level liquidity shifts and investor attention dynamics. Industries experiencing the largest volume declines may be falling out of favor due to negative sentiment, regulatory headwinds, or capital rotation to other sectors. These “cold spots” can signal risk-off behavior, loss of institutional interest, or the end of a sector rally. Conversely, identifying which industries cool down fastest helps portfolio managers understand liquidity risk and potential exit challenges. This analysis is particularly valuable for sector rotation strategies and understanding market regime changes.

(
    data
    # Step 1: Create year-month identifier
    .with_columns(
        # Extract year-month from date (e.g., 2012-01-04 -> "2012-01")
        year_month=c.date.cast(str).str.slice(0, 7)
    )
    # Step 2: Calculate total trading volume for each industry in each month
    .group_by(c.year_month, c.industry)
    .agg(
        # Sum of dollar trading volume (prcvol) for all stocks in the industry
        industry_volume=c.prcvol.sum()
    )
    # Step 3: Sort by industry and month to prepare for calculating changes
    .sort(c.industry, c.year_month)
    # Step 4: Calculate month-over-month change in volume for each industry
    .with_columns(
        # Current month volume - previous month volume
        # Use .shift(1) to get the previous month's volume
        volume_change=(c.industry_volume - c.industry_volume.shift(1).over(c.industry))
    )
    # Step 5: For each month, find the industry with the largest decline
    # (most negative volume_change)
    .filter(
        # Keep only rows where volume_change equals the minimum for that month
        c.volume_change == c.volume_change.min().over(c.year_month)
    )
    # Step 6: Select relevant columns and display results
    .select(
        c.year_month,
        c.industry,
        c.volume_change
    )
    .sort(c.year_month)
    .head(3)  # Show first 3 months
)

Execution Explained:

Step 1: We extract the year-month from each date to group data by month.
Step 2: We aggregate trading volume (prcvol) at the industry-month level using .group_by() and .agg(). This gives us one row per industry per month showing the total trading volume.
Step 3: We sort by industry and month. This is crucial because the next step uses .shift() which relies on proper ordering to get the “previous month” value.
Step 4: We calculate the month-over-month change in volume using .shift(1).over('industry'). For each industry, shift(1) gets the previous month’s volume, and we subtract it from the current month’s volume. Negative values indicate volume declined.
Step 5: For each month, we filter to keep only the industry with the minimum (most negative) volume_change. This is the industry that experienced the largest decline in that month.
Step 6: We select the relevant columns and sort by month to see which industry had the worst volume decline in each month.

Interpreting the Results:

volume_change < 0: The industry’s trading volume declined from the previous month
Large negative values: Indicate severe liquidity drain from that industry
Same industry appearing repeatedly: Suggests sustained pressure or loss of investor interest in that sector
Different industries each month: Indicates rotating patterns of capital flows

What is the maximum drawdown for each stock?¶

Definition:
Maximum drawdown (MDD) measures the largest peak-to-trough decline in a stock’s price over a given period. It is calculated as:

\text{Maximum Drawdown} = \max\left(\frac{\text{Peak Price} - \text{Trough Price}}{\text{Peak Price}}\right)

For example, if a stock peaks at $100 and later drops to$ 70 before recovering, the drawdown at that point is 30%.

Why This Matters:
Maximum drawdown is a critical risk metric that reveals the worst-case loss an investor would have experienced by holding a stock. It’s essential for:

Risk management: Understanding the potential downside helps set appropriate position sizes and stop-loss levels
Investor psychology: Large drawdowns can trigger panic selling; knowing historical MDD helps investors prepare mentally
Performance evaluation: Comparing returns relative to drawdown (e.g., Calmar ratio) identifies stocks with better risk-adjusted performance
Portfolio construction: Stocks with lower MDD are often preferred for risk-averse portfolios

(
    data
    # Step 1: Sort by stock and date to ensure proper time-series operations
    .sort(c.permno, c.date)
    # Step 2: Calculate the running maximum (peak) price for each stock
    # This represents the highest price seen up to each date
    .with_columns(
        peak_price=c.close.cum_max().over(c.permno)
    )
    # Step 3: Calculate drawdown at each date as the percentage decline from the peak
    # Drawdown = (peak - current_price) / peak
    .with_columns(
        drawdown=(c.peak_price - c.close) / c.peak_price
    )
    # Step 4: Find the maximum drawdown for each stock across all dates
    .group_by(c.permno)
    .agg(
        ticker=c.ticker.first(),
        max_drawdown=c.drawdown.max()
    )
    # Step 5: Sort by permno and display first 3 stocks
    .sort(c.permno)
    .head(3)
)

Execution Explained:

Step 1: Sort the data by permno and date to ensure time-series operations work correctly (each stock’s history must be in chronological order).
Step 2: Use .cum_max() to calculate the cumulative maximum (peak) price for each stock up to each date:
- .cum_max() tracks the highest price seen so far
- .over('permno') ensures the calculation is done separately for each stock
- Example: If prices are [100, 95, 110, 105], peak_price would be [100, 100, 110, 110]
Step 3: Calculate the drawdown at each date as (peak_price - close) / peak_price:
- This measures the percentage decline from the peak
- Using the example above: if current price is 105 and peak is 110, drawdown = (110-105)/110 = 0.045 (4.5%)
Step 4: Group by stock and find the maximum drawdown (the worst decline from any peak).
Step 5: Sort by permno and display the first 3 stocks.

Interpretation:
A max_drawdown of 0.30 means the stock experienced a 30% decline from its peak at some point during the period. Lower MDD values indicate less severe price declines and potentially lower risk.

What is the win rate for each stock?¶

Definition:
Win rate (also known as “hit rate” or “success rate”) is the proportion of trading days on which a stock generates positive returns. It is calculated as:

\text{Win Rate} = \frac{\text{Number of days with positive returns}}{\text{Total number of trading days}}

For example, a win rate of 0.55 (or 55%) means the stock had positive returns on 55% of its trading days.

Why This Matters:
Win rate measures the consistency of a stock’s positive performance. A stock with a high win rate (e.g., 60%+) generates positive returns more often, which appeals to risk-averse investors seeking stability. However, win rate alone doesn’t tell the full story—a stock could have a high win rate but small gains, while another with a lower win rate might deliver larger gains when it wins. Understanding win rate helps investors assess behavioral patterns, evaluate risk-adjusted strategies, and combine this metric with profit/loss ratios for a complete performance picture.

(
    data
    # Step 1: Group by stock to calculate win rate for each stock
    .group_by(c.permno)
    .agg(
        # Include ticker for reference
        ticker=c.ticker.first(),
        # Step 2: Calculate win rate as the proportion of days with positive returns
        # Win rate = (number of days with retx > 0) / (total number of days)
        win_rate=(c.retx > 0).sum() / c.retx.count()
    )
    # Step 3: Sort by permno for organized output
    .sort(c.permno)
    .head(3)
)

Execution Explained:

Step 1: Group all observations by permno to analyze each stock separately.
Step 2: Calculate the win rate for each stock:
- (c.retx > 0) creates a boolean mask where True indicates a positive return day
- .sum() counts how many True values (positive return days) exist
- .count() counts the total number of trading days for the stock
- The ratio gives us the proportion of days with positive returns (win rate)
Step 3: Sort by permno and display the first 3 stocks.

Interpretation:
A win rate of 0.55 means the stock had positive returns on 55% of trading days. Higher win rates indicate more consistent positive performance, but remember to also consider the magnitude of gains and losses (profit/loss ratio) for a complete risk assessment.

What is the profit/loss ratio for each stock?¶

Definition:
The profit/loss ratio (also called “win/loss ratio” or “payoff ratio”) measures the magnitude of gains relative to losses. It is calculated as:

\text{Profit/Loss Ratio} = \frac{\sum \text{Positive Returns}}{\left|\sum \text{Negative Returns}\right|}

For example, if a stock’s positive returns sum to +50% and negative returns sum to -25%, the profit/loss ratio is 50/25 = 2.0.

Why This Matters:
The profit/loss ratio complements the win rate to provide a complete picture of trading performance. While win rate tells you how often a stock goes up, the profit/loss ratio tells you how much it gains versus loses:

Ratio > 1.0: The stock gains more when it wins than it loses when it falls (favorable asymmetry)
Ratio < 1.0: The stock loses more when it falls than it gains when it rises (unfavorable asymmetry)
Combined with win rate: A stock with a low win rate (e.g., 40%) can still be profitable if it has a high profit/loss ratio (e.g., 3.0)—meaning big wins compensate for frequent small losses. This is common in momentum and growth stocks.

(
    data
    # Step 1: Group by stock to calculate profit/loss ratio for each stock
    .group_by(c.permno)
    .agg(
        # Include ticker for reference
        ticker=c.ticker.first(),
        # Step 2: Calculate profit/loss ratio
        # Sum of positive returns divided by absolute value of sum of negative returns
        profit_loss_ratio=(
            c.retx.filter(c.retx > 0).sum() /  # Sum of all gains
            c.retx.filter(c.retx < 0).abs().sum()  # Absolute value of sum of all losses
        )
    )
    # Step 3: Sort by permno for organized output
    .sort(c.permno)
    .head(3)
)

Execution Explained:

Step 1: Group all observations by permno to analyze each stock separately.
Step 2: Calculate the profit/loss ratio for each stock:
- .filter(c.retx > 0) selects only the positive return days
- .sum() adds up all the positive returns (total gains)
- .filter(c.retx < 0) selects only the negative return days
- .abs().sum() takes the absolute value and sums all losses (making them positive for division)
- The ratio tells us how much the stock gained relative to how much it lost
Step 3: Sort by permno and display the first 3 stocks.

Interpretation:

A profit_loss_ratio of 1.5 means the stock gained 50% more than it lost (e.g., +30% total gains vs. -20% total losses)
A ratio of 0.8 means the stock lost more than it gained (unfavorable), even if it had a high win rate
Key insight: Combine this with win rate—a stock with 45% win rate and 2.5 profit/loss ratio is still profitable because its big wins outweigh frequent small losses

Practice Exercise: What is the market’s win rate?¶

Your Task:
Calculate the market’s win rate. You need to:

Calculate the market-cap weighted return for each day
Determine the proportion of days with positive market returns

Hints:

Compute each stock’s weight: stock_weight = market_cap / total_market_cap
Calculate market return: market_return = sum(stock_return × stock_weight)
Compute win rate: positive days / total days
Use .over('date') for within-day calculations
Remove duplicate dates when counting

Try it yourself before checking the solution!

Solution:

(
    data
    # Step 1: Calculate stock returns and market-cap weights for each stock
    .with_columns(
        stock_return=c.retx,
        stock_weight=(c.cap / c.cap.sum().over(c.date))
    )
    # Step 2: Calculate value-weighted market return for each day
    .with_columns(
        market_return=(c.stock_return * c.stock_weight).sum().over(c.date)
    )
    # Step 3: Select unique dates with their market returns
    .select([c.date, c.market_return])
    .unique()
    # Step 4: Calculate the market's win rate
    .select(
        market_win_rate=(c.market_return > 0).sum() / c.market_return.count()
    )
)

Key Differences from Individual Stock Win Rate:

Aggregation level: Instead of grouping by stock (permno), we compute a single market-level return for each date by taking the weighted average of all stocks
Weighting: Each stock’s return is weighted by its market cap (cap / total_cap) to reflect its influence on the overall market
Single output: The result is one number (the market’s win rate) rather than one win rate per stock

Interpretation:
A market win rate of 0.52 (52%) means the market had positive returns on 52% of trading days. This is typical for equity markets, which have a slight upward bias over time.

What is the market’s profit/loss ratio?¶

Definition:
The market’s profit/loss ratio measures the magnitude of the market’s aggregate gains relative to its aggregate losses over the entire period. Unlike individual stock profit/loss ratios, this metric uses market-cap weighted returns to reflect the market as a whole.

Why This Matters:
The market’s profit/loss ratio reveals the asymmetry between up days and down days at the aggregate level. A ratio above 1.0 indicates that the market gains more on up days than it loses on down days, which is a key feature of long-term equity market growth. Combined with the market’s win rate, this helps explain why passive investing in diversified portfolios tends to be profitable over the long term—even if the win rate is only slightly above 50%, the gains on winning days often exceed losses on losing days. This asymmetry is partly driven by the compounding nature of returns and the fact that large-cap stocks (which have higher weights) often exhibit different risk-return profiles than smaller stocks.

(
    data
    # Step 1: Calculate stock returns and market-cap weights for each stock
    .with_columns(
        stock_return=c.retx,
        stock_weight=(c.cap / c.cap.sum().over(c.date))
    )
    # Step 2: Calculate value-weighted market return for each day
    .with_columns(
        market_return=(c.stock_return * c.stock_weight).sum().over(c.date)
    )
    # Step 3: Select unique dates with their market returns
    .select([c.date, c.market_return])
    .unique()
    # Step 4: Calculate the market's profit/loss ratio
    # Sum of all positive returns divided by absolute sum of all negative returns
    .select(
        market_profit_loss_ratio=(
            c.market_return.filter(c.market_return > 0).sum() /
            c.market_return.filter(c.market_return < 0).abs().sum()
        )
    )
)

Execution Explained:

Step 1: Calculate each stock’s return and its weight in the overall market (market cap / total market cap on each date).
Step 2: Calculate the value-weighted market return for each day by taking the weighted average: sum(stock_return × stock_weight).
Step 3: Select unique dates with their corresponding market returns to get one observation per trading day.
Step 4: Calculate the market’s profit/loss ratio:
- .filter(c.market_return > 0) selects only the days when the market went up
- .sum() adds up all the positive market returns (total market gains)
- .filter(c.market_return < 0) selects only the days when the market went down
- .abs().sum() takes the absolute value and sums all losses (total market losses)
- The ratio tells us how much the market gained relative to how much it lost

Interpretation:
A market_profit_loss_ratio of 1.15 means the market gained 15% more than it lost over the period. This positive asymmetry, combined with a win rate slightly above 50%, explains why long-term passive investing tends to be profitable. The market’s “up days” collectively deliver more than the “down days” take away.

Key Insight:
This metric complements the market’s win rate (Question 51). Together, they reveal:

Win rate ~52% + Profit/loss ratio ~1.15: The market goes up slightly more often, AND gains more when it goes up
This double advantage creates the long-term upward drift that benefits buy-and-hold investors

What is the win rate for each industry?¶

Definition:
Each industry’s win rate is the proportion of trading days on which that industry (measured as a market-cap weighted index of all stocks within the industry) generates positive returns. This metric shows how consistently different sectors perform.

Why This Matters:
Industry win rates reveal which sectors have more stable, consistent performance versus volatile, cyclical behavior. For example, defensive sectors like Utilities often have higher win rates (more frequent but smaller gains), while cyclical sectors like Technology might have lower win rates but larger gains when they do win. Understanding these patterns helps investors with sector rotation strategies, portfolio diversification, and risk management. Industries with high win rates may be suitable for conservative portfolios, while those with lower win rates but high profit/loss ratios might appeal to growth-oriented investors willing to accept more volatility.

(
    data
    # Step 1: Calculate stock returns and industry-level market-cap weights
    .with_columns(
        stock_return=c.retx,
        # Weight of each stock within its industry on each day
        stock_weight_in_industry=(c.cap / c.cap.sum().over(c.date, c.industry))
    )
    # Step 2: Calculate value-weighted industry return for each industry on each day
    .with_columns(
        industry_return=(c.stock_return * c.stock_weight_in_industry).sum().over(c.date, c.industry)
    )
    # Step 3: Select unique (industry, date) combinations with their returns
    .select([c.industry, c.date, c.industry_return])
    .unique()
    # Step 4: Group by industry and calculate win rate for each industry
    .group_by(c.industry)
    .agg(
        # Win rate = proportion of days with positive returns
        industry_win_rate=(c.industry_return > 0).sum() / c.industry_return.count()
    )
    # Step 5: Sort by industry win rate
    .sort(c.industry_win_rate, descending=True)
    .head(3)
)

Execution Explained:

Step 1: Calculate each stock’s return and its weight within its industry (stock’s market cap / total industry market cap on each day).
Step 2: Calculate the value-weighted industry return for each industry on each day by taking the weighted average within each industry: sum(stock_return × stock_weight_in_industry).
Step 3: Select unique (industry, date) combinations to get one observation per industry per day, along with the industry’s return for that day.
Step 4: Group by industry and calculate the win rate for each industry:
- (c.industry_return > 0) creates a boolean mask for days when the industry had positive returns
- .sum() counts the number of positive return days
- .count() counts the total number of trading days for each industry
- The ratio gives the proportion of days with positive returns (win rate)
Step 5: Sort by industry name and display the first 3 industries.

Interpretation:

An industry_win_rate of 0.55 means that industry had positive returns on 55% of trading days
Higher win rates (e.g., 0.60+) suggest more consistent, stable performance—common in defensive sectors like Utilities or Consumer Staples
Lower win rates (e.g., 0.45-0.50) might indicate more volatile sectors like Technology or Energy, which may have bigger swings but potentially larger gains when they do win

Can a stock’s monthly trading volume exceed its industry’s daily trading volume?¶

A Question Out of Curiosity:
This is more of an exploratory exercise than a core financial analysis question. We’re checking if there are extreme cases where a single stock is so dominant within its industry that its entire month’s trading volume exceeds what the entire industry trades on any single day in that same month. While this doesn’t have direct investment implications, it reveals interesting patterns about market concentration, liquidity distribution, and dominant players within sectors. The answer might surprise you and highlight just how concentrated trading can be in certain mega-cap stocks.

(
    data
    # Step 1: Create year-month identifier and calculate daily industry trading volume
    .with_columns(
        year_month=c.date.cast(str).str.slice(0, 7),
        # Total trading volume for the entire industry on each day
        industry_daily_volume=c.prcvol.sum().over(c.date, c.industry)
    )
    # Step 2: Calculate each stock's total monthly trading volume
    .with_columns(
        # Sum of a stock's trading volume across all days in the month
        stock_monthly_volume=c.prcvol.sum().over(c.year_month, c.permno)
    )
    # Step 3: Filter for cases where stock's monthly volume > industry's daily volume
    .filter(
        c.stock_monthly_volume > c.industry_daily_volume
    )
    # Step 4: Select relevant columns
    .select([
        c.permno,
        c.ticker,
        c.year_month,
        c.stock_monthly_volume,
        c.date,
        c.industry_daily_volume
    ])
    # Step 5: Remove duplicates and sort for readability
    .unique()
    .sort(c.permno, c.year_month)
    .head(3)
)

Execution Explained:

Step 1: Create a year_month identifier (e.g., “2023-01”) and calculate the total trading volume for each industry on each day by summing all stocks’ prcvol within that industry-date combination.
Step 2: Calculate each stock’s total monthly trading volume by summing its prcvol across all trading days in that month.
Step 3: Filter to keep only the cases where a stock’s monthly volume exceeds its industry’s daily volume on that particular date.
Step 4: Select the relevant columns: stock identifier (permno, ticker), month (year_month), the stock’s monthly volume, the specific date, and the industry’s daily volume on that date.
Step 5: Remove duplicate rows and sort by stock and month for organized output.

Interpretation:
Each row shows a (stock, month, date) combination where the stock’s total monthly trading volume exceeded what its entire industry traded on that specific day. This indicates extreme dominance—the stock is so heavily traded that its whole month’s activity surpasses its industry’s single-day total.

This is a fun exploratory question that highlights just how concentrated trading activity can be in certain stocks!

How many stocks enter or exit each industry on each day?¶

Why This Matters:
Tracking industry reclassifications reveals important structural changes in the market. When companies change industries (e.g., a retailer becoming more tech-focused), it signals strategic pivots that can affect investment theses. Understanding the frequency and patterns of these changes helps with:

Index tracking: Industry-based ETFs and sector funds need to rebalance when stocks move between industries, creating trading opportunities
Risk management: Sudden exits from an industry may indicate distress or restructuring
Sector momentum: Industries with high entry rates might be attracting growth capital
Classification changes: Large-scale reclassifications (like when GICS updated sector definitions) create rebalancing events that active traders can exploit

This analysis helps identify when the market’s industry structure is evolving.

(
    data
    # Step 1: Sort by stock and date to track each stock's history
    .sort(c.permno, c.date)
    # Step 2: Create columns to track previous industry and date for each stock
    .with_columns(
        # Get the previous day's industry for this stock
        industry_prev=c.industry.shift(1).over(c.permno),
        # Get the previous day's date to detect gaps
        date_prev=c.date.shift(1).over(c.permno)
    )
    # Step 3: Create indicators for entering and exiting industries
    .with_columns(
        # A stock "enters" an industry if:
        # 1) It changed industries from the previous day, OR
        # 2) It's the first day we see this stock (industry_prev is null)
        industry_in=pl.when(
            (c.industry != c.industry_prev) | c.industry_prev.is_null()
        ).then(1).otherwise(0),
        # A stock "exits" an industry if:
        # It changed industries AND had an industry yesterday (not new stock)
        industry_out=pl.when(
            (c.industry != c.industry_prev) & c.industry_prev.is_not_null()
        ).then(1).otherwise(0)
    )
    # Step 4: Count entries and exits for each industry on each day
    .group_by(c.date, c.industry)
    .agg(
        in_count=c.industry_in.sum(),
        out_count=c.industry_out.sum()
    )
    # Step 5: Sort by date and industry, display first 3 rows
    .sort(c.date, c.industry)
    # show the result on day 2023-01-11
    .filter(c.date == pl.date(2023, 1, 11))
)

Execution Explained:

Step 1: Sort the data by permno and date so we can track each stock’s time series chronologically.
Step 2: Use .shift(1).over('permno') to get the previous day’s industry and date for each stock:
- industry_prev: What industry was this stock in yesterday?
- date_prev: What was yesterday’s date? (helps detect data gaps)
Step 3: Create binary indicators for industry transitions:
- industry_in = 1 if the stock’s industry today differs from yesterday OR if this is the first time we see the stock (new listing)
- industry_out = 1 if the stock changed industries AND had an industry yesterday (excludes new listings)
Step 4: Group by (date, industry) and count the number of stocks entering (in_count) and exiting (out_count) each industry on each day.
Step 5: Sort by date and industry for organized output.

Interpretation:

in_count: Number of stocks that entered this industry on this date (either reclassified from another industry or newly listed)
out_count: Number of stocks that left this industry on this date (reclassified to a different industry)
An in_count of 5 and out_count of 2 means 5 stocks entered and 2 stocks left that industry on that day

Common Patterns:

Most days will show in_count = 0 and out_count = 0 for most industries (stable classifications)
Spikes in entries/exits indicate:
- Index rebalancing events (e.g., quarterly reviews)
- GICS reclassifications (when industry definitions change)
- Corporate actions (mergers, spin-offs, business model shifts)
- New IPOs (stocks entering their initial industry)

How well does trading volume predict next-day returns?¶

Advanced Predictive Modeling Challenge:
This is a tough one! We’re exploring whether unusual trading activity today can predict abnormal returns tomorrow. The approach: standardize each day’s trading volumes using z-scores (subtract mean, divide by standard deviation), then use yesterday’s z-score to predict today’s returns via linear regression. The R² (coefficient of determination) tells us what proportion of return variation is explained by this volume signal.

Why This Matters:
This question gets at the heart of quantitative trading: Can we use observable signals to predict future returns? Trading volume is one of the most scrutinized predictors in finance. High volume might signal:

Informed trading: Smart money moving in, potentially predicting price changes
Momentum: High-volume rallies continuing into the next day
Mean reversion: Extreme volume causing overshooting that reverses

The R² tells us if this is just noise or a real signal. Even a low R² (e.g., 5-10%) can be profitable if consistent across stocks. This type of analysis is foundational in algorithmic trading and factor investing. Note: We calculate z-scores cross-sectionally (comparing stocks on the same day) to isolate relative volume spikes rather than absolute levels.

import polars_ds as pds

(
    data
    # Step 1: Calculate cross-sectional z-scores of trading volume for each day
    .with_columns(
        # Z-score: (value - mean) / std_dev
        # We compute this cross-sectionally (comparing stocks on the same day)
        volume_z_score=(
            (c.prcvol - c.prcvol.mean().over(c.date)) / 
            c.prcvol.std().over(c.date)
        )
    )
    # Step 2: Sort by stock and date, then shift z-score to get yesterday's value
    .sort(c.permno, c.date)
    .with_columns(
        # Get yesterday's z-score for each stock
        volume_z_score_lag=c.volume_z_score.shift(1).over(c.permno),
        # Today's return is our target variable
        stock_return=c.retx
    )
    # Step 2.5: Filter to keep only stocks with sufficient observations
    # We need at least 30 observations per stock for meaningful regression
    .with_columns(
        num_obs=c.permno.count().over(c.permno)
    )
    .filter(c.num_obs >= 30)
    # Also drop rows with null z-scores (first day for each stock)
    .filter(c.volume_z_score_lag.is_not_null())
    # Step 3: For each stock, run regression: today's return ~ yesterday's z-score
    .select(
        c.permno,
        c.ticker,
        c.volume_z_score_lag,
        c.stock_return,
        # Run linear regression for each stock separately
        pds.lin_reg(
            c.volume_z_score_lag,
            target=c.stock_return,
            add_bias=True,  # Include intercept
            return_pred=True  # Return predictions and residuals
        )
        .over(c.permno)
        .alias('regression_results')
    )
    # Step 4: Unnest regression results
    .unnest('regression_results')
    # Step 5: Calculate R² for each stock
    .group_by(c.permno, c.ticker)
    .agg(
        # R² = 1 - (SS_residual / SS_total)
        # SS_residual = sum of squared residuals
        # SS_total = sum of squared deviations from mean
        r_squared=(
            1 - 
            (c.resid.pow(2).sum()) / 
            ((c.stock_return - c.stock_return.mean()).pow(2).sum())
        )
    )
    # Step 6: Filter out invalid R² values and sort by R²
    .filter(c.r_squared.is_finite())
    .sort(c.r_squared, descending=True)
    .head(3)  # Show top 3 stocks with highest R²
)

Execution Explained:

Step 1: Calculate cross-sectional z-scores of trading volume for each date:
- For each day, compute the mean and standard deviation of prcvol across all stocks
- Z-score = (stock’s volume - daily mean) / daily std deviation
- This standardization tells us how unusual a stock’s volume is relative to other stocks that day
Step 2: Sort by stock and date, then shift the z-score by 1 day:
- volume_z_score_lag = yesterday’s z-score
- We’re testing if yesterday’s unusual volume predicts today’s return
Step 2.5: Filter to ensure data quality:
- Count observations per stock and keep only stocks with at least 30 observations
- This prevents regression errors when stocks have too few data points
- Also drop rows where volume_z_score_lag is null (the first day for each stock has no previous day)
Step 3: Run linear regression for each stock:
- Model: today's return ~ yesterday's z-score + intercept
- The pds.lin_reg() function fits this model separately for each stock
- Returns predictions and residuals
Step 4: Unnest the regression results to access individual components
Step 5: Calculate R² for each stock:
- R² = 1 - (sum of squared residuals) / (total sum of squares)
- R² ranges from 0 to 1, where:
  - 0 = volume has no predictive power
  - 1 = volume perfectly predicts returns (unrealistic)
  - 0.05-0.10 = weak but potentially tradeable signal
Step 6: Filter out invalid values, sort by R², and show top 3 stocks

Interpretation:

An r_squared of 0.15 means yesterday’s volume z-score explains 15% of the variation in today’s returns for that stock
Stocks with higher R² values have more predictable return patterns based on volume
In practice, even R² values of 0.05-0.10 can be useful in quantitative strategies
Note: High R² doesn’t guarantee profitability—you’d need to account for transaction costs, slippage, and out-of-sample validation

Why Cross-Sectional Z-Scores?
We standardize relative to other stocks on the same day because:

Market-wide volume changes (high volume days vs. low volume days) are less informative
We want to identify stocks with unusually high or low volume compared to their peers
This approach isolates stock-specific volume signals from market-wide effects

What proportion of top 100 intraday gainers are also top 100 daily gainers?¶

Why This Matters:
This question explores the relationship between intraday momentum (stocks that surge from open to their daily high) and daily momentum (stocks that close higher than yesterday). Understanding this overlap reveals whether early-morning price action predicts end-of-day performance. If there’s high overlap, it suggests that stocks that start strong tend to finish strong—valuable for day traders and momentum strategies. Conversely, low overlap might indicate mean reversion (early gains fade) or late-day reversals. This analysis helps traders understand whether to hold positions that gap up at open or take profits early. It’s also relevant for understanding market microstructure: do informed traders act early, or do retail investors push prices higher throughout the day?

(
    data
    # Step 1: Calculate two types of returns
    .with_columns(
        # Intraday return: from open price to high price (best intraday gain)
        intraday_gain=(c.high / c.open - 1),
        # Daily return: from yesterday's close to today's close
        daily_gain=c.retx
    )
    # Step 2: For each date, identify top 100 stocks by each metric
    .group_by(c.date)
    .agg(
        # Top 100 stocks with largest intraday gains (open to high)
        top100_intraday=c.permno.sort_by(c.intraday_gain, descending=True).head(100),
        # Top 100 stocks with largest daily gains (yesterday close to today close)
        top100_daily=c.permno.sort_by(c.daily_gain, descending=True).head(100)
    )
    # Step 3: Calculate the overlap (intersection) between the two lists
    .with_columns(
        # Find stocks that appear in both top 100 lists
        overlap_count=c.top100_intraday.list.set_intersection(c.top100_daily).list.len(),
    )
    .with_columns(
        # Calculate proportion: overlap / 100
        overlap_proportion=c.overlap_count / 100.0
    )
    
    # Step 4: Select relevant columns and sort by date
    .select([
        c.date,
        c.overlap_count,
        c.overlap_proportion
    ])
    .sort(c.date)
    .head(3)  # Show first 3 days
)

Execution Explained:

Step 1: Calculate two different return metrics:
- intraday_gain: (high / open - 1) measures the best intraday gain from opening price to the day’s highest price
- daily_gain: c.retx is the daily return from yesterday’s close to today’s close
Step 2: For each date, identify the top 100 stocks by each metric:
- top100_intraday: Stocks with the 100 largest gains from open to high
- top100_daily: Stocks with the 100 largest daily returns
- We use .sort_by() with descending=True and .head(100) to get the top performers
Step 3: Calculate the overlap between the two lists:
- .list.set_intersection() finds stocks that appear in both top 100 lists
- .list.len() counts how many stocks are in the intersection
- overlap_proportion = overlap_count / 100.0 gives us the proportion (0.0 to 1.0)
Step 4: Select relevant columns, sort by date, and display first 3 days.

Interpretation:

An overlap_proportion of 0.60 means 60 out of 100 stocks (60%) that had the best intraday gains also had the best daily gains
High overlap (0.6-0.8): Suggests strong momentum—stocks that surge early tend to finish strong
Low overlap (0.2-0.4): Suggests mean reversion or late-day reversals—early gains don’t persist
Very high overlap (0.8+): Indicates consistent momentum throughout the day, potentially driven by strong fundamental news or sustained buying pressure

Trading Implications:

High overlap days: Consider holding positions that gap up at open
Low overlap days: Consider taking profits early, as early gains may fade
This metric helps identify market regimes (momentum vs. mean reversion)

Which stocks have ranked in the top 100 by excess return for 3 consecutive days?¶

Why This Matters:
This question identifies stocks with sustained alpha—consistent outperformance relative to the market over multiple days. Finding stocks that rank in the top 100 by excess return (market-adjusted return) for 3 consecutive days is valuable for several reasons: Momentum strategies often look for stocks with persistent outperformance, as it suggests strong fundamentals or positive news flow. Institutional attention may be building, leading to continued buying pressure. Risk management benefits from understanding which stocks are consistently beating the market, as they may be less correlated with market downturns. However, this pattern can also signal overvaluation or momentum exhaustion—stocks that have run too far, too fast. This analysis helps identify both opportunities (strong momentum) and risks (potential reversals) in active portfolio management.

import polars_ds as pds

(
    data
    # Step 1: Calculate stock returns and market-cap weights
    .with_columns(
        stock_return=c.retx,
        stock_weight=(c.cap / c.cap.sum().over(c.date))
    )
    # Step 2: Calculate value-weighted market return for each day
    .with_columns(
        market_return=(c.stock_return * c.stock_weight).sum().over(c.date)
    )
    # Step 3: Filter stocks with sufficient observations for regression
    .with_columns(num_obs=c.permno.count().over(c.permno))
    .filter(c.num_obs >= 30)
    # Step 4: Calculate excess returns (alpha) using linear regression
    .select(
        c.permno,
        c.ticker,
        c.date,
        c.stock_return,
        c.market_return,
        # Run regression: stock_return ~ market_return for each stock
        pds.lin_reg(
            c.market_return,
            target=c.stock_return,
            add_bias=True,
            return_pred=True
        )
        .over(c.permno)
        .alias('regression_results')
    )
    # Step 5: Unnest regression results to get residuals (excess returns)
    .unnest('regression_results')
    # Step 6: Rank stocks by excess return (residual) for each day
    .sort(c.date, c.resid, descending=[False, True])
    .with_columns(
        # Rank by excess return, descending (highest alpha = rank 1)
        excess_return_rank=c.resid.rank(descending=True).over(c.date)
    )
    # Step 7: Create binary indicator for top 100 stocks each day
    .with_columns(
        is_top_100=pl.when(c.excess_return_rank <= 100).then(1).otherwise(0)
    )
    # Step 8: Sort by stock and date, then calculate rolling 3-day sum
    .sort(c.permno, c.date)
    .with_columns(
        # Count how many of the last 3 days the stock was in top 100
        days_top_100=c.is_top_100.rolling_sum(window_size=3).over(c.permno)
    )
    # Step 9: Filter for stocks that were top 100 all 3 days
    .filter(c.days_top_100 == 3)
    # Step 10: Group by date and aggregate stocks
    .group_by(c.date)
    .agg(
        # Collect all stocks that met the criteria for this date
        stocks=c.permno.unique().sort()
    )
    # Step 11: Sort by date and display first 3 days
    .sort(c.date)
    .head(3)
)

Execution Explained:

Step 1: Calculate stock returns and market-cap weights for each stock on each day.
Step 2: Calculate the value-weighted market return for each day (weighted average of all stock returns).
Step 3: Filter to keep only stocks with at least 30 observations (needed for stable regression).
Step 4: Calculate excess returns (alpha) using linear regression:
- Model: stock_return ~ market_return + intercept
- Run separately for each stock using .over('permno')
- The residual from this regression is the excess return (alpha)
Step 5: Unnest regression results to access the resid column (excess returns).
Step 6: Rank stocks by excess return for each day:
- Sort by date and excess return (descending)
- Assign ranks where rank 1 = highest excess return
Step 7: Create binary indicator: is_top_100 = 1 if rank ≤ 100, else 0.
Step 8: Calculate rolling 3-day sum:
- Sort by stock and date
- For each stock, count how many of the last 3 days it was in the top 100
Step 9: Filter for stocks where days_top_100 == 3 (top 100 all 3 days).
Step 10: Group by date and collect all qualifying stocks.
Step 11: Sort by date and display first 3 days.

Interpretation:
Each row shows a date and a list of stocks (permno values) that ranked in the top 100 by excess return for that day and the previous 2 days. For example, if a date shows [10026, 10028, 10032], it means these 3 stocks had top-100 excess returns on that day, the day before, and 2 days before.

Practice Challenge: Which stocks have ranked in the top 30% by industry-adjusted excess return for 3 consecutive days?¶

Final Challenge Exercise:
This is the last question and it’s a tough one! Building on the previous question, this challenge asks you to find stocks with sustained industry-relative alpha—stocks that consistently outperform their industry peers. Instead of comparing to the overall market, we’re looking at industry-adjusted returns (how much a stock beats its industry), and instead of top 100, we’re looking at top 30% within each industry. This is more nuanced because: (1) you need to calculate industry-adjusted returns using regression, (2) rank stocks within each industry (not across all stocks), (3) identify the top 30% within each industry, and (4) find stocks that were in the top 30% for 3 consecutive days. This type of analysis is crucial for sector-neutral strategies and relative strength investing—finding the best stocks within each sector regardless of overall market conditions.

Your Task:
Find stocks that ranked in the top 30% by industry-adjusted excess return (within their industry) for 3 consecutive days.

Hints:

Calculate industry-adjusted returns using linear regression: stock_return ~ industry_return
Rank stocks by excess return within each (date, industry) group (not just by date)
Top 30% means rank ≤ (total stocks in industry × 0.3)
Use rolling sum to count consecutive days in top 30%
Filter for stocks with days_in_top30pct == 3
Group by date and collect qualifying stocks

Try to solve it yourself before checking the solution below!

Solution:

import polars_ds as pds

(
    data
    # Step 1: Calculate stock returns and industry-level market-cap weights
    .with_columns(
        stock_return=c.retx,
        stock_weight_in_industry=(c.cap / c.cap.sum().over(c.date, c.industry))
    )
    # Step 2: Calculate value-weighted industry return for each industry on each day
    .with_columns(
        industry_return=(c.stock_return * c.stock_weight_in_industry).sum().over(c.date, c.industry)
    )
    # Step 3: Filter stocks with sufficient observations for regression
    .with_columns(num_obs=c.permno.count().over(c.permno))
    .filter(c.num_obs >= 30)
    # Step 4: Calculate industry-adjusted excess returns using linear regression
    .select(
        c.permno,
        c.ticker,
        c.date,
        c.industry,
        c.stock_return,
        c.industry_return,
        # Run regression: stock_return ~ industry_return for each stock
        pds.lin_reg(
            c.industry_return,
            target=c.stock_return,
            add_bias=True,
            return_pred=True
        )
        .over(c.permno)
        .alias('regression_results')
    )
    # Step 5: Unnest regression results to get residuals (industry-adjusted excess returns)
    .unnest('regression_results')
    # Step 6: Rank stocks by excess return WITHIN each industry on each day
    .sort(c.date, c.industry, c.resid, descending=[False, False, True])
    .with_columns(
        # Rank within each (date, industry) group
        rank_in_industry=c.resid.rank(descending=True).over(c.date, c.industry),
        # Count total stocks in each industry on each day
        total_in_industry=c.permno.count().over(c.date, c.industry)
    )
    # Step 7: Create indicator for top 30% within industry
    .with_columns(
        # Top 30% means rank <= (total * 0.3)
        is_top_30pct=pl.when(c.rank_in_industry <= (c.total_in_industry * 0.3)).then(1).otherwise(0)
    )
    # Step 8: Sort by stock and date, then calculate rolling 3-day sum
    .sort(c.permno, c.date)
    .with_columns(
        # Count how many of the last 3 days the stock was in top 30% of its industry
        days_in_top30pct=c.is_top_30pct.rolling_sum(window_size=3).over(c.permno)
    )
    # Step 9: Filter for stocks that were top 30% all 3 days
    .filter(c.days_in_top30pct == 3)
    # Step 10: Group by date and collect qualifying stocks
    .group_by(c.date)
    .agg(
        stocks=c.permno.unique().sort()
    )
    # Step 11: Sort by date and display first 3 days
    .sort(c.date)
    .head(3)
)

Key Differences from Question 62:

Industry-adjusted vs. market-adjusted: We calculate excess returns relative to the stock’s industry, not the overall market
Top 30% within industry vs. top 100 overall: We rank stocks within each industry and take the top 30% (not a fixed number)
Ranking by (date, industry): We use .over(c.date, c.industry) to rank within each industry-day combination
Dynamic threshold: Top 30% means rank <= total_in_industry * 0.3, which varies by industry size

Interpretation:
Each row shows a date and a list of stocks that ranked in the top 30% by industry-adjusted excess return for that day and the previous 2 days. This identifies stocks with sustained relative strength within their sectors—valuable for sector-neutral momentum strategies.

Starting Here...¶

Load the Packages¶

Inspect the Data¶

Find out all stocks whose ticker contains "COM"¶

How many stocks on each day have positive/negative returns?¶

How many stocks on each day have positive/negative returns (only select stocks with market capital larger than $100M)?¶

How many stocks are there in each industry on each day?¶

On each day, is the industry with the most stocks also generates the largest trading volume?¶

How many stocks on each day have a gain of more than 5% and a loss more than 5%?¶

What are the daily total trading volume of the top 10 stocks with the largest daily increase and the top 10 stocks with the largest daily decrease?¶

How many stocks increase by 5% at the time of open and close?¶

What’s the correlation coefficient between trading volume and return?¶

What’s the correlation between industry-level trading volume changes and return dispersion within industries?¶

What is the industry distribution of S&P 500 constituent stocks?¶

What’s the correlation matrix of daily returns across different industries?¶

Which stocks have the largest weight in the S&P 500 index each day?¶

Which stock has the highest trading volume in each industry on each day?¶

What is the ratio of top 10% to bottom 10% trading volume within each industry on each day?¶

What is the average return of high-volume stocks (above 80th percentile) within each industry on each day?¶

What is the correlation between average returns of high-volume stocks (top 10%) and low-volume stocks (bottom 10%) across days?¶

Which industries have above-market trading volume on each day?¶

What is the excess return (alpha) of each stock relative to the market?¶

What is the excess return (alpha) of each stock relative to its industry?¶

What is the correlation between market-adjusted and industry-adjusted excess returns for each stock?¶

How many stocks on each day have had 3 consecutive days of gains or losses?¶

How many stocks on each day have outperformed the market for 3 consecutive days?¶

How many stocks on each day have outperformed the market in at least 4 out of the last 5 trading days?¶

Which stocks in each month have returns exceeding twice the market’s monthly return?¶

Practice Exercise: Which stocks in each month have returns exceeding twice their industry’s monthly return?¶

Which industry had the largest month-over-month decline in total trading volume for each month?¶

What is the maximum drawdown for each stock?¶

What is the win rate for each stock?¶

What is the profit/loss ratio for each stock?¶

Practice Exercise: What is the market’s win rate?¶

What is the market’s profit/loss ratio?¶

What is the win rate for each industry?¶

Can a stock’s monthly trading volume exceed its industry’s daily trading volume?¶

How many stocks enter or exit each industry on each day?¶

How well does trading volume predict next-day returns?¶

What proportion of top 100 intraday gainers are also top 100 daily gainers?¶

Which stocks have ranked in the top 100 by excess return for 3 consecutive days?¶

Practice Challenge: Which stocks have ranked in the top 30% by industry-adjusted excess return for 3 consecutive days?¶

Find out all stocks whose ticker contains `"COM"`¶