Python Tools for Management Research

1c: Polars III

Jason T. Kiley

Polars III

Now we make the same data work easier to read and easier to scale.

This segment covers two tools: lazy queries for scale and functions for readable pipelines.

Why lazy?

Lazy Polars lets us describe the work first and run it later.

That gives Polars room to optimize the query, skip unnecessary work, and handle larger datasets on typical research hardware.

The lazy API usage docs have the full version.

Why functions?

A huge chained call can work and still be hard to maintain.

firmyear_pipeline = (
    firmyear_source.with_columns(...)
    .rename(...)
    .sort(...)
    .join(
        lookup,
        on="name",
        how="left",
        validate="m:1",
    )
    .join(
        pl.scan_csv(DATA_DIR / "stock.csv"),
        left_on=["id_ticker", "year"],
        right_on=["tic", "yr"],
        how="left",
        validate="1:1",
    )
    .join(
        pl.scan_csv(DATA_DIR / "msft_nyt.csv", try_parse_dates=True)
        .with_columns(...)
        .group_by("id_ticker", "year")
        .agg(...)
        .sort("id_ticker", "year"),
        on=["id_ticker", "year"],
        how="left",
        validate="1:1",
    )
    .with_columns(...)
)

Why functions?

Functions let us give names to research-data steps:

  • read firm-year data;
  • add identifiers;
  • summarize articles;
  • join supporting data;
  • save dated outputs.

Hands-on

Open notebooks/1c_polars.ipynb.