1c: Polars III
Now we make the same data work easier to read and easier to scale.
This segment covers two tools: lazy queries for scale and functions for readable pipelines.
Lazy Polars lets us describe the work first and run it later.
That gives Polars room to optimize the whole query, skip unnecessary work such as reading unused columns or rows that are filtered out later, and handle larger datasets on typical research hardware.
The Polars lazy API documentation covers the details.
A huge chained call can work and still be hard to maintain.
firmyear_pipeline = (
    firmyear_source.with_columns(...)
    .rename(...)
    .sort(...)
    .join(
        lookup,
        on="name",
        how="left",
        validate="m:1",
    )
    .join(
        pl.scan_csv(DATA_DIR / "stock.csv"),
        left_on=["id_ticker", "year"],
        right_on=["tic", "yr"],
        how="left",
        validate="1:1",
    )
    .join(
        pl.scan_csv(DATA_DIR / "msft_nyt.csv", try_parse_dates=True)
        .with_columns(...)
        .group_by("id_ticker", "year")
        .agg(...)
        .sort("id_ticker", "year"),
        on=["id_ticker", "year"],
        how="left",
        validate="1:1",
    )
    .with_columns(...)
)

Functions let us give names to research-data steps:
Open notebooks/1c_polars.ipynb.