Python Tools for Management Research

1a: Polars I

Jason T. Kiley

Polars I

Let’s turn tabular data work into clear, reproducible Python.

This segment is the basic workflow: read data, inspect it, clean it, transform it, and write it back out.

Why dataframes?

Most empirical research data work is tabular data work.

Rows

Observations, records, documents, firm-years, estimates, events.

Columns

Variables, identifiers, timestamps, measures, labels.

Operations

Select, filter, transform, aggregate, join, and write.

Pandas and Polars

Pandas

Older, everywhere, and deeply connected to the Python data ecosystem.

Extremely useful when another package hands us a pandas object, or when pandas reads a format we need.

Polars

Newer, fast, expression-oriented, and designed around efficient dataframe operations.

Useful for research pipelines where we care about readable transformations and performance.

Polars has a short guide for users coming from pandas.

How we’ll use both

Some research formats are still easiest to read with pandas.

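import pandas as pd
import polars as pl

# pandas reads the Stata file into a pandas DataFrame
firmyear_pd = pd.read_stata("../data/firmyear.dta")
# convert once, then do the rest of the work in Polars
firmyear = pl.from_pandas(firmyear_pd)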

Then we do the dataframe work in Polars.

pl.from_pandas() gives us an easy way to use pandas to read data and then work with it in Polars.

Polars philosophy

Polars benefited from years of lessons learned with pandas while starting from a clean slate.

Expressions

Use pl.col() and friends to describe column computations.

Pipelines

Chain operations so the data work reads as a sequence of decisions.

Lazy option

Build a query first, then run it when we ask for the result.

The Polars expression docs are the main reference.

Why the design matters

Polars is designed to stay inside optimized dataframe operations instead of bouncing in and out of slow row-by-row Python work.

The lazy API can see a whole query before running it, which enables query optimization and can reduce unnecessary work. We’ll talk about that in 1c.

The lazy API docs have the deeper version.

What we will practice

Read and inspect

Load a Stata file through pandas, convert to Polars, and inspect shape, columns, schema, and nulls.

Clean and transform

Cast types, rename columns, select columns, filter rows, and create new variables.

Work by group

Use .over() for firm-level calculations inside firm-year data.

Write outputs

Write dated CSV and Parquet files that can be reused later.

Hands-on

Open notebooks/1a_polars.ipynb.