I’ve been building ai-file-analyzer, a Rust library that uses Google’s Gemini AI to parse Australian Tax Office (ATO) transaction files. The straightforward way to parse these files allocates a new String for every cell value. With 10,000 rows and 7 columns, that’s 70,000 heap allocations. I wanted to find out whether a zero-copy approach, where parsed rows borrow from a backing store instead of owning their data, would deliver real performance improvements, and at what file size those improvements would actually matter.
The answer depends entirely on dataset size. For the ATO files I actually process (usually 100 to 200 rows), zero-copy parsing is slower because of abstraction overhead. The optimization only pays off at around 10,000+ rows. This post shows how I built it and what the benchmarks told me.
The Standard Approach
The normal way to parse rows looks like this:
pub struct ParsedRow {
    pub row_index: usize,
    pub processed_date: Option<String>,
    pub effective_date: Option<String>,
    pub description: String,       // Owned
    pub debit: f64,
    pub credit: f64,
    pub running_balance: f64,
    pub reference: Option<String>, // Owned
    pub raw_data: Vec<String>,     // Owned
}
Every String field needs a heap allocation. The parsing function walks the DataFrame columns and materializes each cell as an owned String:
pub fn parse_dataframe(df: &DataFrame, structure: &FileStructure) -> (Vec<ParsedRow>, ParsingStats) {
    let columns: Vec<Vec<String>> = df
        .get_columns()
        .iter()
        .map(|col| (0..df.height()).map(|i| extract_cell_value(col, i)).collect())
        .collect();
    // ... iterate and build ParsedRow with owned strings
}
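Both versions funnel every cell through extract_cell_value(), which the post doesn’t show. A plausible reconstruction, assuming a recent polars where get_columns() yields Column values and get() returns an AnyValue scalar, might look like this (my sketch, not the library’s actual code):
use polars::prelude::*;

// Hypothetical reconstruction of the shared cell-extraction helper.
fn extract_cell_value(col: &Column, i: usize) -> String {
    match col.get(i) {
        Ok(AnyValue::Null) | Err(_) => String::new(),
        // Every non-null cell becomes an owned String: the allocation zero copy targets.
        Ok(av) => av.to_string(),
    }
}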
The Zero-Copy Alternative
The zero-copy version uses a backing store that owns all the string data up front:
pub struct ColumnData {
    columns: Vec<Vec<String>>, // columns[col_idx][row_idx]
    num_rows: usize,
}
impl ColumnData {
    pub fn from_dataframe(df: &DataFrame) -> Self {
        let num_rows = df.height();
        let columns: Vec<Vec<String>> = df
            .get_columns()
            .iter()
            .map(|col| {
                let mut col_data = Vec::with_capacity(num_rows);
                for i in 0..num_rows {
                    col_data.push(extract_cell_value(col, i));
                }
                col_data
            })
            .collect();
        Self { columns, num_rows }
    }

    #[inline]
    pub fn num_rows(&self) -> usize {
        self.num_rows
    }

    #[inline]
    pub fn get(&self, col_idx: usize, row_idx: usize) -> &str {
        self.columns
            .get(col_idx)
            .and_then(|col| col.get(row_idx))
            .map(String::as_str)
            .unwrap_or("")
    }

    /// True when every cell in the row is blank; the parser skips such rows.
    pub fn is_row_empty(&self, row_idx: usize) -> bool {
        (0..self.columns.len()).all(|c| self.get(c, row_idx).trim().is_empty())
    }
}
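The store is built once per file, and get() hands out borrowed views, falling back to an empty string for out-of-range indices. A small illustrative fragment:
let column_data = ColumnData::from_dataframe(&df);
let cell: &str = column_data.get(2, 0);  // borrowed view, no allocation
assert_eq!(column_data.get(99, 0), ""); // out-of-range indices yield ""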
The parsed row then borrows from this store:
pub struct ParsedRowRef<'a> {
    pub row_index: usize,
    pub processed_date: Option<String>, // Still owned (date normalization)
    pub effective_date: Option<String>, // Still owned
    pub description: &'a str,           // Borrowed
    pub debit: f64,
    pub credit: f64,
    pub running_balance: f64,
    pub reference: Option<&'a str>,     // Borrowed
}
The lifetime 'a ties each ParsedRowRef to the ColumnData it borrows from; the borrow checker guarantees the store outlives every row that references it. Note that processed_date and effective_date are still owned because date parsing involves normalization (converting Excel serial numbers, handling different date formats), which has to create new strings.
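For example, this refuses to compile, because the backing store is dropped while the rows still borrow from it:
let rows = {
    let column_data = ColumnData::from_dataframe(&df);
    parse_dataframe_zero_copy(&column_data, &structure).0
    // error[E0597]: `column_data` does not live long enough
};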
How Zero-Copy Parsing Works
Here’s the zero-copy parsing function:
pub fn parse_dataframe_zero_copy<'a>(
    column_data: &'a ColumnData,
    structure: &FileStructure,
) -> (Vec<ParsedRowRef<'a>>, ParsingStats) {
    let mut stats = ParsingStats::default(); // assumes ParsingStats: Default; bookkeeping elided
    let mut parsed_rows = Vec::with_capacity(
        column_data.num_rows().saturating_sub(structure.skip_rows)
    );
    for row_idx in structure.skip_rows..column_data.num_rows() {
        if column_data.is_row_empty(row_idx) {
            continue;
        }
        if let Some(parsed) = parse_row_zero_copy(column_data, row_idx, &structure.column_mapping) {
            parsed_rows.push(parsed);
        }
    }
    (parsed_rows, stats)
}
fn parse_row_zero_copy<'a>(
    column_data: &'a ColumnData,
    row_idx: usize,
    mapping: &ColumnMapping,
) -> Option<ParsedRowRef<'a>> {
    let get = |idx: Option<usize>| -> &'a str {
        idx.map(|i| column_data.get(i, row_idx)).unwrap_or("")
    };
    // No allocations here, just references
    let description = get(mapping.description);
    let reference = mapping.reference.map(|i| column_data.get(i, row_idx));
    Some(ParsedRowRef {
        row_index: row_idx,
        processed_date: normalize_date(get(mapping.processed_date)),
        effective_date: normalize_date(get(mapping.effective_date)),
        description,
        debit: parse_amount(get(mapping.debit)),
        credit: parse_amount(get(mapping.credit)),
        running_balance: parse_amount(get(mapping.running_balance)),
        reference: reference.filter(|s| !s.is_empty()),
    })
}
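The numeric helpers are elided above. A minimal sketch of parse_amount, assuming it strips currency formatting and falls back to 0.0 on anything unparseable (the real helper likely handles more edge cases):
fn parse_amount(s: &str) -> f64 {
    s.trim()
        .trim_start_matches('$')
        .replace(',', "") // temporary String, dropped immediately
        .parse()
        .unwrap_or(0.0)
}
It takes &str and returns f64, so it composes with the borrowed cells; the transient copy inside never ends up in a ParsedRowRef.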
You only convert to owned data when you actually need it:
impl<'a> ParsedRowRef<'a> {
    pub fn into_owned(&self, column_data: &ColumnData) -> ParsedRow {
        ParsedRow {
            row_index: self.row_index,
            processed_date: self.processed_date.clone(),
            effective_date: self.effective_date.clone(),
            description: self.description.to_string(),
            debit: self.debit,
            credit: self.credit,
            running_balance: self.running_balance,
            reference: self.reference.map(|s| s.to_string()),
            raw_data: column_data.row_iter(self.row_index)
                .map(|s| s.to_string())
                .collect(),
        }
    }
}
Benchmark Results
I benchmarked three dataset sizes: 126 rows (an actual ATO file), 2,000 rows, and 20,000 rows.
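A minimal wall-clock harness along these lines is enough to reproduce this kind of measurement (a sketch, not necessarily the exact benchmark code; note that the zero-copy timing includes building ColumnData, since that work is part of its cost):
use std::time::Instant;

let start = Instant::now();
let (rows, _) = parse_dataframe(&df, &structure);
println!("Standard parsing: {} rows in {:.2?}", rows.len(), start.elapsed());

let start = Instant::now();
let column_data = ColumnData::from_dataframe(&df);
let (rows_zc, _) = parse_dataframe_zero_copy(&column_data, &structure);
println!("Zero copy parsing: {} rows in {:.2?}", rows_zc.len(), start.elapsed());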
126 rows (real production file size):
Standard parsing: 126 rows in 4.13ms
Zero-copy parsing: 126 rows in 9.66ms
Zero copy is 2.34x slower

2,000 rows:
Standard parsing: 1,999 rows in 49.59ms
Zero-copy parsing: 1,999 rows in 58.81ms
Zero copy is 1.19x slower

20,000 rows:
Standard parsing: 19,999 rows in 422.49ms
Zero-copy parsing: 19,999 rows in 393.56ms
Zero copy is 1.07x faster
The crossover point is somewhere between 2,000 and 20,000 rows. At smaller sizes, the extra indirection through ColumnData::get() costs more than the allocations it saves (lifetimes themselves are compile-time only and cost nothing at runtime). There’s also a structural reason the win is modest: both paths still extract every cell into an owned String up front; zero copy only eliminates the second, per-row round of allocations. On top of that, Rust’s allocator handles small, short-lived strings very well. 70,000 allocations sounds like a lot, but modern allocators are built for exactly this pattern.
Where Zero Copy Actually Helps
The real win for zero copy shows up in filtering workflows. If you only need some of the rows, you don’t allocate memory for rows you’re going to throw away:
let column_data = ColumnData::from_dataframe(&df);
let (rows, _) = parse_dataframe_zero_copy(&column_data, &structure);

// Filter without allocating
let large_debits: Vec<_> = rows.iter()
    .filter(|r| r.is_debit() && r.debit > 100.0)
    .collect();

// Allocate only for the rows that survive the filter
let owned: Vec<ParsedRow> = large_debits.iter()
    .map(|r| r.into_owned(&column_data))
    .collect();
In the 20,000 row benchmark, this converted only the 9,584 matching rows instead of all 19,999. The standard approach would allocate owned strings for every row, filter, and then throw away everything that didn’t match.
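For contrast, the owned-string path has to materialize every row before the filter can run (assuming ParsedRow exposes the same is_debit() helper):
// All ~20,000 rows are fully allocated up front...
let (all_rows, _) = parse_dataframe(&df, &structure);
// ...and the filter throws most of that work away.
let large_debits: Vec<ParsedRow> = all_rows.into_iter()
    .filter(|r| r.is_debit() && r.debit > 100.0)
    .collect();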
Conclusion
For my actual use case, parsing ATO ICA files that rarely exceed a few hundred rows, the zero-copy implementation doesn’t help. I’ll keep it in the library for users with bigger datasets, but the default API remains the plain owned-string approach. The optimization is worth considering if you’re processing CSV exports with tens of thousands of rows, streaming data where memory pressure matters, or building pipelines that filter heavily before saving results.
The implementation cost is real: lifetime parameters spread through your API, self-referential patterns need careful design, and explaining ParsedRowRef<'a> to users is harder than explaining ParsedRow. For small datasets, the simpler approach wins on everything except theoretical purity.