Parquet Row Groups, Column Chunks, Pages, And How Trino Reads Them
Parquet is the storage-format layer that sits between Iceberg’s selected data files and Trino’s in-memory execution batches.
For a Trino scan, the useful read shape is:
read file metadata
-> choose useful row groups
-> read useful columns
-> read encoded and compressed Parquet pages
-> decode batches into Trino blocks
-> pass Trino Page objects to operators
This note is the storage-format bridge before tracing a real Iceberg read query.
The previous note explained how Iceberg metadata points to data files. This one
explains what happens after a selected data file is Parquet: how row groups,
column chunks, pages, encodings, and Trino Page objects fit together.
1. Row-Based vs Columnar Storage
Start with a tiny table:
| id | name | age | country |
|---|---|---|---|
| 1 | Alice | 20 | US |
| 2 | Bob | 35 | CA |
| 3 | Chris | 28 | US |
A row-based file stores complete rows together:
1, Alice, 20, US
2, Bob, 35, CA
3, Chris, 28, US
A columnar file stores values by column:
id: 1, 2, 3
name: Alice, Bob, Chris
age: 20, 35, 28
country: US, CA, US
For this query:
SELECT avg(age)
FROM users;
a columnar format can focus on the age column. It does not need to read
id, name, or country just to compute the average.
That is the first Parquet idea to keep:
Parquet is useful to Trino because analytical queries often need some columns,
not every column in every row.
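A minimal Python sketch of the same idea, using the tiny table above. The names `rows`, `column_names`, and `columns` are only for illustration. In-memory lists stand in for file bytes here, so this shows the layout difference, not real I/O savings.

```python
# Row-based layout: each tuple is one complete row.
rows = [
    (1, "Alice", 20, "US"),
    (2, "Bob", 35, "CA"),
    (3, "Chris", 28, "US"),
]
column_names = ("id", "name", "age", "country")

# Columnar layout: one list of values per column.
columns = {
    name: [row[i] for row in rows]
    for i, name in enumerate(column_names)
}

# SELECT avg(age) FROM users; only the age column is touched.
avg_age = sum(columns["age"]) / len(columns["age"])
```

In the columnar shape, the average never looks at `id`, `name`, or `country`.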
2. The Parquet File Layout
A Parquet file has a nested layout:
Parquet file
row group 1
column chunk: id
data pages
column chunk: name
data pages
column chunk: age
data pages
column chunk: country
optional dictionary page
data pages
row group 2
column chunk: id
column chunk: name
column chunk: age
column chunk: country
footer metadata
The vocabulary:
| Term | Meaning |
|---|---|
| Parquet file | One physical data file. |
| Row group | A horizontal batch of rows inside the file. |
| Column chunk | Values for one column inside one row group. |
| Parquet page | Smaller encoded and compressed storage unit inside a column chunk. |
| Footer metadata | File metadata, schema, row group metadata, column statistics, and offsets. |
The two dimensions matter:
row group:
horizontal slice of rows
column chunk:
vertical slice for one column inside that row group
That is why Parquet can support both:
row-group pruning:
skip row batches that cannot match a filter
column pruning:
avoid reading columns the query does not need
3. Row Groups
A row group is a batch of rows.
For a file with one million rows, the layout might be:
users.parquet
row group 1: rows 1 - 100,000
row group 2: rows 100,001 - 200,000
row group 3: rows 200,001 - 300,000
...
Inside each row group, values are stored by column:
row group 1
id column chunk
name column chunk
age column chunk
country column chunk
row group 2
id column chunk
name column chunk
age column chunk
country column chunk
A row group is large enough for its statistics to be worth storing, but usually smaller than the whole file. That makes it a practical skipping boundary.
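The two dimensions can be sketched in a few lines of Python. This is a toy model, not a Parquet writer: `to_row_groups` and `chunk_min_max` are hypothetical helpers, and real writers size row groups by bytes rather than by a fixed row count.

```python
def to_row_groups(rows, column_names, rows_per_group):
    """Cut rows into horizontal batches, then store each batch by column."""
    groups = []
    for start in range(0, len(rows), rows_per_group):
        batch = rows[start:start + rows_per_group]
        # One "column chunk" per column inside this row group.
        chunks = {
            name: [row[i] for row in batch]
            for i, name in enumerate(column_names)
        }
        groups.append(chunks)
    return groups


def chunk_min_max(groups, column):
    """Per-row-group min/max for one column, like footer statistics."""
    return [(min(g[column]), max(g[column])) for g in groups]


users = [
    (1, "Alice", 20, "US"),
    (2, "Bob", 35, "CA"),
    (3, "Chris", 28, "US"),
]
row_groups = to_row_groups(users, ("id", "name", "age", "country"), 2)
age_stats = chunk_min_max(row_groups, "age")
```

With two rows per group, the three-row table becomes two row groups, each holding four column chunks.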
4. Row Group Skipping
Parquet usually stores statistics for each column chunk in a row group. The most useful beginner example is min/max.
Imagine the age column has these row group statistics:
row group 1: age min=10, max=25
row group 2: age min=26, max=45
row group 3: age min=46, max=80
For this query:
SELECT *
FROM users
WHERE age > 50;
Trino can reason:
row group 1:
max age is 25, so skip
row group 2:
max age is 45, so skip
row group 3:
max age is 80, so rows may match
This does not mean Parquet knows the final SQL answer. It means the reader can avoid work for row groups that cannot possibly produce matching rows.
The useful distinction:
statistics can prove "no rows here can match"
statistics often cannot prove "all rows here match"
So Trino may still evaluate a filter after reading a row group that might contain matching rows.
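The reasoning above can be written as a small Python sketch. `ColumnStats`, `may_match_greater_than`, and `all_match_greater_than` are hypothetical names, not Trino or Parquet APIs, and the sketch ignores nulls and missing statistics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnStats:
    min_value: int
    max_value: int


def may_match_greater_than(stats, threshold):
    """False means no row in the group can satisfy `col > threshold`."""
    return stats.max_value > threshold


def all_match_greater_than(stats, threshold):
    """True only when every row in the group satisfies `col > threshold`."""
    return stats.min_value > threshold


# Row group statistics for the age column, copied from the example above.
age_stats = {
    1: ColumnStats(10, 25),
    2: ColumnStats(26, 45),
    3: ColumnStats(46, 80),
}

# WHERE age > 50
to_read = [rg for rg, s in age_stats.items() if may_match_greater_than(s, 50)]
needs_filter = [rg for rg in to_read if not all_match_greater_than(age_stats[rg], 50)]
```

Row group 3 survives pruning, but its minimum is 46, so the statistics cannot prove that every row matches. The filter still has to run on the rows that are read.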
5. Column Chunks And Parquet Pages
A row group contains one column chunk per column.
For this row group:
row group 1: rows 1 - 100,000
the file has separate column chunks:
id column chunk: id values for rows 1 - 100,000
name column chunk: name values for rows 1 - 100,000
age column chunk: age values for rows 1 - 100,000
country column chunk: country values for rows 1 - 100,000
Each column chunk is split into smaller Parquet pages:
country column chunk
dictionary page: unique country values, if dictionary encoding is used
data page 1: encoded values for some rows
data page 2: encoded values for more rows
data page 3: encoded values for more rows
The row group is the larger pruning unit. The Parquet page is the smaller encoded/compressed storage unit inside a column chunk.
That matters because the word page will appear again in Trino. A Parquet page
and a Trino Page are different things.
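A rough Python model of a column chunk as a sequence of pages. `build_column_chunk` is a hypothetical helper. It splits by value count to keep the example short, while real writers split pages by a target byte size.

```python
def build_column_chunk(encoded_values, values_per_page, dictionary=None):
    """Return a list of (page_kind, payload) tuples for one column chunk."""
    pages = []
    if dictionary is not None:
        # The dictionary page, when present, comes before the data pages.
        pages.append(("dictionary", list(dictionary)))
    for start in range(0, len(encoded_values), values_per_page):
        pages.append(("data", encoded_values[start:start + values_per_page]))
    return pages


# Country ids for seven rows, with a three-entry dictionary.
country_chunk = build_column_chunk(
    [0, 1, 0, 0, 2, 1, 0],
    values_per_page=3,
    dictionary=["US", "CA", "MX"],
)
```

The result is one dictionary page followed by three data pages, matching the shape listed above.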
6. Encoding vs Compression
Encoding and compression are related, but they are not the same.
Encoding changes how values are represented.
Compression shrinks the encoded bytes.
The write-side shape is:
original values
-> encode values into a compact representation
-> compress the encoded bytes
-> write bytes to the Parquet file
The read-side shape is:
read compressed bytes
-> decompress bytes
-> decode values or ids
-> produce in-memory batches for the engine
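The write and read shapes can be sketched with the Python standard library. Two assumptions are worth stating. The encoding here is fixed-width little-endian int32, which resembles Parquet's PLAIN encoding for 32-bit integers. `zlib` stands in for codecs such as ZSTD, which need a third-party package in many Python versions. The function names are hypothetical.

```python
import struct
import zlib


def encode_plain_int32(values):
    """Encoding: represent each integer as 4 little-endian bytes."""
    return struct.pack(f"<{len(values)}i", *values)


def decode_plain_int32(data):
    """Reverse of encode_plain_int32."""
    count = len(data) // 4
    return list(struct.unpack(f"<{count}i", data))


def write_page(values):
    encoded = encode_plain_int32(values)   # change the representation
    return zlib.compress(encoded)          # shrink the encoded bytes


def read_page(page_bytes):
    encoded = zlib.decompress(page_bytes)  # undo compression first
    return decode_plain_int32(encoded)     # then undo the encoding


ages = [20, 35, 28] * 1000
encoded_size = len(encode_plain_int32(ages))
page = write_page(ages)
round_trip = read_page(page)
```

The two steps are independent: the encoder knows about integers, while the compressor only sees bytes.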
Two common encoding ideas are dictionary encoding and run-length encoding.
7. Dictionary Encoding
Dictionary encoding is useful when a column has repeated values.
Example:
country:
US
US
CA
US
CA
MX
US
Instead of storing every string repeatedly, Parquet can store unique values:
dictionary:
0 -> US
1 -> CA
2 -> MX
Then the data can store ids:
0, 0, 1, 0, 1, 2, 0
The compact shape is:
dictionary values + integer ids
This is useful for string-like columns with repeated values:
country
status
event_type
tenant_id
category
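A minimal dictionary encoder and decoder in Python, assuming ids are assigned in first-seen order as in the example above. `dictionary_encode` and `dictionary_decode` are illustrative names, not part of any Parquet library.

```python
def dictionary_encode(values):
    """Return (dictionary, ids), with ids assigned in first-seen order."""
    dictionary = []
    index = {}
    ids = []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        ids.append(index[value])
    return dictionary, ids


def dictionary_decode(dictionary, ids):
    """Look each id up in the dictionary to recover the original values."""
    return [dictionary[i] for i in ids]


country = ["US", "US", "CA", "US", "CA", "MX", "US"]
dictionary, ids = dictionary_encode(country)
```

Each distinct string is stored once, and every row costs only a small integer.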
8. Run-Length Encoding
Run-length encoding, or RLE, is useful when the same value or same small id appears many times in a row.
Example:
status:
ACTIVE
ACTIVE
ACTIVE
ACTIVE
ACTIVE
INACTIVE
INACTIVE
ACTIVE
ACTIVE
ACTIVE
A simple RLE representation is:
ACTIVE x 5
INACTIVE x 2
ACTIVE x 3
In Parquet, RLE is especially useful for compact streams of small integers, such as dictionary ids and the levels used for null or nested data.
Dictionary encoding and RLE can work together:
country:
US
US
US
US
CA
CA
CA
MX
MX
dictionary:
0 -> US
1 -> CA
2 -> MX
dictionary ids:
0, 0, 0, 0, 1, 1, 1, 2, 2
RLE over ids:
0 x 4
1 x 3
2 x 2
The exact low-level encoding matters less than the outcome: Parquet can store repeated values compactly, and Trino can often preserve those compact shapes while processing batches.
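Simple run-length encoding fits in a few lines of Python. This sketch stores plain (value, count) pairs. Parquet's actual format is a hybrid of RLE and bit-packing, so treat this as the idea only. The function names are hypothetical.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, count in runs for _ in range(count)]


status = ["ACTIVE"] * 5 + ["INACTIVE"] * 2 + ["ACTIVE"] * 3
status_runs = rle_encode(status)

# Dictionary ids from the country example above.
country_ids = [0, 0, 0, 0, 1, 1, 1, 2, 2]
id_runs = rle_encode(country_ids)
```

RLE only helps when equal values sit next to each other. The same ids in shuffled order would produce many short runs.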
9. Nulls And Nested Data
Parquet also needs to represent nulls and nested structures.
For a nullable column:
nickname:
Alice
null
Chris
the file needs to distinguish:
position 0: value exists
position 1: value is null
position 2: value exists
Parquet uses definition levels for this kind of information. A definition level is a small number that says how much of the value path exists.
For repeated nested data, such as arrays and maps, Parquet also uses repetition levels. Repetition levels help describe whether a value starts a new parent row or continues a repeated field.
The compact mental model:
definition level:
is the value present, or is some part null?
repetition level:
for arrays and maps, does this continue the same parent row?
These levels are usually small repeated numbers, so they are good candidates for compact RLE and bit-packed encoding.
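A simplified Python sketch of both kinds of level. It assumes the two simplest cases: a flat optional column, where the definition level is 0 or 1, and a top-level repeated primitive field, where both levels are 0 or 1. Deeper nesting and Parquet's LIST and MAP layouts use higher level values. All function names are hypothetical.

```python
def split_levels(column):
    """Flat optional column: store a level per row, and only non-null values."""
    def_levels = [0 if value is None else 1 for value in column]
    values = [value for value in column if value is not None]
    return def_levels, values


def assemble_nullable(def_levels, values):
    """Rebuild the column, inserting None where the definition level is 0."""
    remaining = iter(values)
    return [next(remaining) if level == 1 else None for level in def_levels]


def assemble_repeated(rep_levels, def_levels, values):
    """Repetition level 0 starts a new parent row; 1 continues the current one."""
    remaining = iter(values)
    rows = []
    for rep, definition in zip(rep_levels, def_levels):
        if rep == 0:
            rows.append([])
        if definition == 1:  # definition 0 here means an empty list
            rows[-1].append(next(remaining))
    return rows


nickname_levels, nickname_values = split_levels(["Alice", None, "Chris"])
tags = assemble_repeated([0, 1, 0, 0], [1, 1, 0, 1], ["a", "b", "c"])
```

The null itself is never stored as a value. Only the level stream records that position 1 is empty.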
10. How Trino Fits
In Trino terms:
split:
a unit of read work assigned to a worker; often covers a file or file range
connector page source:
the connector object that reads data and produces pages for the engine
Parquet reader:
the format reader that reads Parquet metadata and page data
Parquet row group:
a storage-format batch of rows with column chunks and statistics
Parquet page:
encoded and compressed storage unit inside a column chunk
Trino Block:
in-memory column-shaped batch of values
Trino Page:
in-memory batch of rows; one block per selected column
The word page is overloaded:
Parquet page:
encoded and compressed bytes inside a file
Trino Page:
in-memory batch passed between operators
They are related by the reader path, but they are not the same object.
Parquet pages
-> decompression
-> decoding
-> Trino blocks
-> Trino Page
11. A Simple Trino Read Shape
For this query:
SELECT name
FROM users
WHERE country = 'US';
Trino mainly needs:
country:
needed for the filter
name:
needed for the output
id:
not needed
age:
not needed
The scan flow is closer to:
SQL query
-> split assigned to a worker
-> connector page source reads Parquet metadata
-> skip impossible row groups
-> read needed column chunks
-> read compressed Parquet pages
-> decompress page bytes
-> decode values into Trino blocks
-> build Trino Page objects
-> apply operators such as filter, project, join, aggregation
-> output pages
For a dictionary-encoded country column, the file might contain:
dictionary:
0 -> US
1 -> CA
2 -> MX
ids:
0, 0, 1, 0, 2, 1
Trino may keep a similar compact shape in memory:
dictionary block:
dictionary values: US, CA, MX
ids: 0, 0, 1, 0, 2, 1
The filter can identify matching positions:
position 0: id 0 -> US -> keep
position 1: id 0 -> US -> keep
position 2: id 1 -> CA -> drop
position 3: id 0 -> US -> keep
position 4: id 2 -> MX -> drop
position 5: id 1 -> CA -> drop
If the query also returns name, Trino uses those kept positions to keep the
matching names from the name block.
Some operations can preserve compact block shapes. Other operations materialize new values. The important rule is:
Trino works in columnar batches, not row-by-row objects.
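The filter walk-through above can be modeled in Python. This is a toy model, not Trino's Java API: `DictionaryBlockSketch` and `filter_positions` are hypothetical, and the six names are invented sample data. Evaluating the predicate once per dictionary entry is one way an engine can exploit the compact shape.

```python
from dataclasses import dataclass


@dataclass
class DictionaryBlockSketch:
    dictionary: list  # distinct values
    ids: list         # one dictionary id per row position


def filter_positions(block, wanted):
    """Return the row positions whose dictionary value equals `wanted`."""
    # Check the predicate once per dictionary entry, then reuse per row.
    entry_matches = [value == wanted for value in block.dictionary]
    return [pos for pos, entry_id in enumerate(block.ids) if entry_matches[entry_id]]


country_block = DictionaryBlockSketch(["US", "CA", "MX"], [0, 0, 1, 0, 2, 1])

# Hypothetical name values for the same six positions.
name_block = ["Alice", "Bob", "Chris", "Dana", "Eve", "Frank"]

# SELECT name FROM users WHERE country = 'US';
kept = filter_positions(country_block, "US")
selected_names = [name_block[pos] for pos in kept]
```

The filter produces positions, and those positions are applied to the other column. No per-row object is ever built.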
12. A Footer Check From The Demo File
For the demo file, I exported Parquet footer metadata as CSV from a Parquet metadata viewer. This is better evidence than a row-preview screenshot because it shows the storage layout behind the decoded rows.
The CSV has one row per row-group column chunk:
100 row groups * 4 columns = 400 metadata rows
The inspected columns are:
orderkey
custkey
orderstatus
totalprice
The useful footer facts:
| Fact | Observed value |
|---|---|
| Row groups | 100 |
| Columns per row group | 4 |
| Rows per row group | mostly 48 to 55 |
| Compression | ZSTD |
| orderstatus encoding | PLAIN_DICTIONARY, BIT_PACKED, RLE |
Side note: I can inspect the same kind of footer metadata from the command line
too. For a local file, parquet-tools inspect <file.parquet> shows the schema,
row group count, columns, physical/logical types, compression, and related file
metadata without uploading the file anywhere.
Here is a narrowed excerpt from the CSV export. It includes every column chunk
row for row groups 0 and 1, then keeps only the totalprice rows for row
groups 2 through 4. The ... rows stand in for the other column chunks in
those row groups.
| Row group | Rows | Column | Type | Min | Max | Compression | Encodings |
|---|---|---|---|---|---|---|---|
| 0 | 50 | orderkey | INT64 | 1 | 50 | ZSTD | PLAIN, BIT_PACKED, RLE |
| 0 | 50 | custkey | INT64 | 1000 | 1009 | ZSTD | PLAIN_DICTIONARY, BIT_PACKED, RLE |
| 0 | 50 | orderstatus | BYTE_ARRAY | F | P | ZSTD | PLAIN_DICTIONARY, BIT_PACKED, RLE |
| 0 | 50 | totalprice | DOUBLE | 57.25 | 412.5 | ZSTD | PLAIN, BIT_PACKED, RLE |
| 1 | 50 | orderkey | INT64 | 51 | 100 | ZSTD | PLAIN, BIT_PACKED, RLE |
| 1 | 50 | custkey | INT64 | 1000 | 1009 | ZSTD | PLAIN_DICTIONARY, BIT_PACKED, RLE |
| 1 | 50 | orderstatus | BYTE_ARRAY | F | P | ZSTD | PLAIN_DICTIONARY, BIT_PACKED, RLE |
| 1 | 50 | totalprice | DOUBLE | 419.75 | 775.0 | ZSTD | PLAIN, BIT_PACKED, RLE |
| 2 | ... | ... | ... | ... | ... | ... | ... |
| 2 | 50 | totalprice | DOUBLE | 782.25 | 1137.5 | ZSTD | PLAIN, BIT_PACKED, RLE |
| 3 | ... | ... | ... | ... | ... | ... | ... |
| 3 | 50 | totalprice | DOUBLE | 1144.75 | 1500.0 | ZSTD | PLAIN, BIT_PACKED, RLE |
| 4 | ... | ... | ... | ... | ... | ... | ... |
| 4 | 55 | totalprice | DOUBLE | 1507.25 | 1898.75 | ZSTD | PLAIN, BIT_PACKED, RLE |
The orderstatus line is a good concrete example for this note. It is a
repeated string column with values like F, O, and P, and the footer shows
dictionary-style encoding:
path_in_schema: orderstatus
type: BYTE_ARRAY
compression: ZSTD
encodings: PLAIN_DICTIONARY, BIT_PACKED, RLE
stats_min_value: F
stats_max_value: P
The numeric columns show row-group min/max ranges. For totalprice, the first
few row groups look like:
| Row group | Min totalprice | Max totalprice |
|---|---|---|
| 0 | 57.25 | 412.5 |
| 1 | 419.75 | 775.0 |
| 2 | 782.25 | 1137.5 |
| 3 | 1144.75 | 1500.0 |
| 4 | 1507.25 | 1898.75 |
That is the kind of metadata a reader can use for pruning. If a query asks for:
WHERE totalprice > 1000
then the first two row groups cannot match based on their max value. Later row groups may match, so Trino still needs to read and evaluate rows from those groups.
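The same check can be run against the exported footer metadata. The sketch below inlines the five totalprice rows from the excerpt as CSV text. The header order is an assumption: it reuses the field names that appear in this note (`row_group_id`, `path_in_schema`, `stats_max_value`, and so on), and a real export may order or name them differently. `row_groups_to_read` is a hypothetical helper.

```python
import csv
import io

# Assumed CSV shape; values are the totalprice rows from the excerpt above.
FOOTER_CSV = "\n".join([
    "row_group_id,row_group_num_rows,path_in_schema,type,stats_min_value,stats_max_value",
    "0,50,totalprice,DOUBLE,57.25,412.5",
    "1,50,totalprice,DOUBLE,419.75,775.0",
    "2,50,totalprice,DOUBLE,782.25,1137.5",
    "3,50,totalprice,DOUBLE,1144.75,1500.0",
    "4,55,totalprice,DOUBLE,1507.25,1898.75",
])


def row_groups_to_read(csv_text, column, lower_bound):
    """Row groups whose max for `column` leaves `column > lower_bound` possible."""
    keep = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["path_in_schema"] != column:
            continue
        if float(row["stats_max_value"]) > lower_bound:
            keep.append(int(row["row_group_id"]))
    return keep


# WHERE totalprice > 1000
candidates = row_groups_to_read(FOOTER_CSV, "totalprice", 1000)
```

Row groups 0 and 1 drop out on their max values. Row group 2 stays because its range, 782.25 to 1137.5, straddles 1000, so its rows still need the filter.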
This is enough evidence for the post:
row groups:
visible through row_group_id and row_group_num_rows
column chunks:
one metadata row per column inside each row group
statistics:
visible through stats_min_value and stats_max_value
encoding and compression:
visible through encodings and compression
13. Where This Sits With Iceberg
The previous post described the Iceberg metadata chain:
catalog
-> current metadata.json
-> current snapshot
-> manifest list
-> manifest file
-> data file
That chain chooses data files.
Parquet starts after a selected data file is opened:
Parquet file
-> footer metadata
-> row groups
-> column chunks
-> Parquet pages
The layers are separate:
Iceberg metadata pruning:
skip manifests or whole data files
Parquet pruning:
skip row groups, column chunks, or page ranges inside a selected file
Trino execution:
process decoded batches as Page and Block objects
This distinction matters when reading EXPLAIN and EXPLAIN ANALYZE.
EXPLAIN can show that a constraint reached the table scan. It does not, by
itself, prove exactly how many Parquet row groups or pages were skipped at
runtime.
For runtime proof, I need to combine:
EXPLAIN:
planned scan shape and constraints
EXPLAIN ANALYZE:
runtime rows, physical input, splits, and operator stats
Iceberg metadata tables:
which files and partitions were candidates
Parquet/file-level evidence:
row group and page-level pruning behavior, when needed
That proof pattern belongs more naturally in the next read-trace note. This note is just the vocabulary needed before that trace.
14. What To Remember
- Parquet is columnar, so Trino can read useful columns instead of every column.
- A Parquet file is organized into row groups, column chunks, and Parquet pages.
- Row groups can carry statistics such as min/max values.
- Row-group statistics can prove that a row group cannot match a predicate.
- Column chunks let the reader avoid unneeded columns.
- Parquet pages are encoded and compressed storage units.
- Encoding changes representation; compression shrinks bytes.
- Dictionary encoding and RLE help repeated values stay compact.
- A Parquet page is not a Trino Page.
- Trino Page objects are in-memory batches made of Block objects.
- Iceberg metadata pruning and Parquet row-group/page pruning are different layers.
15. Self-Check
Questions to answer without looking back:
- Why is Parquet useful for analytical queries?
- What is the difference between a row group and a column chunk?
- What is stored in a Parquet page?
- Why can min/max statistics let Trino skip a row group?
- Why does row-group skipping not replace SQL filter evaluation in every case?
- What is the difference between encoding and compression?
- Why is dictionary encoding useful for repeated string-like columns?
- What is RLE good at representing?
- Why is a Parquet page different from a Trino Page?
- Which layer skips whole data files: Iceberg metadata or Parquet metadata?
- Which layer produces Block and Page objects for Trino operators?
16. References
- Apache Parquet concepts
- Trino concepts
- Trino Iceberg connector
- Parquet Viewer
- Chatdb Parquet Metadata Reader