First, let's see how an ORC file looks like.
Now some keywords used in above image and also in your question!
- Stripe - A chunk of data stored in ORC file. Any ORC file is divided into those chunks, called stripes, each sized 250 MB with index data, actual data and some metadata for actual data stored in that stripe.
- Compression - The compression codec used to compress the data stored. ZLIB is the default for ORC.
Index Data - includes min and max values for each column and the row positions within each column. (A bit field or bloom filter could also be included.) Row index entries provide offsets that enable seeking to the right compression block and byte within a decompressed block. Note that ORC indexes are used only for the selection of stripes and row groups and not for answering queries.
Row data - Actual data. Is used in table scans.
Stripe Footer - The stripe footer contains the encoding of each column and the directory of the streams including their location. To describe each stream, ORC stores the kind of stream, the column id, and the stream’s size in bytes. The details of what is stored in each stream depends on the type and encoding of the column.
Postscript - holds compression parameters and the size of the compressed footer.
- File Footer - The file footer contains a list of stripes in the file, the number of rows per stripe, and each column's data type. It also contains column-level aggregates count, min, max, and sum.
Now! Talking about your output from orcfiledump.
- First is general information about your file. The name, location, compression codec, compression size etc.
- Stripe statistics will list all the stripes in your ORC file and its corresponding information. You can see counts and some statistics about Integer columns like min, max, sum etc.
- File statistics is similar to #2. Just for the complete file as opposed to each stripe in #2.
- Last part, the Stripe section, talks about each column in your file and corresponding index info for each of it.
Also, you can use various options with orcfiledump to get "desired" results. Follows a handy guide.
// Hive version 0.11 through 0.14:
hive --orcfiledump <location-of-orc-file>
// Hive version 1.1.0 and later:
hive --orcfiledump [-d] [--rowindex <col_ids>] <location-of-orc-file>
// Hive version 1.2.0 and later:
hive --orcfiledump [-d] [-t] [--rowindex <col_ids>] <location-of-orc-file>
// Hive version 1.3.0 and later:
hive --orcfiledump [-j] [-p] [-d] [-t] [--rowindex <col_ids>] [--recover] [--skip-dump]
[--backup-path <new-path>] <location-of-orc-file-or-directory>
Follows a quick guide to the options used in the commands above.
- Specifying -d in the command will cause it to dump the ORC file data
rather than the metadata (Hive 1.1.0 and later).
- Specifying --rowindex with a comma separated list of column ids will
cause it to print row indexes for the specified columns, where 0 is
the top level struct containing all of the columns and 1 is the first
column id (Hive 1.1.0 and later).
- Specifying -t in the command will print the timezone id of the
- Specifying -j in the command will print the ORC file metadata in JSON
format. To pretty print the JSON metadata, add -p to the command.
- Specifying --recover in the command will recover a corrupted ORC file
generated by Hive streaming.
- Specifying --skip-dump along with --recover will perform recovery
without dumping metadata.
- Specifying --backup-path with a new-path will let the recovery tool
move corrupted files to the specified backup path (default: /tmp).
- is the URI of the ORC file.
- is the URI of the ORC file or
directory. From Hive 1.3.0 onward, this URI can be a directory
containing ORC files.
Hope that helps!