A Data Challenge
I was having a discussion with someone at work about event-level data. If you are not familiar with it, event-level data can look something like this:
GroupID | EventID | EventName | Timestamp | UserID | Details |
---|---|---|---|---|---|
G001 | E101 | Login | 2025-07-30 09:15:23 | U001 | Successful login |
G001 | E102 | View Dashboard | 2025-07-30 09:16:10 | U001 | Loaded user dashboard |
G001 | E103 | Download Report | 2025-07-30 09:18:44 | U001 | Report ID: R567 |
G002 | E201 | Login | 2025-07-30 10:02:11 | U002 | Failed password (2 attempts) |
G002 | E202 | Password Reset | 2025-07-30 10:04:55 | U002 | Sent reset email |
G002 | E203 | Login | 2025-07-30 10:10:32 | U002 | Successful login |
G003 | E301 | Add to Cart | 2025-07-30 11:20:14 | U003 | Item ID: P908 |
G003 | E302 | Checkout | 2025-07-30 11:25:00 | U003 | Order ID: O12345 |
G003 | E303 | Payment Success | 2025-07-30 11:27:10 | U003 | Payment via credit card |
Data like this can balloon into a massive number of records very quickly. Let's assume that every interaction contains 6 events and that you have 6 billion event records. What if you could compress that down to 1 billion and represent every interaction as a single record? It sounded like a cool idea, but I still was not sure whether it was actually more efficient.
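To make the packed shape concrete, here is a minimal sketch using plain pandas rather than framezip's actual API; the column names come from the example table above, and the result is called packed only to line up with the snippet later in the post:
import pandas as pd

# Event-level rows: one record per event, with the group/user fields repeated
events = pd.DataFrame({
    "GroupID":   ["G001", "G001", "G002", "G002"],
    "EventID":   ["E101", "E102", "E201", "E202"],
    "EventName": ["Login", "View Dashboard", "Login", "Password Reset"],
    "Timestamp": ["2025-07-30 09:15:23", "2025-07-30 09:16:10",
                  "2025-07-30 10:02:11", "2025-07-30 10:04:55"],
    "UserID":    ["U001", "U001", "U002", "U002"],
})

# Collapse to one row per interaction: repeated fields stay scalar,
# per-event fields become list (array) columns
packed = (
    events.groupby(["GroupID", "UserID"], as_index=False)
          .agg({"EventID": list, "EventName": list, "Timestamp": list})
)
print(packed)
With 6 events per interaction, the 6 billion event rows become roughly 1 billion rows shaped like this.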
When is this more efficient?
Data Storage:
- This is definitely smaller since repeated fields (GroupID, UserID) are stored once, and columnar/JSON compression can shrink it further.
ETL (processing):
- This would be faster to load/write since there are fewer records (1 per group instead of N per event).
Querying sessions or interactions as a whole:
- If most queries pull all events for a group, this structure is optimal (a quick lookup example follows below).
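For that access pattern, the whole interaction comes back as a single row; a quick sketch against the hypothetical packed frame from earlier:
# Everything for group G002 (all event IDs, names, and timestamps) in one row
session = packed[packed["GroupID"] == "G002"]
print(session)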
When is this less efficient?
- If you need to filter by event attributes often (“count all Login events across all groups”), you’ll have to explode/flatten the arrays in most databases, which can slow queries (see the sketch after this list).
- Some SQL databases handle arrays poorly; it may work best in a document store (MongoDB) or a columnar database (Snowflake, BigQuery, ClickHouse) that supports arrays natively.
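Outside the database, that unpacking step looks like this in pandas; a minimal sketch against the hypothetical packed frame from earlier, where explode turns each array entry back into its own row before filtering:
# Flatten the array columns back to one row per event, then count Login events
flat = packed.explode(["EventID", "EventName", "Timestamp"])
login_count = (flat["EventName"] == "Login").sum()
print(login_count)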
What is the best option?
Columnar Warehouses with array support (Snowflake, BigQuery, ClickHouse)
- Store as a table but use ARRAY columns.
- Querying by session is fast, and they can efficiently unnest arrays when needed.
Is this useful?
I would say possibly. In the case of something like Snowflake, which supports array columns, you could run whatever data you need through framezip to prepare it, and the load into Snowflake only needs something like this:
Push the data processed with framezip to Parquet (pandas needs the pyarrow or fastparquet package installed for this):
packed.to_parquet("sessions.parquet")
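As an optional sanity check, pyarrow can show that the per-event columns were written as Parquet list types, which is the shape the ARRAY columns in the table below are meant to receive:
import pyarrow.parquet as pq

# The per-event columns should show up as list<...> types in the file schema
print(pq.read_schema("sessions.parquet"))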
Load the parquet into Snowflake and define the array columns:
CREATE OR REPLACE TABLE sessions_raw (
    GroupID STRING,
    EventID ARRAY,
    Value ARRAY
);
COPY INTO sessions_raw
FROM @my_stage/sessions.parquet
FILE_FORMAT = (TYPE = PARQUET)
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;  -- maps Parquet fields to the table columns by name
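If you would rather drive the load from Python too, a rough sketch with the snowflake-connector-python package could look like this; the connection parameters and the @my_stage stage name are placeholders, not anything framezip provides:
import snowflake.connector

# Placeholder credentials; in practice these come from your environment or a secrets store
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    warehouse="my_wh",
    database="my_db",
    schema="my_schema",
)
cur = conn.cursor()

# Upload the local Parquet file to the stage, then load it into the table
cur.execute("PUT file://sessions.parquet @my_stage AUTO_COMPRESS=FALSE")
cur.execute("""
    COPY INTO sessions_raw
    FROM @my_stage/sessions.parquet
    FILE_FORMAT = (TYPE = PARQUET)
    MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
""")
cur.close()
conn.close()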
In the end
It was cool to create and publish a module to PyPI. It always feels great to have a random idea and execute on it. It was a lot of fun, I learned a lot through the entire process, and to be honest that is all that really matters.
You can find the module on PyPI here.
Below is a link to framezip on GitHub:
Cheers, Keith