A Data Challenge

I was having a discussion with someone at work and we were talking about event level data. If you are not familiar with event level data it can look something like this:

GroupID EventID EventName Timestamp UserID Details
G001 E101 Login 2025-07-30 09:15:23 U001 Successful login
G001 E102 View Dashboard 2025-07-30 09:16:10 U001 Loaded user dashboard
G001 E103 Download Report 2025-07-30 09:18:44 U001 Report ID: R567
G002 E201 Login 2025-07-30 10:02:11 U002 Failed password (2 attempts)
G002 E202 Password Reset 2025-07-30 10:04:55 U002 Sent reset email
G002 E203 Login 2025-07-30 10:10:32 U002 Successful login
G003 E301 Add to Cart 2025-07-30 11:20:14 U003 Item ID: P908
G003 E302 Checkout 2025-07-30 11:25:00 U003 Order ID: O12345
G003 E303 Payment Success 2025-07-30 11:27:10 U003 Payment via credit card

Data like this can quickly balloon into a massive amount of records very quickly. Let’s assume that every interaction has 6 events inside of it and lets imagine you had 6 billion records. What if you could compress that down to 1 billion and represent every interaction as a single record? It sounded like a cool idea, but I still was not sure if it was actually more efficient or not.

When is this more efficient?

Data Storage:

ETL (processing):

Querying sessions or interactions as a whole:

When is this less efficient?

What is the best option?

Columnar Warehouses with array support (Snowflake, BigQuery, ClickHouse)

Is this useful?

I would say possibly. In the case of something like Snowflake, which can utilize array columns, you could bring whatever data you need through framezip and prepare it for Snowflake. When loading it into Snowflake you only need to do something like this:

Push the data processed with framezip to parquet:

packed.to_parquet("sessions.parquet")

Load the parquet into Snowflake and define the array columns:

CREATE OR REPLACE TABLE sessions_raw (
  GroupID STRING,
  EventID ARRAY,
  Value ARRAY
);

COPY INTO sessions_raw
FROM @my_stage/sessions.parquet
FILE_FORMAT = (TYPE = PARQUET);

In the end

In the end it was cool to create and publish a module to PyPi. It always feels great to have a random idea and execute on it. It was a lot of fun and I learned a lot through the entire process and to be honest that is all that really matters.

You can find the module on PyPi here.

Below is a link to framezip on Github:

Cheers, Keith