The Need for Inflight Data Formatting in ELT Data Pipelines

The ELT process has transformed data pipelines. By taking full advantage of cloud services, ELT has made the data pipeline process faster and simpler.

As data storage and computation became affordable with the rise of cloud data warehouses, data teams can now load data in its original form and perform transformation afterward at the data warehouse using the ELT process.

ELT has emerged as a preferred technique for setting up a data pipeline over the traditional ETL process, where data loading was slow because of complex computation within the pipeline. Using ELT data pipelines, like Hevo, data teams can load high volumes of data easily and quickly, and give analysts access to fresh, integrated data.

However, the ELT process has a caveat. Because different data sources may store data in different formats, the data loaded into the warehouse may not be consistent, organized, or aligned with the format of the data warehouse tables.

Analysts must then run additional computations after loading to make the data consistent and prepare it for analysis. At Hevo, we believe it’s a better practice to format and clean the data for the warehouse before loading it.

Data teams should be equipped to apply lightweight data formatting logic and formulas on the fly before ingesting data into the warehouse. All these formulas should run without any impact on the loading speed. This gives the ELT process an additional boost and further speeds up analytics.

That is why we built and launched an inflight data formatter on Hevo Data Pipelines. With it, data teams can perform lightweight data formatting on the fly, so that data is formatted for the warehouse while it is being loaded.

What is Inflight Data Formatting?

Inflight Data Formatting is an ELT data pipeline feature conceptualized by Hevo: lightweight data formatting implemented within the pipeline while the data is loaded to the warehouse. The objective of inflight data formatting is to clean and standardize data on the fly without impacting the load performance of the pipeline.

We believe inflight data formatting is essential in modern ELT data pipelines and should be available in all modern data pipeline tools.

Data teams can maintain a consistent format of data at data warehouses by cleaning, enriching, or standardizing data on the fly in the pipeline while loading to the data warehouse.

With inflight data formatting, data teams can convert data that arrives in different formats from different files or sources into a consistent format, within each source’s own pipeline, while loading it to the warehouse. The warehouse then holds consistent, formatted data that can be used immediately to run complex analytics queries and build dashboards.

Case story on how ELT data pipelines work with inflight data formatting

Consider a global e-commerce company with a separate version of its website for each location, such as .us and .au. The company wants to build a dashboard to analyze overall and region-wise traffic on each product page.

To do this, the company needs to load the data for each region from Google Analytics to its data warehouse while adding the respective location name to each record. This can be easily achieved by creating a pipeline with inflight data formatting for each property on Google Analytics.

Using inflight data formatting on Hevo, add a field ‘Country’ and populate it with the respective country name for each region. For a pipeline loading data from .au, set a formatting rule that adds a new field ‘Country’ with the value ‘Australia’.

With this, analysts have access to complete website traffic data for each region, along with the respective location name, in the data warehouse. They can then directly run queries to build a dashboard for product pages with country as a dimension.
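The per-pipeline rule described above can be sketched in plain Python. This is only an illustration of the idea, not Hevo's API: the function name `add_country`, the sample field names `page_path` and `pageviews`, and the dictionary-based record shape are all assumptions made for the example.

```python
def add_country(record: dict, country: str) -> dict:
    """Return a copy of the record with a 'Country' field added.

    The record is a plain dict standing in for one row in flight;
    the real event object in a pipeline tool may look different.
    """
    enriched = dict(record)          # copy so the input row is not mutated
    enriched["Country"] = country    # constant value chosen per pipeline
    return enriched


# Hypothetical row from the pipeline that loads the .au property
au_row = {"page_path": "/products/kettle", "pageviews": 120}
print(add_country(au_row, "Australia"))
```

Each regional pipeline would apply the same rule with its own constant, so the rows from all regions land in the warehouse with a comparable ‘Country’ column.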

Which functions to apply using inflight data formatting?

Most of the basic and lightweight data transformation or formatting functions are available in Hevo’s inflight data formatter. It enables data teams to enrich, split, merge, or normalize data.

The following are the most commonly used inflight formatting functions on Hevo that your data team should consider:

1. JSON Normalization

The process of converting or flattening JSON objects from NoSQL databases like MongoDB into a relational database can be done inflight.

Tasks such as parsing JSON, formatting JSON objects into rows and columns, loading nested data into tables, and setting up relationships between those tables are better performed on the fly using inflight data formatting.

All of this can be automated in the pipeline itself while loading the data, without impacting the loading performance, which saves a lot of post-load transformation time.
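As a rough sketch of what JSON normalization involves, the following Python flattens nested objects into columns and splits a nested array into child rows that point back to the parent. The helper names `flatten` and `normalize`, the `parent_id` column, and the sample order document are hypothetical; a pipeline tool would typically do this automatically and may name columns differently.

```python
def flatten(doc: dict, parent_key: str = "", sep: str = "_") -> dict:
    """Flatten nested dicts into a single level, joining keys with sep."""
    flat = {}
    for key, value in doc.items():
        column = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, column, sep))  # recurse into sub-object
        else:
            flat[column] = value
    return flat


def normalize(doc: dict, array_field: str, id_field: str = "_id"):
    """Split one document into a parent row and a list of child rows.

    Each child row carries the parent's id so the two tables can be joined.
    """
    parent = flatten({k: v for k, v in doc.items() if k != array_field})
    children = [
        {"parent_id": doc[id_field], **flatten(item)}
        for item in doc.get(array_field, [])
    ]
    return parent, children


# Hypothetical MongoDB-style order document
order = {
    "_id": "o1",
    "customer": {"name": "Ann", "address": {"city": "Perth"}},
    "items": [{"sku": "A1", "qty": 2}, {"sku": "B2", "qty": 1}],
}
parent_row, item_rows = normalize(order, "items")
```

Here `parent_row` would go to an orders table and `item_rows` to an order-items table, related through `parent_id`.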

2. Date and Time Formatting

Multiple data sources record dates and times in different formats. However, for analysis, it’s efficient to maintain a single format in your data warehouse. This formatting can be set up within the pipeline itself.

For example, you may want to change the date format for data from Google Analytics to YYYY-MM-DD, or format the time variable for Avg time on page as HH-MM-SS. All of this can be handled while loading your data to the warehouse.
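A minimal sketch of both conversions using only the Python standard library is shown below. It assumes the source date arrives as a compact string such as `20210315` and that the average time on page arrives as a number of seconds; both are assumptions about the incoming data, and the function names are made up for this example.

```python
from datetime import datetime


def to_iso_date(value: str, source_format: str = "%Y%m%d") -> str:
    """Parse a date string in source_format and return it as YYYY-MM-DD."""
    return datetime.strptime(value, source_format).strftime("%Y-%m-%d")


def seconds_to_hms(seconds: float, sep: str = "-") -> str:
    """Convert a duration in seconds to HH-MM-SS (separator configurable)."""
    total = int(round(seconds))
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}{sep}{minutes:02d}{sep}{secs:02d}"


print(to_iso_date("20210315"))   # date reformatted to YYYY-MM-DD
print(seconds_to_hms(75))        # 75 seconds expressed as HH-MM-SS
```

Passing a different `source_format` per pipeline lets each source keep its own input format while the warehouse column stays uniform.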

3. Mask or Hash Data

Your data could include a few sensitive fields, such as your customers’ personal or contact information. To comply with data security laws, this information should not be accessible to business users within your company.

Thus, you can mask or hash fields like email address, contact number, etc., on the fly.
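The sketch below shows one way hashing and masking could look in Python. The function names, the optional `salt` parameter, and the choice of SHA-256 are assumptions for illustration; which algorithm and masking pattern are appropriate depends on your own compliance requirements.

```python
import hashlib


def hash_value(value: str, salt: str = "") -> str:
    """Return a SHA-256 hex digest of the normalized value.

    Normalizing (trim + lowercase) keeps the same email hashing to the
    same digest, so the hashed column can still be used for joins.
    """
    normalized = value.strip().lower()
    return hashlib.sha256((salt + normalized).encode("utf-8")).hexdigest()


def mask_tail(value: str, visible: int = 4, mask_char: str = "*") -> str:
    """Mask everything except the last `visible` characters."""
    if len(value) <= visible:
        return mask_char * len(value)
    return mask_char * (len(value) - visible) + value[-visible:]


# Hypothetical customer record
customer = {"email": "Jane@Example.com", "phone": "5551234567"}
customer["email"] = hash_value(customer["email"])
customer["phone"] = mask_tail(customer["phone"])
```

Hashing keeps a field usable as a join key without exposing its value, while masking keeps a recognizable fragment (here the last four digits) for support or display purposes.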

4. Clean and Filter Events

There are a few events or fields that you would not want to load to your data warehouse. You can filter out such fields or events in the pipeline itself using an inflight data formatter.

If you have a product database and want to filter out events for invalid or out-of-stock products, you can write if-else code for it.

If you have a customer database, you can filter out events for customers who have not made a purchase or whose orders have all been canceled.
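The two filters above could be expressed as simple predicates. The field names used here (`status`, `stock`, `orders`) and the status values are hypothetical and would need to match your own schema.

```python
def keep_product_event(event: dict) -> bool:
    """Return False for invalid or out-of-stock products, True otherwise."""
    if event.get("status") == "invalid":
        return False
    elif event.get("stock", 0) <= 0:
        return False
    else:
        return True


def keep_customer_event(customer: dict) -> bool:
    """Keep customers that have at least one order that was not canceled.

    Customers with no orders, or with only canceled orders, are dropped.
    """
    orders = customer.get("orders", [])
    return any(order.get("status") != "canceled" for order in orders)


# Hypothetical batch of product events
products = [
    {"sku": "A1", "status": "active", "stock": 5},
    {"sku": "B2", "status": "invalid", "stock": 9},
    {"sku": "C3", "status": "active", "stock": 0},
]
to_load = [p for p in products if keep_product_event(p)]
```

Only the events for which the predicate returns True would continue on to the warehouse.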

5. Data Enrichment

You can enrich your data by adding new fields on the fly. For example, if you have different datasets and respective pipelines for each product, then you can add the product name for each dataset while loading it into the data warehouse.

Benefits of Inflight Data Formatting

Data teams prefer to format their data for the data warehouse before loading, as it has many advantages for data management and analytics.

The following are the key benefits:

1. Faster Analytics Process

The most significant advantage of inflight data formatting is that it further speeds up the analytics process.

Formatting your data before loading it to the data warehouse saves time as it eliminates the need to run additional transformation models and workflows on your data to format it.

Setting up an ELT data pipeline with inflight data formatting is the fastest way to move data for analytics. It enables a robust and efficient analysis process.

2. Data Consistency

With inflight formatting, data from different data sources is stored in a single format in a data warehouse. It helps your data team to maintain data consistency at the data warehouse.

Data teams can easily solve the problem of data-type mismatch at the data warehouse and format all the legacy data types for modern cloud data warehouses.

This gives analysts access to consistent and organized data.
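To illustrate the data-type point, here is a small sketch that casts incoming string values to the types a warehouse table might expect. The `SCHEMA` mapping, the field names, and the `coerce` helper are assumptions for this example, not part of any tool's API.

```python
from decimal import Decimal


def to_bool(value) -> bool:
    """Interpret common truthy strings such as 'true', '1', or 'yes'."""
    return str(value).strip().lower() in {"true", "1", "yes"}


# Hypothetical target types for a warehouse table
SCHEMA = {"order_id": int, "amount": Decimal, "is_gift": to_bool}


def coerce(record: dict, schema: dict) -> dict:
    """Return a copy of the record with schema fields cast to target types.

    Fields missing from the schema, and None values, pass through unchanged.
    """
    out = dict(record)
    for field, caster in schema.items():
        if out.get(field) is not None:
            out[field] = caster(out[field])
    return out


raw = {"order_id": "42", "amount": "19.90", "is_gift": "Yes", "note": "fragile"}
clean = coerce(raw, SCHEMA)
```

Applying the same schema in every source pipeline is one way to keep the column types in the warehouse aligned, whatever type each source originally used.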

3. Fully Automated

Most importantly, your data team can fully automate the process of formatting the data in your pipeline for your warehouse. It ensures the data is always compatible with the data warehouse tables and is ready to use for analysis. It saves a lot of time for your data team.

Powerful and Flexible Inflight Data Formatter on Hevo

Hevo provides a flexible and powerful console to set up inflight data formatting. There are two ways to set it up on Hevo: using our drag-and-drop interface or using the Python console.

Using the drag-and-drop interface, you can set up the required formatting in minutes without writing any code.

Using the Python console, you can write your own rules or logic for data formatting.

Here is what our customers say about the inflight data formatter

The ability to transform data using python code, automap object data, and do this all in a user-friendly interface is wonderful.

Alex Rhomberg, Data Analyst, Eagle Point

Hevo does a lot of heavy lifting for us: Time Stamping, JSON flattening, Table Splitting among other transformations.

Samvit Majumdar, Principal Engineer, Whatfix

Try out Inflight Data Formatter with Hevo

Check out how inflight data formatting further optimizes your data pipelines and data stack by signing up for a 14-day free trial.

If you are a Hevo customer and want to set up inflight data formatting, go to the transformation section of the relevant pipeline and set up your rule or function.

