Apache Kafka to BigQuery: 2 Easy Methods

Organizations today have access to a wide stream of data. Data is generated from recommendation engines, page clicks, internet searches, product orders, and more. It is necessary to have an infrastructure that would enable you to stream your data as it gets generated and carry out analytics on the go. To aid this objective, incorporating a data pipeline from Apache Kafka to BigQuery is a step in the right direction. 

What is Apache Kafka?


Apache Kafka is an open-source distributed event streaming platform. It provides a reliable pipeline to process data generated from various sources, sequentially and incrementally. Kafka handles both online and offline data consumption, as the ingested data is persisted on disk and replicated within the cluster to prevent data loss. Kafka runs as a distributed system made up of multiple machines that work together in a single cluster. Apache Kafka provides its users with capabilities such as:

  • Publish and subscribe to streams of records
  • Store streams of records in a fault-tolerant way
  • Process streams of records as they occur
  • Provide a framework (Kafka Streams) to develop logic that performs analytics across streams of data.

Kafka is typically used to build real-time streaming data pipelines and streaming applications that react to data as it arrives.
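As a quick illustration of the publish/subscribe model, the sketch below uses the kafka-python client to publish a record to a topic and read it back. The broker address and topic name are hypothetical placeholders, so adjust them to your own setup.

```python
# A minimal publish/subscribe sketch with kafka-python, assuming a broker
# on localhost:9092 and a topic named "orders" (both hypothetical).
import json
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("orders", {"order_id": 42, "amount": 9.99})
producer.flush()  # ensure the record is actually written to the broker

consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for record in consumer:
    print(record.topic, record.partition, record.offset, record.value)
    break  # read a single record for the example
```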


What is Google BigQuery?


BigQuery is a scalable and fully managed data warehouse built by Google that runs super-fast SQL queries. BigQuery’s architecture is built on top of Dremel technology. Dremel is Google’s interactive ad-hoc query system for the analysis of read-only nested data.

BigQuery analyzes data on a massive scale and is fully serverless, abstracting away infrastructure management so that you can focus on analytics. It also provides a partitioning model that lets you choose how your ingested data is organized for querying. The partitioning model is based on the concepts below:

  • Processing Time: The table is partitioned based on the time an event was observed, usually the ingestion time.
  • Event Time: In this case, the table is partitioned based on one of the TIMESTAMP/DATE fields on the incoming record.

These partitions allow you to avoid expensive and time-consuming full table scans, as you only pay for the period you query. BigQuery supports both batch loading and streaming ingestion.
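For illustration, the sketch below uses the google-cloud-bigquery Python client to create a table partitioned on an event-time field, matching the Event Time model described above. The project, dataset, table, and field names are hypothetical.

```python
# A minimal sketch of creating an event-time partitioned table with the
# google-cloud-bigquery client. Project, dataset, table, and field names
# are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.page_clicks"

schema = [
    bigquery.SchemaField("event_ts", "TIMESTAMP"),
    bigquery.SchemaField("page", "STRING"),
    bigquery.SchemaField("user_id", "STRING"),
]

table = bigquery.Table(table_id, schema=schema)
# Partition by the event's own timestamp (event-time partitioning);
# dropping `field` would fall back to ingestion-time partitioning.
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="event_ts",
)
table = client.create_table(table)
print(f"Created {table.table_id}, partitioned on {table.time_partitioning.field}")
```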

Now that we have covered some background information concerning both Apache Kafka and Google BigQuery, next up let us look at the options we have to load data from Kafka to BigQuery.

Method 1: Using Custom Code to Move Data from Kafka to BigQuery
In this method, users have to write code for two processes: streaming data from Kafka, and ingesting that data into BigQuery. This method is suitable for users with a technical background.

Method 2: Using Hevo Data, to Move Data from Kafka to BigQuery

Hevo Data, an Automated Data Pipeline, provides you a hassle-free solution to connect Kafka to BigQuery within minutes with an easy-to-use no-code interface. Hevo is fully managed and completely automates the process of not only loading data from Kafka but also enriching the data and transforming it into an analysis-ready form without having to write a single line of code.

Hevo’s fault-tolerant Data Pipeline offers a faster way to move data from databases or SaaS applications into your BigQuery account. Hevo’s pre-built integration with Kafka along with 100+ other data sources (including 40+ free data sources) will take full charge of the data transfer process, allowing you to focus on key business activities.

Methods to Set up Kafka to BigQuery Connection

You can easily set up your Kafka to BigQuery connection using the following 2 methods:

Method 1: Using Custom Code to Move Data from Kafka to BigQuery

Building a custom-coded data pipeline between Apache Kafka and BigQuery involves 2 steps:

Step 1: Streaming Data from Kafka

There are various methods and open-source tools which can be employed to stream data from Kafka. This blog covers the following methods:

  • Streaming with Kafka Connect
  • Streaming with Apache Beam

Streaming with Kafka Connect

Kafka Connect is an open-source component of Kafka. It was designed by Confluent for the purpose of connecting Kafka with external systems such as databases, key-value stores, and file systems.

It allows users to stream data from Kafka straight into BigQuery with sub-minute latency through its underlying framework. Kafka Connect lets you reuse existing connector implementations, so you don’t need to build new integrations from scratch when moving new data. It provides ‘SINK’ connectors that continuously consume data from Kafka topics and stream it to an external storage location within seconds, and ‘SOURCE’ connectors that ingest entire databases and stream table updates to Kafka topics.

Kafka Connect does not ship with a built-in connector for BigQuery, so you will need to use the open-source BigQuery sink connector developed by WePay. With this connector, BigQuery tables can be auto-generated from the Avro schema, and the connector also helps with schema updates. Because BigQuery streaming is backward compatible, you can easily add new fields with default values and streaming will continue uninterrupted.
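As a rough sketch, connectors are typically registered through the Kafka Connect REST API. The snippet below does this from Python for WePay’s BigQuery sink connector; the worker URL, topic, project, dataset, and keyfile path are hypothetical, and the exact configuration keys vary between connector versions, so treat it as an outline rather than a drop-in configuration.

```python
# A minimal sketch of registering WePay's BigQuery sink connector through
# the Kafka Connect REST API. Worker URL, topic, project, dataset, and
# keyfile path are hypothetical; config keys differ across connector versions.
import json
import requests

connector_config = {
    "name": "kafka-to-bigquery",
    "config": {
        "connector.class": "com.wepay.kafka.connect.bigquery.BigQuerySinkConnector",
        "tasks.max": "1",
        "topics": "page_clicks",                  # hypothetical topic
        "project": "my-gcp-project",              # hypothetical GCP project
        "defaultDataset": "kafka_ingest",         # hypothetical BigQuery dataset
        "keyfile": "/secrets/bq-service-account.json",
        "autoCreateTables": "true",               # build tables from the Avro schema
    },
}

resp = requests.post(
    "http://localhost:8083/connectors",           # Kafka Connect REST endpoint
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector_config),
)
resp.raise_for_status()
print(resp.json())
```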

Using Kafka Connect, the data can be streamed and ingested into BigQuery in real-time. This, in turn, gives users the advantage to carry out analytics on the fly.

Limitations of Streaming with Kafka Connect

  • In this method, data is partitioned only by the processing time.

Streaming Data with Apache Beam

Apache Beam is an open-source, unified programming model for implementing batch and stream data processing jobs that run on a single engine. The Beam model abstracts away the complexity of parallel data processing, allowing you to focus on what your job needs to do rather than how it gets executed.

One of the major downsides of streaming with Kafka Connect is that it can only partition data by processing time, which can lead to data arriving in the wrong partition. Apache Beam resolves this issue, as it supports event-time processing in addition to processing time, for both batch and streaming jobs.

Beam has a supported distributed processing backend called Cloud Dataflow that executes your code as a cloud job, making it fully managed and auto-scaled. The number of workers is fully elastic, changing according to your current workload, and the cost of execution changes accordingly.
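The sketch below outlines what such a Beam pipeline might look like in Python, reading from Kafka with the cross-language ReadFromKafka transform (which requires a Java expansion service under the hood) and writing to BigQuery. Broker, topic, schema, and table names are hypothetical, and running it on Cloud Dataflow additionally requires the usual Dataflow options (project, region, runner).

```python
# A minimal sketch of a Beam streaming pipeline from Kafka to BigQuery.
# Broker, topic, schema, and table names are hypothetical placeholders.
import json
import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)  # add Dataflow options (runner, project, region) as needed

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadFromKafka" >> ReadFromKafka(
            consumer_config={"bootstrap.servers": "broker:9092"},
            topics=["page_clicks"],
        )
        # ReadFromKafka yields (key, value) byte pairs; decode the JSON value,
        # assuming each record is a JSON object matching the table schema below.
        | "DecodeJson" >> beam.Map(lambda kv: json.loads(kv[1].decode("utf-8")))
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "my-project:kafka_ingest.page_clicks",
            schema="event_ts:TIMESTAMP,page:STRING,user_id:STRING",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )
```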

Limitations of Streaming Data with Apache Beam

  • Apache Beam incurs an extra cost for running managed workers
  • Apache Beam is not a part of the Kafka ecosystem.

Step 2: Ingesting Data to BigQuery

Before you start streaming in from Kafka to BigQuery, you need to check the following boxes:

  • Make sure you have Write access to the dataset that contains your destination table, to prevent errors when streaming.
  • Check the quota policy for streaming data on BigQuery to ensure you are not in violation of any of the policies.
  • Ensure that billing is enabled for your GCP (Google Cloud Platform) account. Streaming is not available on the free tier of GCP, so if you want to stream data into BigQuery you have to use the paid tier.

Now, let us discuss the methods to ingest our streamed data from Kafka to BigQuery. The following approaches are covered in this post:

  • Streaming with BigQuery API
  • Batch Loading into Google Cloud Storage (GCS)

Streaming with BigQuery API

The BigQuery API is a data platform that lets users manage, create, share, and query data. It supports streaming data directly into BigQuery, with a default quota of up to 100K rows per second per project.

Real-time data streaming on BigQuery API costs $0.05 per GB. To make use of BigQuery API, it has to be enabled on your account. To enable the API:

  • Ensure that you have a project created.
  • In the GCP Console, click on the hamburger menu, select APIs & Services, and open the Dashboard.
  • In the APIs & Services window, select Enable APIs and Services.
  • A search bar will appear. Enter BigQuery; two results, the BigQuery Data Transfer API and the BigQuery API, will show up. Select and enable both of them.

With the BigQuery API enabled, the next step is to move the data from Apache Kafka into BigQuery through a stream processing framework such as Kafka Streams. Kafka Streams is an open-source library for building scalable streaming applications on top of Apache Kafka, and it lets you execute your code as a regular Java application. The pipeline reads from an ingested Kafka topic, filters and transforms the records in the stream, and writes them to BigQuery. This approach supports both processing-time and event-time partitioning models.
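Kafka Streams itself is a Java library; purely to illustrate the streaming-insert side of this path, the sketch below consumes a topic with the kafka-python client and pushes rows into BigQuery with insert_rows_json. Broker, topic, and table names are hypothetical.

```python
# A minimal sketch: consume a Kafka topic and stream rows into BigQuery
# via the streaming-insert API. Broker, topic, and table are hypothetical.
import json
from kafka import KafkaConsumer
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.kafka_ingest.page_clicks"

consumer = KafkaConsumer(
    "page_clicks",
    bootstrap_servers="broker:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

buffer = []
for record in consumer:
    buffer.append(record.value)
    if len(buffer) >= 500:  # stream in small batches to stay within quota
        errors = client.insert_rows_json(table_id, buffer)
        if errors:
            print("Streaming insert errors:", errors)
        buffer.clear()
```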

Limitations of Streaming with BigQuery API

  • Though streaming with the BigQuery API gives you complete control over your records, you have to design a robust system yourself for it to scale successfully.
  • You have to handle all streaming errors and failures independently.

Batch Loading Into Google Cloud Storage (GCS)

To use this technique, you could make use of Secor, a tool designed to deliver data from Apache Kafka into object storage systems such as GCS and Amazon S3. From GCS, you then load the data into BigQuery using a load job, triggered either manually via the BigQuery UI or through the bq command-line tool.
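Once Secor has landed files in GCS, a load job can pick them up. The sketch below shows one way to do this with the google-cloud-bigquery Python client, assuming hypothetical bucket and table names and newline-delimited JSON files.

```python
# A minimal sketch of a BigQuery load job over JSON files written to GCS.
# Bucket, prefix, and table names are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.kafka_ingest.page_clicks"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    autodetect=True,  # infer the schema from the JSON files
)

load_job = client.load_table_from_uri(
    "gs://my-secor-bucket/page_clicks/*.json",
    table_id,
    job_config=job_config,
)
load_job.result()  # wait for the batch load to finish
print(f"Loaded {client.get_table(table_id).num_rows} rows into {table_id}")
```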

Limitations of Batch Loading in GCS

  • Secor lacks support for the Avro input format, which forces you to always use a JSON-based input format.
  • This is a two-step process that can lead to latency issues. 
  • This technique does not stream data in real-time. This becomes a blocker in real-time analysis for your business. 
  • This technique requires a lot of maintenance to keep up with new Kafka topics and fields. To reflect these changes, you need to manually update the schema of the BigQuery table.

Method 2: Using Hevo Data to Move Data from Kafka to BigQuery


Hevo Data, a No-code Data Pipeline, helps you directly transfer data from Kafka to BigQuery in a completely hassle-free & automated manner. Hevo is fully managed and completely automates the process of not only loading data from your desired source but also enriching the data and transforming it into an analysis-ready form without having to write a single line of code. Hevo takes care of all your data preprocessing needs required to set up the Kafka to BigQuery integration and lets you focus on key business activities.

Hevo provides a one-stop solution for all Kafka use cases and initializes a connection with Kafka Bootstrap Servers to collect the data stored in their Topics & Clusters. Moreover, since Google BigQuery has built-in support for nested and repeated columns, Hevo neither splits nor compresses the JSON data; based on the Source type, arrays may be collapsed into JSON strings or passed as-is to the Destination. You can also leverage Hevo’s Data Mapping feature to ensure that your Google BigQuery data is up-to-date.

Here are the steps to move data from Kafka to BigQuery using Hevo:

  • Authenticate Source: Configure Kafka as the source for your Hevo Pipeline by specifying the Broker and Topic Names.
  • Configure Destination: Configure the Google BigQuery Data Warehouse account, where the data needs to be streamed, as the destination for your Hevo Pipeline.

Check out what makes Hevo amazing:

  • Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
  • Auto Schema Mapping: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data from Kafka files and maps it to the destination schema.
  • Quick Setup: Hevo with its automated features, can be set up in minimal time. Moreover, with its simple and interactive UI, it is extremely easy for new customers to work on and perform operations.
  • Transformations: Hevo provides preload transformations through Python code. It also allows you to run transformation code for each event in the Data Pipelines you set up. You need to edit the event object’s properties received in the transform method as a parameter to carry out the transformation. Hevo also offers drag and drop transformations like Date and Control Functions, JSON, and Event Manipulation to name a few. These can be configured and tested before putting them to use for aggregation.
  • Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
  • Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
  • Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.

With continuous Real-Time data movement, Hevo allows you to combine Kafka data along with your other data sources and seamlessly load it to BigQuery with a no-code, easy-to-setup interface. Try our 14-day full-feature access free trial!

Get Started with Hevo for Free

Conclusion

This article provided you with a step-by-step guide on how you can set up a Kafka to BigQuery connection using custom scripts or using Hevo. However, there are certain limitations associated with the custom-script method. You will need to implement it manually, which will consume your time & resources and is error-prone, and you need full working knowledge of the backend tools to successfully implement the in-house data transfer mechanism. You will also have to regularly update the BigQuery schema yourself as new Kafka topics and fields appear, since the custom pipeline is not fully managed.

Hevo Data provides an Automated No-code Data Pipeline that empowers you to overcome the above-mentioned limitations. Hevo caters to 100+ data sources (including 40+ free sources) and can seamlessly transfer your data from Kafka to BigQuery within minutes. Hevo’s Data Pipeline enriches your data and manages the transfer process in a fully automated and secure manner without having to write any code. It will make your life easier and make data migration hassle-free.

Learn more about Hevo

Want to take Hevo for a spin? Sign up for a 14-day free trial and experience the feature-rich Hevo suite firsthand.

Share your understanding of the Kafka to BigQuery Connection in the comments below!

