Streaming data with Azure Event Hub and Azure Stream Analytics
Azure Stream Analytics is a real-time analytics service that lets you process and analyze streaming data on the fly. It is used to uncover insights from data as it is being generated, making it invaluable for time-sensitive applications such as real-time monitoring, alerts, telemetry, anomaly detection, and live dashboards.
Azure Event Hub is a big data streaming platform and event ingestion service. It’s used for capturing, processing, and analyzing real-time data from multiple sources like applications, devices, or services.
For instance, you might have a task to transform and transfer log information from an Azure service, such as a Storage Account, to Azure SQL Database. In the past, we might have manually converted a log file (log.csv) and uploaded the data into SQL Server. Nowadays, in a real-time scenario, companies prefer an automated approach to handling daily activity logs: they want the logs transferred automatically to a data lake, and then a service like Azure Data Factory to copy the delta data to a table.
In a real scenario, we want to gather data from multiple sources. Instead of manually specifying filters and exporting logs every day, we want the data streamed continuously to a data lake, where Azure Data Factory can then pick it up and do its work.
Let’s discuss an architecture. We need an intermediate service to consume data from the source, say Azure Activity logs, and copy it to an Azure Data Lake Gen2 storage account. The first requirement is the ability to ingest or retrieve the data from the source. I mentioned Azure Activity logs here, but in most companies the source systems produce log and telemetry information, such as application metrics, at a rapid rate. That can add up to a substantial volume of metrics and logs per minute that needs to be collected from these diverse sources.
The first step, then, is to capture all of these data points. They can be sent as events to Azure Event Hub, whose primary purpose is to collect events from various source systems in one centralized location. As I mentioned earlier, we could manually upload files to Azure Data Lake, but to consume data from multiple sources in real time an automated approach is preferable, and this is where Azure Event Hub comes in. We also need a consumer, which we will discuss later, to read these events.
Now let me show you a Python program for sending events to Azure Event Hub.
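A minimal version of such a sender might look like the sketch below, using the azure-eventhub SDK (installed with `pip install azure-eventhub`). The connection string, the event hub name, and the order payloads are placeholders, not values from a real environment:

```python
from azure.eventhub import EventHubProducerClient, EventData

# Placeholders: substitute your own namespace connection string and hub name.
CONNECTION_STR = "<EVENT_HUB_CONNECTION_STRING>"
EVENTHUB_NAME = "<EVENT_HUB_NAME>"

producer = EventHubProducerClient.from_connection_string(
    conn_str=CONNECTION_STR,
    eventhub_name=EVENTHUB_NAME,
)

with producer:
    # Events are sent in batches; the batch object enforces the size limit.
    batch = producer.create_batch()
    batch.add(EventData('{"order_id": 1, "quantity": 3}'))  # example payloads
    batch.add(EventData('{"order_id": 2, "quantity": 1}'))
    batch.add(EventData('{"order_id": 3, "quantity": 5}'))
    producer.send_batch(batch)

print("Batch of events sent to Azure Event Hub")
```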
… and another program for receiving events from the Event Hub. This program extracts not only the body of the message, but also the metadata that Event Hub adds to it.
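A matching receiver might look like this sketch (same placeholder connection details as above). Alongside the message body, it prints the metadata that Event Hub attaches to each event:

```python
from azure.eventhub import EventHubConsumerClient

CONNECTION_STR = "<EVENT_HUB_CONNECTION_STRING>"  # placeholder
EVENTHUB_NAME = "<EVENT_HUB_NAME>"                # placeholder

def on_event(partition_context, event):
    # The body is what the producer sent; the rest is added by Event Hub.
    print("Body:           ", event.body_as_str())
    print("Partition ID:   ", partition_context.partition_id)
    print("Offset:         ", event.offset)
    print("Sequence number:", event.sequence_number)
    print("Partition key:  ", event.partition_key)
    print("Enqueued time:  ", event.enqueued_time)

client = EventHubConsumerClient.from_connection_string(
    conn_str=CONNECTION_STR,
    consumer_group="$Default",
    eventhub_name=EVENTHUB_NAME,
)

with client:
    # "-1" means start from the beginning of each partition.
    # receive() blocks until the process is interrupted.
    client.receive(on_event=on_event, starting_position="-1")
```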
Let’s get a little deeper into how Azure Event Hub works. When I sent my order data from my program to Azure Event Hub, the service added some information of its own to each event: the partition ID, the offset, the sequence number, the partition key, et cetera.
When data arrives at our Event Hub, it lands in one of its partitions; by default there are two (you can configure more). When I sent this batch of data, it all went to a single partition, partition ID zero.
So in our example we have two partitions, zero and one. If I ran my sending program again, it would probably send the data to the second partition. As producers send data to Azure Event Hub, the service can distribute that data across multiple partitions.
It’s like having multiple storage spaces in place. We could be sending millions of events to Azure Event Hub, and with multiple partitions you can write your data much more efficiently and get better throughput.
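When ordering matters for a group of related events, you can override that distribution by supplying a partition key, so that all events sharing the key land on the same partition. A minimal sketch with the azure-eventhub SDK; the key and payloads are made-up examples and the connection details are placeholders:

```python
from azure.eventhub import EventHubProducerClient, EventData

producer = EventHubProducerClient.from_connection_string(
    conn_str="<EVENT_HUB_CONNECTION_STRING>",  # placeholder
    eventhub_name="<EVENT_HUB_NAME>",          # placeholder
)

with producer:
    # Events sharing a partition key always land on the same partition,
    # which preserves their relative order.
    batch = producer.create_batch(partition_key="store-042")  # hypothetical key
    batch.add(EventData('{"order_id": 1, "quantity": 3}'))    # example payloads
    batch.add(EventData('{"order_id": 2, "quantity": 1}'))
    producer.send_batch(batch)
```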
In our case, I wrote simple Python programs to send and read events from Azure Event Hub, but in reality there are services that send data to Azure Event Hub for you. It is important to note that:
- Azure Event Hub does not send or receive data on its own; it only serves as a hub that producers write to and consumers read from.
- Azure Event Hub is not a database and cannot act as one.
- You cannot even delete messages from Azure Event Hub; you can only configure a retention period, after which they expire (see the sketch below).
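Because events cannot be deleted, a consumer simply chooses where in the retained stream to start reading; anything older than the retention period is dropped by the service itself. A minimal sketch of reading only recent events, assuming the same placeholder connection details as before:

```python
from datetime import datetime, timedelta, timezone
from azure.eventhub import EventHubConsumerClient

client = EventHubConsumerClient.from_connection_string(
    conn_str="<EVENT_HUB_CONNECTION_STRING>",  # placeholder
    consumer_group="$Default",
    eventhub_name="<EVENT_HUB_NAME>",          # placeholder
)

def on_event(partition_context, event):
    # Reading does not remove the event; other consumers can still read it
    # until the retention period expires.
    print(partition_context.partition_id, event.body_as_str())

with client:
    # Start from events enqueued in the last hour instead of the very
    # beginning of the retained stream.
    client.receive(
        on_event=on_event,
        starting_position=datetime.now(timezone.utc) - timedelta(hours=1),
    )
```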
I also want to give a quick introduction to the Azure Stream Analytics service. It is a real-time analytics and event-processing service. Azure Event Hub, by contrast, is just a service for consuming or ingesting your events.
Those events can come from a variety of data sources: IoT devices, logs, files, applications, et cetera. I wrote a Python program that consumes events from the Event Hub; as noted above, Azure Event Hub itself has no facility for sending or receiving events, so you rely on programs or services that send events to it and read events from it. My second Python program was simply reading the information back from Azure Event Hub.
Now we want to process those events and deliver them to a destination. This is where the real-time service, Azure Stream Analytics, comes in: it can take in data in real time, for example from Azure Event Hubs, process it, and then send it on to a destination such as Azure Synapse or Azure Data Lake.
Next step: look into the workings of Azure Stream Analytics.