How to collect and process massive DPI data in real time is a hot research topic for operators. The traditional MapReduce-based batch mode cannot meet the real-time requirements of streaming computation. This paper therefore first introduces the basic concepts of stream processing, then compares the currently popular streaming computing frameworks and proposes a DPI data processing scheme based on stream computing, which is applied in practical projects to meet telecom operators' real-time data processing requirements. Finally, the application scenarios of stream processing are summarized from this practice.
1 Introduction

With the continuous development of the mobile Internet and the increasing penetration of smart devices into daily life, the amount of data generated by human society is growing exponentially, and we have officially entered the era of big data. Operators can now obtain more and more user data; through DPI (Deep Packet Inspection) analysis technology, they can better identify the types of traffic on the network, the types of applications at the application layer, and so on. In this era of "data is king", how to make full use of this important strategic asset has become a top priority.
The rapid growth of data volume has brought huge challenges to the analysis and processing of big data. In the communication industry in particular, data increasingly exhibits unbounded, bursty and real-time characteristics. The traditional MapReduce-based batch mode can hardly meet the real-time requirements, and whether the information contained in the data can be obtained immediately determines the data's value. Stream processing has therefore become a new hotspot in big data research. It reacts to data changes in real time and can deliver results within seconds, which makes it especially suitable for scenarios with strict timeliness requirements.
Combined with the needs of telecom operators, this paper collects and processes DPI data in real time and proposes a DPI data processing scheme based on stream computing, which reduces the delay in obtaining real-time information from DPI data to minutes or even seconds. It realizes real-time processing, monitoring and categorization of telecom users' online behavior, providing a good foundation for subsequent big data applications.
2 Streaming Overview
The traditional big data processing technology based on MapReduce is essentially a batch method, as shown in Figure 1. In batch mode, data must first be accumulated and stored; the Hadoop client then uploads the data to HDFS, and finally Map/Reduce jobs are started to process it, writing the results back to HDFS. All data must be prepared before centralized computation and value discovery can take place, which cannot meet real-time requirements.
Figure 1 Big data processing based on MapReduce (batch mode)
In 2015, Nathan Marz proposed the Lambda architecture, a real-time big data processing framework that integrates offline computing and real-time computing. It can meet the high fault tolerance, low latency and scalability requirements of real-time systems, and can integrate big data components such as Hadoop, Kafka, Storm, Spark and HBase.
A typical Lambda architecture is shown in Figure 2; it is mainly used in latency-sensitive applications with complex processing logic. The data is fed into the real-time system and the batch system in parallel, each produces its own results, and the results are merged on the query side.
Figure 2 Lambda architecture diagram
3 Comparison of Streaming Computing Architectures
Streaming computing places high demands on the fault tolerance, latency, scalability and reliability of the system. Many streaming computing frameworks (such as Spark Streaming, Storm, Kafka Stream, Flink and PipelineDB) are now widely used across industries; they are still iterating rapidly, and their applicable scenarios differ.
3.1 Spark Streaming
Spark is a computing framework designed for big data processing by the AMP Lab at the University of California, Berkeley. Spark Streaming, one of Spark's core components, is a real-time computing framework built on top of Spark. Through its built-in API and efficient memory-based engine, users can develop applications that combine stream processing, batch processing and interactive queries.
Unlike streaming frameworks that process one record at a time, Spark Streaming discretizes the stream and processes a batch of data (a DStream) at a time, enabling fast batch execution at second-level or finer latency, as shown in Figure 3. Spark Streaming's Receivers receive data in parallel and cache it in memory; after latency optimization, the Spark engine executes the batches as short tasks (tens of milliseconds). This design allows Spark Streaming to handle both offline and stream processing problems.
Figure 3 Spark Streaming execution process
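The discretization idea can be sketched in a few lines of plain Java. This is a schematic of the micro-batch model only, not Spark's actual API; the one-second interval, string records and counting "job" are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MicroBatchDemo {
    // Assign each record to a micro-batch by truncating its timestamp
    // to the batch interval, the way a DStream discretizes a stream.
    public static Map<Long, List<String>> discretize(
            long[] timestamps, String[] values, long intervalMs) {
        Map<Long, List<String>> batches = new TreeMap<>();
        for (int i = 0; i < timestamps.length; i++) {
            long batchStart = (timestamps[i] / intervalMs) * intervalMs;
            batches.computeIfAbsent(batchStart, k -> new ArrayList<>()).add(values[i]);
        }
        return batches;
    }

    // Each micro-batch is then handed to the engine as one short job;
    // here the "job" just counts the records in the batch.
    public static Map<Long, Integer> runBatchJobs(Map<Long, List<String>> batches) {
        Map<Long, Integer> counts = new TreeMap<>();
        batches.forEach((start, recs) -> counts.put(start, recs.size()));
        return counts;
    }

    public static void main(String[] args) {
        long[] ts = {0, 400, 999, 1200, 2500};
        String[] vals = {"a", "b", "c", "d", "e"};
        // With a 1000 ms interval, the five records form three micro-batches.
        System.out.println(runBatchJobs(discretize(ts, vals, 1000)));
        // {0=3, 1000=1, 2000=1}
    }
}
```

The latency floor of this model is the batch interval itself, which is why Spark Streaming is described below as near-real-time rather than truly record-at-a-time.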
Spark Streaming can quickly restore state after failures and errors. It unifies batch and stream processing, ships with rich libraries of advanced algorithms, develops rapidly and has an active community; it is without doubt a standout among streaming frameworks. Its disadvantage is that it must accumulate a small batch of records before processing, so its latency is slightly higher, making it a near-real-time system.
3.2 Storm

Storm is often compared to a "real-time Hadoop". It is a real-time, distributed and highly fault-tolerant computing system developed at Twitter that can process large data streams simply and reliably, and users can develop applications in any programming language.
In Storm, the graph-like structure used for real-time computation is called a topology. A topology is submitted to the cluster, where the master node distributes the code and assigns tasks to worker nodes for execution. A topology includes two roles: spouts and bolts. A spout is the source of a stream, emitting data in the form of tuples; a bolt transforms the streams it receives, performing operations such as mapping and filtering, and a bolt may itself emit data onward to other bolts.
Figure 4 Storm data flow
Storm moves data from bolt to bolt, realizing true record-at-a-time stream processing. It is easy to scale, highly flexible, and focused purely on stream processing. Storm excels at event processing and incremental computation, processing data streams in real time against changing parameters.
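The spout/bolt flow can be illustrated with a minimal single-process simulation. The `Bolt` interface and `run` helper below are our own illustrative names, not Storm's API; a real topology runs the bolts concurrently across worker nodes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class TopologyDemo {
    // A "bolt": maps one input tuple to zero or more output tuples,
    // covering both filtering (empty output) and mapping (one output).
    interface Bolt extends Function<String, List<String>> {}

    // Runs the tuples emitted by a spout through a chain of bolts in order.
    public static List<String> run(List<String> spoutTuples, List<Bolt> bolts) {
        List<String> stream = new ArrayList<>(spoutTuples);
        for (Bolt bolt : bolts) {
            List<String> next = new ArrayList<>();
            for (String tuple : stream) next.addAll(bolt.apply(tuple));
            stream = next;
        }
        return stream;
    }

    public static void main(String[] args) {
        Bolt filterBolt = t -> t.startsWith("http") ? List.of(t) : List.of();
        Bolt mapBolt = t -> List.of(t.toUpperCase());
        List<String> out = run(List.of("http://a", "ftp://b", "http://c"),
                               List.of(filterBolt, mapBolt));
        System.out.println(out); // [HTTP://A, HTTP://C]
    }
}
```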
3.3 Kafka Stream
Kafka Stream, an integral part of the Apache Kafka open source project, is a powerful, easy-to-use library that empowers Apache Kafka with stream processing capabilities.
Kafka Stream is a lightweight stream computing library. It has no external dependencies other than Apache Kafka itself and can be embedded in any Java program. Kafka serves as its internal messaging and storage medium, so no additional cluster needs to be deployed for stream processing.
Kafka Stream is easy to get started with, does not depend on other components, is very easy to deploy, supports fault-tolerant local state, and has low latency, making it well suited to lightweight stream processing scenarios.
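The "library, not cluster" point can be illustrated with a toy sketch: the entire pipeline below runs inside the host Java process. It is written in the spirit of Kafka Stream's filter/mapValues DSL, but `MiniStream` and `pipeline` are invented names for illustration, not the real API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

public class EmbeddedStreamDemo {
    // A toy KStream-like wrapper: processing is just method calls inside
    // the application's own JVM, with no separate processing cluster.
    static class MiniStream {
        private final List<String> records;
        MiniStream(List<String> records) { this.records = records; }

        MiniStream filter(Predicate<String> p) {
            List<String> out = new ArrayList<>();
            for (String r : records) if (p.test(r)) out.add(r);
            return new MiniStream(out);
        }

        MiniStream mapValues(Function<String, String> f) {
            List<String> out = new ArrayList<>();
            for (String r : records) out.add(f.apply(r));
            return new MiniStream(out);
        }

        List<String> toList() { return records; }
    }

    public static List<String> pipeline(List<String> input) {
        return new MiniStream(input)
                .filter(r -> !r.endsWith(".jpg"))   // drop image requests
                .mapValues(String::toLowerCase)     // normalize the rest
                .toList();
    }

    public static void main(String[] args) {
        System.out.println(pipeline(List.of("A.html", "B.jpg", "C.PHP")));
        // [a.html, c.php]
    }
}
```

In real Kafka Stream the records come from and go back to Kafka topics, but the topology likewise executes inside the application, which is why deployment stays so simple.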
3.4 Flink

Flink is an open source computing platform for distributed stream processing and batch data processing. It supports both batch and stream processing; it is primarily oriented toward stream data and treats batch data as a special case of streams.
The core of Flink is a streaming dataflow execution engine that provides data distribution, data communication and fault tolerance. On top of this engine, Flink offers higher-level APIs for users, as well as domain-specific libraries such as Flink ML, its machine learning library.
Flink is suitable for scenarios with extremely high stream processing requirements and a small number of batch tasks. It is compatible with native Storm and Hadoop programs and runs on YARN-managed clusters. Flink's biggest limitation at present is community activity: large-scale deployments are not yet as common as for the other processing frameworks.
3.5 PipelineDB

PipelineDB is a highly efficient streaming computing database built on PostgreSQL. It operates on data streams through SQL and stores the results of those operations. The basic workflow is: create a PipelineDB stream, write SQL that operates on the stream, and save the results into a continuous view.
PipelineDB's distinguishing feature is that stream processing is done purely in SQL, with no code required. It processes streaming data efficiently and continuously and stores only the processed results, which makes it well suited to tasks such as website traffic statistics and page-view counting.
3.6 Architecture Comparison
The comparison of the five streaming frameworks mentioned above is shown in Table 1:
Table 1 Comparison of streaming frameworks
Storm is mature and is the de facto standard among streaming frameworks, but its model and programming are relatively complex. The framework polls for data in a loop, which consumes considerable system resources, especially CPU, so a sleep routine is needed when tasks are idle to reduce resource consumption. Spark Streaming covers both batch and stream processing and has the strong support of Spark, giving it great development potential, but its interface with Kafka is not smooth enough. Kafka Stream is a library shipped with Kafka; it is simple to learn, program, deploy and operate, and needs no additional components, but multi-dimensional statistics require partitioning across different topics, which complicates the programming model. Flink is very similar to Spark Streaming, the difference being that Flink treats every task as a stream; it is slightly stronger than Spark Streaming in iterative computation and memory management, but its community is not yet active or mature enough. PipelineDB is a stream computing database suited to simple stream computing tasks; its advantage is that it requires essentially no development beyond familiarity with SQL, but clustered deployments require commercial support.
4 DPI data processing scheme
Based on the actual task requirements and the comparison above, and because Kafka Stream is easy to program, requires no additional software, connects seamlessly with Kafka and related components, is relatively stable and performs well overall, this paper chooses Kafka Stream as the core stream processing component.
4.1 Wideband DPI Processing
To complete the real-time packet capture, data completion, cleaning, conversion and merging of broadband DPI data, the DPI data processing scheme above is applied. The specific project design is shown in Figure 5:
Figure 5 Guangzhou broadband DPI processing scheme
The Mina process is a Java program developed on the Apache MINA framework. It receives AAA packets, extracts user account information, parses and computes it, persists it to Redis, and finally sends it to the capture program. The capture program is written in C; it uses the open source pcap library (libpcap) to capture HTTP packets from the network card, parses them, combines them with the user account data, and writes the DPI records into Kafka. Kafka Stream then completes the real-time cleaning and conversion of the DPI data.
Flume is a distributed, reliable, available and efficient open source system from Cloudera for collecting, aggregating and moving massive data from different data sources. It is simple to configure, requires essentially no development, and has low resource consumption. It supports writing data to HDFS and is very suitable for integration with big data systems. In this project, Flume writes the stream-processed data from Kafka to HDFS, where a Hive table is created to provide data to upper-layer applications.
The Kafka Stream layer adopts a self-developed ETL framework responsible for data filtering (removing pictures, videos, etc.) and data processing (obtaining network IDs, field parsing, etc.). The ETL framework is developed in Java and supports a variety of data sources, including plain text, compressed formats and XML. It supports multiple big data computing frameworks, including Map/Reduce, Spark Streaming, Kafka Stream and Flume, and provides easy extension, field validation, wildcard support for fields, and dimension-table lookup. On the operations side, it supports variable references and error handling.
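As a sketch of the kind of cleaning and parsing such an ETL step performs — the suffix list, the `|` delimiter and the three-field record layout are illustrative assumptions, not the project's actual rules:

```java
import java.util.Set;

public class DpiEtlDemo {
    // Assumed static-content suffixes to filter out (pictures, videos, JS).
    private static final Set<String> DROP_SUFFIXES =
            Set.of(".jpg", ".png", ".gif", ".mp4", ".js", ".css");

    // Filtering step: drop DPI records whose URL points at static content.
    public static boolean keep(String url) {
        String u = url.toLowerCase();
        for (String s : DROP_SUFFIXES) if (u.endsWith(s)) return false;
        return true;
    }

    // Parsing step: split a delimited DPI record (assumed layout:
    // account|timestamp|url) and extract the domain from the URL field.
    public static String domainOf(String record) {
        String url = record.split("\\|")[2];
        String noScheme = url.replaceFirst("^[a-z]+://", "");
        int slash = noScheme.indexOf('/');
        return slash < 0 ? noScheme : noScheme.substring(0, slash);
    }

    public static void main(String[] args) {
        System.out.println(keep("http://x.com/a.jpg"));               // false
        System.out.println(domainOf("u1|1620000000|http://x.com/a")); // x.com
    }
}
```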
4.2 4G DPI real-time statistics
Using telecom 4G DPI records as the data source, stream processing completes the real-time DPI statistics, including multi-granularity (5-minute / 1-hour / 1-day) deduplicated user counts, multi-granularity deduplicated user counts per terminal type, multi-granularity traffic statistics and multi-granularity deduplicated domain name statistics. The specific design of the 4G DPI real-time statistics project is shown in Figure 6:
Figure 6 4G DPI real-time statistical scheme diagram
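The multi-granularity deduplicated user counts can be sketched by truncating each record's timestamp to the 5-minute, 1-hour and 1-day bucket boundary and keeping one set of users per bucket. This is a simplified in-memory version; the project itself deduplicates with Bloom filters over phone-number-partitioned Kafka data.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class GranularityDemo {
    // The three granularities from the project: 5 minutes, 1 hour, 1 day.
    static final long[] GRANULARITIES_MS =
            {5 * 60_000L, 60 * 60_000L, 24 * 60 * 60_000L};

    // Returns deduplicated user counts keyed by "granularityMs:bucketStartMs".
    public static Map<String, Integer> dedupCounts(long[] ts, String[] users) {
        Map<String, Set<String>> buckets = new TreeMap<>();
        for (int i = 0; i < ts.length; i++) {
            for (long g : GRANULARITIES_MS) {
                String key = g + ":" + (ts[i] / g) * g;   // truncate to bucket
                buckets.computeIfAbsent(key, k -> new HashSet<>()).add(users[i]);
            }
        }
        Map<String, Integer> counts = new TreeMap<>();
        buckets.forEach((k, v) -> counts.put(k, v.size()));
        return counts;
    }

    public static void main(String[] args) {
        long[] ts = {0L, 60_000L, 400_000L};        // 0 s, 1 min, ~6.7 min
        String[] users = {"alice", "alice", "bob"};
        // alice appears once per 5-min bucket; the hour bucket sees 2 users.
        System.out.println(dedupCounts(ts, users));
    }
}
```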
The data source consists of gzip-compressed files. Because Flume does not natively support the .gz or .tar.gz formats, the underlying Flume code was modified to process the compressed files directly and save decompression time. When Flume collects files, the user's mobile phone number is used as the partition key, so that records for the same number land in the same partition, which simplifies deduplication. Through Kafka Manager, the Kafka cluster management tool, the status of the Kafka cluster can be monitored conveniently. The Kafka cluster producers are shown in Figure 7:
Figure 7 Kafka cluster producer
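Keying by phone number works because the target partition is a deterministic function of the key, so every record for one subscriber lands on the same partition. A simplified sketch follows; Kafka's default partitioner actually applies murmur2 to the serialized key, and `hashCode` is used here only for illustration.

```java
public class PartitionDemo {
    // Deterministic key -> partition mapping (simplified stand-in for
    // Kafka's murmur2(keyBytes) % numPartitions for non-null keys).
    public static int partitionFor(String phoneNumber, int numPartitions) {
        return (phoneNumber.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int p1 = partitionFor("13800138000", 12);
        int p2 = partitionFor("13800138000", 12);
        // Same number always maps to the same partition, so per-user
        // deduplication can stay local to one consumer.
        System.out.println(p1 == p2); // true
    }
}
```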
Kafka Stream consumes the 4G DPI data and processes it in parallel. Different counters are set up in the program, and all data passes through these counters. To handle deduplication, a Bloom filter is introduced; although it has a certain false-positive rate, it still achieves good deduplication performance. Likewise, consumers can be managed through Kafka Manager, where consumer lag can be observed intuitively.
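A minimal Bloom filter of the kind used for this deduplication might look as follows. The bit-array size and hash count are illustrative, and double hashing stands in for k independent hash functions; the structure never misses a true duplicate, at the cost of occasionally flagging a new item as seen.

```java
import java.util.BitSet;

public class BloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    public BloomFilter(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    // Double hashing: index_i(x) = h1 + i*h2, a standard way to derive
    // k hash positions from two base hashes.
    private int index(String item, int i) {
        int h1 = item.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) ^ 0x9e3779b9;
        return ((h1 + i * h2) & 0x7fffffff) % size;
    }

    // Returns true if the item was possibly seen before (duplicate),
    // false if it is definitely new; records the item either way.
    public boolean checkAndAdd(String item) {
        boolean seen = true;
        for (int i = 0; i < hashes; i++) {
            int idx = index(item, i);
            if (!bits.get(idx)) { seen = false; bits.set(idx); }
        }
        return seen;
    }

    public static void main(String[] args) {
        BloomFilter bf = new BloomFilter(1 << 20, 5);
        System.out.println(bf.checkAndAdd("13800138000")); // false: first time
        System.out.println(bf.checkAndAdd("13800138000")); // true: duplicate
    }
}
```

Because the filter holds only a bit array rather than every key, memory stays bounded even at billions of records per day, which is the trade-off accepted in the project despite the small misjudgment rate.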
To meet different output requirements, the program provides three outputs. Daily-granularity data is written to MySQL as a backup, and monitoring data for hotspot areas is output to Redis. For ease of management and data presentation, the ELK stack (Elasticsearch + Logstash + Kibana) is also used to pass all data to Kibana for front-end display. The Kibana interface is shown in Figure 8:
Figure 8 Kibana interface
5 Practice and analysis
5.1 Deployment Practices
The above two systems have been applied in actual production, both have good performance, can meet the task requirements, and have been running stably.
The broadband DPI processing project has 2 acquisition machines, 1 AAA server and 5 Kafka machines. Each acquisition machine generates 115 MB of data per second, with each of the two machines carrying about 1.8 G of traffic. The acquisition machines write 330,000 records per second into Kafka, Kafka Stream writes back 220,000 records per second, and the cleaning ratio (cleaning removes DPI records irrelevant to the business, such as pictures, videos and JS requests) is 33%. Kafka Stream's processing lag is stable at around 5 million records, with a processing delay within 15 seconds; Flume's write lag to HDFS is about 1 million records, with a delay within 5 seconds. The performance of the broadband DPI processing project is shown in Figure 9:
Figure 9 Broadband DPI processing project performance
There are 6 machines in the 4G DPI real-time statistics project: 1 runs the Flume collector, and the remaining 5 host Kafka, Kafka Stream and ELK. The collector generally writes 100,000 records/second into Kafka, with peaks of 250,000 records/second. The Elasticsearch cluster has 8 instances, each configured with 2 GB of memory; it currently holds 1.3 billion records occupying 361 GB. Importing into Elasticsearch through Logstash peaks at 80,000 to 90,000 records per second. Kafka Stream's processing lag stays within 10 s, and Logstash writes to Elasticsearch within 5 s, as shown in Figure 10. At present, the 4G DPI real-time statistics project processes more than 15,000 files (about 1.6 TB) per day, averaging over 10 billion records per day.
Figure 10 4G DPI real-time statistical project performance
5.2 Existing problems
During the development of the 4G DPI real-time statistics project, deduplication of domain names and CGIs was added later as requirements grew. Records with the same domain name or CGI were not necessarily in the same Kafka partition (which is keyed by phone number), so the results deviated. To solve this, the program performs a second deduplication pass: the output of the first pass is written back to the Kafka cluster keyed by CGI or domain name, and deduplication is done again. This lengthens the delay and complicates system maintenance.
Since broadband DPI processing involves only data filtering and transformation, not deduplication, Kafka Stream fits it very well. For the 4G DPI real-time statistics project, however, which involves repartitioning and deduplication, Storm would have been the better streaming framework. In Storm, data flows from one bolt to another, so the stream can be partitioned by phone number in one bolt and by CGI or domain name in another, avoiding the secondary deduplication and reducing the complexity of the programming model.
At the beginning of program design, an appropriate technical framework should be selected according to the needs of the application scenario. If Spark is already part of the project infrastructure, Spark Streaming is a good choice; if repartitioning or deduplication is required, as in the 4G DPI real-time statistics project, Storm is the first choice; for simple data cleaning and conversion, Kafka Stream is a good choice; and for simple, small-scale real-time statistics, PipelineDB is sufficient.
6 Conclusion

Streaming computing and batch processing suit different business scenarios; where timeliness matters, streaming computing has obvious advantages. This paper first outlined stream processing and its difference from batch processing, then compared the popular streaming computing frameworks and, according to business requirements, proposed a DPI data processing solution built around Kafka Stream together with components such as Flume and ELK, which is quick to learn, easy to program, and simple to deploy and maintain. The solution was applied to the broadband DPI processing project and the 4G DPI real-time statistics project, fulfilling the task requirements with excellent performance and stable operation.
In actual project practice, as task requirements grew, it was found that Kafka Stream does not perform well on multi-dimensional deduplication, and secondary filtering had to be introduced to compensate. Therefore, at the requirements stage, the possible problems and the applicable scenarios of each technical framework should be considered comprehensively when selecting the technology.