All content following this page was uploaded by Amar Abane on 03 September 2023.
Abstract—Network management relies on extensive monitoring of network state to analyse network behavior, design optimizations, plan upgrades, and conduct troubleshooting. Network monitoring collects various data from network devices through different protocols and interfaces such as NETCONF and Syslog, and from monitoring tools such as Zeek and Osquery. To unify and automate the monitoring workflow across the network, this paper identifies and discusses the data collection requirements for network management, reviews different monitoring approaches, and proposes an efficient data collection platform that addresses the requirements through an extensible and lightweight protocol. The platform design is demonstrated through an adaptive collection of data for network management based on digital twin technology.

Index Terms—network monitoring, network management, publish-subscribe, digital twin network

I. INTRODUCTION

Recent advancements in network management have led to the development of Network Management Systems (NMS) with inventory management, network topology visualization, configuration assistance, and network diagnostics. However, as network complexity and service diversity increase, configuration and management errors become more frequent and identifying the root cause of issues becomes more challenging. To address these issues, various network analysis and troubleshooting tools have emerged to improve network management by processing all kinds of network data with artificial intelligence (AI) and machine learning (ML) techniques. The data collected from the network is therefore crucial for the effectiveness of these techniques. This data may include device configuration and status, alarms, topology, port/link status, activity logs, traffic and flow statistics, user information, and service performance.

Network data is typically collected using monitoring tools through a multistage process that involves measuring, transmitting, aggregating, presenting, and storing the data [1]. However, current monitoring approaches have limitations in gathering network data (see Section III). For example, measurement platforms [2] concentrate solely on communication performance metrics, such as end-to-end latency. Telemetry interfaces such as NetFlow are inadequate for capturing device configurations. Management protocols are restricted to device configuration data and do not cater to devices' performance data, or have limited support for telemetry, as with gNMI [3]. Furthermore, the frequency and conditions for collecting each type of data are different, as is the data format, as discussed later.

Hence, a need arises for a more general and flexible network monitoring methodology. In this paper, we propose a network monitoring platform design that addresses these requirements. The platform makes it possible to gather all necessary network data from diverse sources, without the bottlenecks posed by centralized monitoring, and enables flexible and automated data collection with minimal communication overhead.

This paper is structured as follows. Section II identifies the main requirements in modern network monitoring. In Section III, popular monitoring solutions and approaches are discussed. Section IV presents the design of the proposed data collection platform. Section V discusses a use case for the platform considering the emerging concept of Digital Twin for network management. Section VI concludes the paper.

II. NETWORK MONITORING REQUIREMENTS

Network monitoring starts with data collection, where information is requested from network devices and mapped into an information model, either tool-specific or general (such as JSON). The formatted data is transmitted to the management station, where it undergoes aggregation, filtering, and representation according to the network data model. The network data is then used for various management purposes, and some of it may be stored for auditing and long-term analysis.

A network monitoring platform should facilitate data collection, aggregation, and storage, including integration of the tools that request the data [1]. Moreover, the data to collect varies in type, frequency, volume, and sources [4]. Hence, a suitable monitoring platform must be able to handle these diverse data types with a uniform workflow for efficient processing and presentation of data. The workflow should also be extensible to support new monitoring tools and parameters.

To minimize resource consumption, the platform must have a lightweight design with minimal communication and processing overheads. Increasing monitoring frequency can lead to higher resource consumption and data generation. Hence, the platform should be able to dynamically adapt monitoring frequency and metrics based on resource availability. Monitoring flexibility also includes the scheduling of probes and the ability to choose between periodic, on-demand, and event-based monitoring. To improve efficiency, the data delivery model should be leveraged to avoid duplicate messages.

Each data item obtained from monitoring should have unique identification and associated metadata about the originating request [4]. This information is used by storage systems to retrieve data when needed. Enriching collected data with user-specific labels is also useful to improve search capabilities.

Security is critical and encompasses authenticating data sources and consumers, and managing which users are authorized to access each set of collected data. Whereas securing the monitoring workflow is manageable in environments with a limited set of data sources and consumers, it becomes more challenging as the platform flexibility increases. A scheme to define and enforce policies is required to provide fine-grained control over authorization and access control while keeping a reasonable complexity for certificate and key management.

III. BACKGROUND

Four broad categories of monitoring approaches have gained recognition in recent years. These approaches inform the design of a network monitoring platform that meets the demands outlined above.

A. Internet measurement

Several platforms offer public monitoring at the global scale of the Internet. One such platform is RIPE Atlas [5], which leverages probe devices hosted by users across the Internet to collect data on network connectivity using predefined probes. The collected data is made publicly accessible, and users can conduct custom measurements. Each probe relies on registration servers to identify its controller, which manages the probe by sending a schedule for measurements and receiving results. The results are then centrally processed, enriched, and stored.

Another popular network measurement toolkit is PerfSONAR [2], designed to identify end-to-end network problems through measures such as bandwidth utilization, latency, and packet loss. Its architecture is divided into three layers, with various types of probes at the lowest layer, web services to invoke probes in the middle layer, and a user API at the highest layer to trigger measurements and access results.

While RIPE Atlas and PerfSONAR offer valuable network monitoring capabilities, their functionality is limited by predefined probes, and they do not offer an architectural foundation for efficient data distribution among multiple producers and consumers.

B. Measurement facilitators

Several platforms provide flexible large-scale network monitoring solutions by addressing specific aspects such as storage, interoperability, or scalability.

M-Lab [6] is a server infrastructure that facilitates measurement data exchange through effective resource allocation policies. Several network monitoring tools leverage M-Lab servers for measurement coordination and data ingestion.

mPlane [7] is a scalable infrastructure for distributed Internet measurement. The platform offers flexibility in monitoring through its support for single, iterative, and coordinated measurements, and enables dynamic integration of user-defined measurements through a probe's capability description and request mechanism. However, its point-to-point communication design limits the potential of its workflow and message scheme.

The authors in [4] propose a data collection method for Digital Twin Network (DTN), where the data streaming component informs the DTN of the data it can collect from network devices. The DTN sends commands to the data streaming component to request the desired data. However, this approach does not address other critical considerations such as efficient data delivery and data identification.

C. Standardization efforts

The standardization of network monitoring is being advanced through the efforts of consortiums and working groups. One such effort is the gNMI protocol [3], which offers a vendor-neutral interface for device management. It provides a unified service for both configuration and telemetry, enabling clients to exchange capabilities and retrieve data or subscribe to events from devices. However, the use of the same interface for both telemetry and configuration may result in suboptimal data delivery. While monitoring data can tolerate best-effort delivery with some data loss and duplication, management commands necessitate reliable and consistent data delivery. Additionally, gNMI currently lacks support for essential network diagnostic tools such as Ping and Traceroute, despite their availability through the gNOI protocol, the gNMI complement for network operation.

D. Cloud monitoring and logging

Monitoring tools integrated within cloud platforms [8] have become widespread for monitoring applications and resources, including virtual private cloud (VPC) networks. These monitoring tools gather performance data, resource utilization metrics, and logs from various sources, including the cloud provider's systems, managed products, applications, and VMs with agents installed. The collected data is pushed and processed through a monitoring suite, where it undergoes filtering, ingestion, labeling, and storage. The stored data can be further analyzed, visualized, and processed through user-defined alerts and metrics.

The data collection process is typically achieved via HTTP endpoints to which the monitored sources continuously push data. Although this approach is simple, with a centralized REST API, it does not offer control over data collection beyond filtering the data at the ingestion stage. On the other hand, cloud monitoring tools benefit from the security provided by cloud platforms through flexible identity and access management (IAM). IAM allows for precise control over users' and services' access to data and resources.
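The push-based workflow described above can be sketched in a few lines. The endpoint URL and field names below are illustrative assumptions, not a real cloud provider API; the sketch only shows that the source decides what to push, with filtering at ingestion as the sole control point.

```python
import json
import urllib.request

# Hypothetical ingestion endpoint; not a real provider URL.
INGEST_URL = "https://monitoring.example.com/v1/ingest"

def build_entry(source: str, metric: str, value: float, labels: dict) -> bytes:
    """Serialize one data point; user-defined labels are attached at the source."""
    return json.dumps({
        "source": source,
        "metric": metric,
        "value": value,
        "labels": labels,
    }).encode()

def build_push(entry: bytes) -> urllib.request.Request:
    """The source pushes unconditionally; it is never told what to collect."""
    return urllib.request.Request(
        INGEST_URL, data=entry, headers={"Content-Type": "application/json"})

def ingest_filter(entry: bytes, allowed_sources: set) -> bool:
    """Server-side filtering at ingestion is the only point of control."""
    return json.loads(entry)["source"] in allowed_sources

entry = build_entry("vm-42", "cpu.utilization", 0.73, {"env": "prod"})
request = build_push(entry)  # would be sent with urllib.request.urlopen
```

Nothing in this loop lets the management side change the metric set or frequency at the source, which is precisely the limitation the proposed platform targets.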
IV. PROPOSED PLATFORM

The proposed platform is named "CaSpeR", which stands for Capability-Specification-Result/Receipt, reflecting the message sequence that outlines the data collection workflow. This section describes the design of the platform. For clarity, technical considerations such as encoding format and a detailed discussion of message structure have been omitted.

A. Overview

The data collection platform design aims to streamline the acquisition of heterogeneous data through three key features. Firstly, it encompasses source discovery, data request and retrieval, and automated data processing to efficiently describe and integrate the data. Secondly, the design offers a scalable solution through a flexible scheme that balances the level of granularity in data requests against the associated overhead. Lastly, the architecture is designed for easy implementation and minimal impact on network resources by allowing seamless integration with the management and control plane.

The architecture of the proposed platform shares the design principles of mPlane [7]. These principles include adopting a unified protocol for data description, requests, and results. The protocol facilitates the discovery of monitoring capabilities and enables seamless coordination of their execution. Additionally, the architecture leverages self-contained and idempotent messages, ensuring that every message carries sufficient information to identify the monitoring task it relates to and can be easily detected and ignored in case of duplication.

While these design principles simplify the architecture and provide flexibility in controlling monitoring tasks, they do not address all the requirements for effective data collection. To fully realize the potential of this approach, crucial enhancements have been introduced to increase flexibility, enhance data semantics, and improve data distribution. These enhancements include: (i) the use of a publish-subscribe model for exchanging messages, which reduces communication overhead and enables diverse data dissemination options compared to point-to-point protocols; (ii) allowing data sources to manage the local execution of monitoring tasks through request aggregation and adjustment based on the solicitation level; and (iii) providing expressive data description through the use of semantics and application-defined labels.

B. Workflow

The platform comprises two main components that communicate through messaging: services and clients. The service collects data and the client requests it.

Three types of services are considered in the platform:
• Probe services (or agents) perform basic data collection tasks, such as tracking the status of a component, running measurements, or reading data from a device.
• Sink services interface with a data store to save and retrieve data results, or provide graphical visualization.
• Aggregators are services that also act as clients to other services. They collect data from multiple services, breaking down a complex data collection task into simpler ones and producing aggregated results. Depending on their level of intelligence, aggregators may also provide automated iterative monitoring, data transformation and correlation, etc.

In this platform, services broadcast capability messages to describe the data they are able to collect and the information required for data retrieval. Each data collection task should be represented by a separate capability. Clients receive capabilities and use them to request data by sending a specification message to the relevant service. The service responds with a receipt message indicating acceptance or rejection. If the specification is accepted, the collected data is disseminated through one or multiple result messages. The service executes the specification to the best of its ability and may adjust the execution.

Clients and services interact in the platform without establishing end-to-end sessions. Messages are exchanged via publish-subscribe topics. This model is chosen for its efficiency in disseminating messages to large groups of clients and services, reducing data duplication, and minimizing control messages.

Multiple services can offer the same capability, and a single client can submit specifications to multiple services. Similarly, a single service can distribute results to multiple clients. This decoupled interaction allows each service to manage the local execution of specifications to optimize resource utilization.

C. Message types

Each message conveys all necessary information for its processing, including the derivation of the topic name on which to receive or publish the next message (see Section IV-F). Figure 1 depicts an abstracted structure of the message types. The type attribute refers to the nature of the data collection task being described by the capability, which can range from real-time measurements (measure) to reading static data (collect) or database retrieval (query). The endpoint is a structured name that contains the namespace in which the capability is defined (e.g., /casper/useast-1/datacenter-1), the name of the capability (e.g., probe-port), and the identifier of the service or group of services providing the capability (e.g., switch-1).

Execution parameters supported by the capability are listed in the parameters section, which is a map containing parameter names and types. The allowed temporal scope is defined in the schedule section, which is a formatted string indicating start and stop time, period, etc. Parameters and schedule are filled with actual values by the specification message. The result-keys section defines the metrics or attributes that can be returned by the capability, and the specification message selects the metrics requested from the service. In result messages, result-values is a two-dimensional array containing values corresponding to the result-keys. The remaining fields are introduced in later sections.
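As a minimal sketch, a capability and a specification derived from it can be represented as follows. The field names (type, endpoint, parameters, schedule, result-keys) follow the description above; the concrete values, the schedule string format, and the validation helper are illustrative assumptions, not the exact CaSpeR encoding.

```python
# A capability advertised by a probe service (field names from the text).
capability = {
    "type": "measure",                        # measure | collect | query
    "endpoint": "/casper/useast-1/datacenter-1/probe-port/switch-1",
    "parameters": {"port": "string"},         # parameter name -> type
    "schedule": "start|stop|period",          # allowed temporal scope
    "result-keys": ["rx-bytes", "tx-bytes", "link-status"],
}

# A specification fills parameters/schedule with actual values and
# selects a subset of the offered result-keys.
specification = {
    "type": capability["type"],               # copied from the capability
    "endpoint": capability["endpoint"],
    "parameters": {"port": "eth0"},           # actual parameter values
    "schedule": "now|+10m|30s",               # assumed format: poll every 30 s
    "result-keys": ["rx-bytes", "tx-bytes"],  # requested metrics only
}

def is_valid(spec: dict, cap: dict) -> bool:
    """A specification may only request what the capability offers."""
    return (spec["type"] == cap["type"]
            and spec["endpoint"] == cap["endpoint"]
            and set(spec["result-keys"]) <= set(cap["result-keys"])
            and set(spec["parameters"]) <= set(cap["parameters"]))
```

A service receiving such a specification would run a check like `is_valid` before answering with a receipt that accepts or rejects the request.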
The receipt informs the client of the expected result-keys and the topic on which the result messages will be published. If the service performs schedule adjustment, it updates the nonce in the receipt and publishes the results for all specifications that are aggregated in the same task, either via the same topic or in separate topics for each specification. The service has the option to skip schedule adjustment, or to perform it and still publish results in separate topics.

Fig. 1. Structure of the main messages. (=) means that the field and its value are copied from the previous message, (+) means that the field can be added in the current message, (|) means that the field is kept from the previous message and its value is defined/updated in the current message, (∧) denotes a field with a specific value for each message. A combination of signs represents an alternative.

Figure 2 displays the relationships between messages. A specification carries all relevant information from the referenced capability. A receipt includes information from the linked specification and updated information about the expected result messages. A result contains information from the specification used to generate it. The interrupt, redemption, result, and termination messages all provide information on the task requested by the original specification. Besides capability, specification, result, and receipt, the workflow includes other message types. An Interrupt is used to inform a service to terminate the execution of a specification. A client asks a service to resend the results of a specification by sending a Redemption message. A Termination message is published by a service to signal that the execution of a specification has terminated. An Exception is sent by clients/services to signal workflow errors.

D. Data Collection Management

An operation is the data collection process requested by a specification. Each operation has a unique fingerprint, which is a hash of the type, endpoint, parameters, and result-keys defined in the specification. The fingerprint is used to group messages related to a specific operation. If the fingerprint cannot be computed from a message due to a modified field, its value is explicitly included in the message.

Each operation has an implicit id, which is generated by combining the fingerprint with a client-generated nonce, allowing the client to differentiate between multiple executions of the same operation. The combination of the fingerprint and id, along with the client's identity information, is known as the session, and is used by the service to manage operation execution.

The use of the fingerprint, id, and session in messages between the service and client allows for a balance between resource consumption, monitoring accuracy, and scalability. The service can adjust the requested operation as long as it complies with the specification. For example, if a specification requests a probe every 10 minutes, the service can fulfill it with an operation that produces results every 5 minutes. The service can determine whether a similar operation is already running by using the fingerprint and, if so, adjust its schedule to meet the new specification. This is known as schedule adjustment.

E. Result Management

The receipt is used by the clients to associate the results with a specific operation id. To present results in a concise format, the service may opt to split the result-keys across multiple result messages, a process referred to as result splitting. In this case, the service updates the result-keys in each message to match the corresponding result-values and includes the original operation fingerprint for identification purposes (see Figure 1).

The flow section is used to control the publishing of results. The service can set the flow to "stream" in the capability to indicate real-time streaming of results, or "batch" to indicate that results will be published once the operation is completed, either through a single message or multiple messages. Depending on the nature of the operation and the available resources, a service can enforce one flow option or allow the client to select the delivery mode in the specification. Upon receipt of result messages, the client can organize and reassemble the results based on the operation type, fingerprint, and id.

The metadata section helps in handling the results. The metadata type can be set to "point" to indicate that each result message represents a single point of data from the operation. In this case, the client can reconstruct the full operation data using the operation id. If the metadata type is set to "table", it indicates that each result message contains the complete data collected during the operation. The metadata format specifies how result-keys and result-values should be displayed in a chart, using chart definition languages such as Vega-Lite [9]. The metadata labels carry user-defined key-value information for tagging results. Labels defined by the service are included in subsequent specifications, receipts, and results. User-defined labels are set in the specification and kept in the corresponding receipt, but not in the results, as results may be shared among multiple clients.

F. Messaging topics

Figure 2 also displays the topics where the message types are published. Capabilities are published in the "capability" topic, while the specification, interrupt, and redemption messages are published in the topic derived from the capability's endpoint (i.e., "<endpoint>.control"). The receipt is published in the topic derived from the specification (i.e., "<endpoint>.receipt.<fingerprint>.<nonce>.<timestamp>"). The topics where results and termination messages are published are derived from the receipt (i.e., "<endpoint>.results.<fingerprint>").
G. Security

The security of CaSpeR communications is independent of the messaging system being used. As depicted in Figure 3, an Authorization Server (AS) enables the administrator/owner to control which permissions (publish-specification, read-result, publish-result) are granted to each identity (client and service) for each capability. Clients and services are authenticated through their own accounts managed by the administrator on the AS. Access control is based on roles. A role groups together a set of permissions necessary for participating in the workflow.

Three base roles are defined. The specification-sender role enables clients to request new operations and includes the publish-specification and read-result permissions. The result-reader role allows clients to only access data from ongoing specifications and includes only the read-result permission. The result-publisher role enables services to publish data and includes the publish-result permission. Additional roles can be created for more fine-grained access control. The administrator/owner can grant and revoke roles for each identity. A policy links an identity, a role, and a set of capabilities using the hierarchical naming structure of the endpoint.

The security scheme combines HTTPS between the AS and clients/services with self-secured encrypted messages to provide authentication and authorization (see Figure 3). The self-secured encryption scheme is similar to the role-based security framework demonstrated in Named Data Networking [10].

Each client and service has a certificate signed by the AS. Clients and services sign each message along with its endpoint, allowing the receiver to verify its authenticity. A service checks the signature of a specification (or interrupt, redemption) and the client's certificate, and uses the verified identity to retrieve the client's role from the AS. The service can then accept or reject the specification (or interrupt, redemption) based on the permissions allowed for the namespace to which the specification endpoint belongs. Similarly, a client checks the message signature and verifies that the service is authorized to produce messages for a given endpoint.

The AS manages symmetric content encryption/decryption keys (CK) for each namespace. Services and clients retrieve the CKs for the namespaces they have access to, based on their roles. Note that, with this scheme, if a client has a result-reader role, it can also decrypt specifications related to the capability. However, this does not pose a significant security threat, since message derivation from a capability is clearly defined in the protocol.

Fig. 3. CaSpeR security scheme.

V. CASE STUDY: ADAPTIVE DATA COLLECTION FOR DIGITAL TWIN

The concept of Digital Twin Network (DTN) has emerged to improve network management and automation using modeling, emulation, and AI/ML techniques [11]. A DTN is a real-time digital representation of a physical network, which can be used to design and evaluate network optimizations, plan network upgrades, conduct "what-if" analysis, and troubleshoot the network [12].

We discuss in the following how the CaSpeR platform can be used to collect network data for a DTN. In this case study, the DTN is a client, and the data is produced by various sources in the network acting as services.

A. Basic data collection

The DTN collects a variety of data from network equipment from different vendors, which use different protocols. The platform provides a uniform interface for DTN services and applications to access this data, hiding the protocol specifics. To build a digital version of the network, a DTN needs to continuously collect network topology (via port and link status), device configuration, alarms and logs, and various measurements reflecting network performance, such as service Key Performance Indicators (KPIs) and device telemetry.

Network performance data is collected periodically. The capability advertised by the corresponding service has the type "measure" and uses the "point" metadata type. Data is sent to the DTN as a stream, with the collection frequency defined in the specification's schedule section.

Real-time updates of network topology changes are critical for effective operation of the DTN. The corresponding capability has the type "measure" and uses the "point" metadata type. The "on-event" option in the "schedule" section of the specification allows for real-time reception of topology changes.

Device configuration data is locally stored on the device and collected by a service, which advertises it using a "collect" capability with a "table" metadata type. In a device, some configurations change on a daily basis, and others change rarely or less frequently [13]. To handle this, the DTN sends two specifications for the same capability, one for the infrequent changes and one for the frequent changes. Both specifications set the flow section to "batch". To collect data from all devices while reducing the number of exchanged messages, one capability can be implemented to collect data from more than one device, using one column in the result-keys for the device name and specifying the device(s) to collect data from in the parameters section.

Logs and alarms are parsed at the service and described using a "collect" capability with the "table" metadata type. The DTN can have logs published periodically and alarms received in real-time.

While a DTN system can automate monitoring and operation, human expertise is still crucial in production networks. To aid in this, sink services with graphical user interfaces can be deployed with minimal overhead, as they consume copies of the data that is sent to the DTN.

VI. CONCLUSION

Collecting large amounts of heterogeneous data has become necessary for modern network management tools, whereas it used to be an additional feature in traditional NMS and Software-Defined Networking (SDN) solutions. Therefore, data collection needs a dedicated workflow instead of being implemented alongside control protocols as it has been so far. This need is addressed by proposing an extensible data collection platform that encapsulates the various interfaces used in network monitoring.

The platform can also be used for other telemetry purposes. For example, it is currently used for optical quantum network metrology [15].