

CCF Transactions on Networking (2019) 1:1–15
https://doi.org/10.1007/s42045-018-0009-7

REVIEW PAPER

Bridging machine learning and computer network research: a survey


Yang Cheng · Jinkun Geng · Yanshu Wang · Junfeng Li · Dan Li · Jianping Wu

Received: 6 November 2017 / Accepted: 14 November 2018 / Published online: 30 November 2018
© China Computer Federation (CCF) 2018

Abstract
With the booming development of artificial intelligence (AI), a series of AI applications are emerging and driving sweeping changes across industries. As the core technology of AI, machine learning (ML) shows great potential in solving network challenges. Network optimization, in return, brings significant performance gains for ML applications, in particular distributed machine learning. In this paper, we conduct a survey on combining ML technologies with computer network research.

Keywords  Artificial intelligence · Machine learning · Computer network · Directions and challenges

1 Introduction

In recent years, there has been rapid development in artificial intelligence (AI), especially with the breakthroughs in machine learning and deep learning. In the field of networking, it is generally acknowledged that applying ML techniques to practical problems is promising, and some pioneering works have been carried out to bridge network research with ML.

On one hand, ML technologies can be adopted to solve complicated challenges in network scenarios. Instead of relying on laborious human operators, ML techniques are utilized to train a detection model that monitors the performance of the network system and identifies misconfigurations and malicious attacks with high efficiency and accuracy. It has become an emerging trend to utilize classic intelligent algorithms, such as Tabu search, deep learning and reinforcement learning, to find near-optimal solutions to NP-hard network resource management problems efficiently. For Quality of Service (QoS) optimization, ML technologies can support many more control policies and more complicated models than traditional algorithms, so as to maintain higher performance of network services.

On the other hand, network optimization can also benefit the ML workflow and bring significant performance gains. As the amount of training data increases rapidly and ML models become more complicated, the computation requirement is beyond the capability of a single machine, and thus dozens of distributed machine learning platforms have emerged recently. However, expensive communication costs have caused several bottlenecks for these platforms. Network optimization, such as decentralized topologies, communication compression and network processing offload schemes, has improved the overall performance of these distributed ML platforms.

The combination of ML and networking proves to be a tempting direction with much remaining for further exploration. Because ML & network is a fresh interdisciplinary area and there is currently a lack of systematic review of relevant works, this paper conducts a comprehensive survey and summarizes the existing works into three main dimensions: machine learning based network management solutions, machine learning based network security & privacy solutions, and network for distributed machine learning. Further, we focus on several typical scenarios in networking and bring forth the future directions and challenges in this interdisciplinary area.

* Dan Li
[email protected]

Yang Cheng
[email protected]

Jinkun Geng
[email protected]

Yanshu Wang
[email protected]

Junfeng Li
[email protected]

Jianping Wu
[email protected]

1 Department of Computer Science and Technology, Tsinghua University, Beijing, China


The rest of the paper is organized as follows. Section 2 provides an overview of machine learning based network management solutions. Section 3 reviews the significant works in machine learning based network security and privacy solutions. Section 4 surveys representative works in network for distributed machine learning. We summarize and bring forth the future directions in Sect. 5.

2 Machine learning based network management solutions

Operation & management have always been a major part of network engineering, especially with the booming of cloud computing and the development of data centers. Since the scale of networks has grown to a great extent, it becomes more challenging for network operators to manage such large networks and guarantee the service quality provided to customers. Traditional operating methodology requires a large amount of manual work, which places a laborious burden on network operators. The rise of ML techniques has brought new opportunities to free network operators from this heavy workload. Meanwhile, through machine learning techniques, system performance can be optimized and resources can be better utilized. Specifically, ML techniques have been applied in the following aspects of network operation & management.

2.1 Intelligent maintenance and fault detection

Performance anomalies can damage the service quality provided to customers, and it is a critical issue for network operators to detect or prevent anomalies in routine maintenance. Many explorative works have been conducted focusing on the design of anomaly detectors Fontugne et al. (2010), Soule et al. (2005), Li et al. (2006), Yamada et al. (2013), Ashfaq et al. (2010). However, traditional monitoring mechanisms are time-consuming and less effective, since they require expert knowledge from operators and involve laborious manual work. Observing this, some novel solutions have been proposed that utilize ML techniques for network system maintenance.

2.1.1 Maintenance with sufficient history data

Given sufficient history data, it is feasible to train a detection model so that it can replace human operators in monitoring the performance of the network system. Opprentice Liu et al. (2015) is a novel framework to detect performance anomalies with reference to KPI data. The main objective of Opprentice is to choose suitable detectors and tune them to detect real-time anomalies without the participation of network operators. To attain this, Opprentice constructs a Random Forest model from the history data accumulated by experienced operators. Human operators can interact with the system periodically to set hyper-parameters and input fresh training data for correction. Most of the time, Opprentice is able to independently conduct online detection of possible anomalies without the participation of network operators. In this way, the accumulated history data is utilized and human labor is much reduced. The Winnowing Algorithm Lutu et al. (2014) is another successful application in router management, used to distinguish unintended limited visibility prefixes (LVPs) caused by misconfigurations or unforeseen routing policies. The essence of the Winnowing Algorithm is decision-tree based classification. In the training process, many decision trees are established based on the labeled data. With the boosted tree model, the Winnowing Algorithm is able to distinguish unintended LVPs from those that are the stable expression of intended routing policies, and thus detect anomalous events in the routing system.
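To make the detector idea concrete, the sketch below trains a random forest on windowed KPI features to flag anomalous points, in the spirit of Opprentice. It is a minimal illustration rather than the authors' implementation: the feature set, window length and synthetic data are our own assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def kpi_windows(series, width=10):
    """Turn a raw KPI series into per-point feature vectors:
    window mean, window std, current value, deviation from mean."""
    feats = []
    for i in range(width, len(series)):
        w = series[i - width:i]
        feats.append([w.mean(), w.std(), series[i], series[i] - w.mean()])
    return np.array(feats)

# Synthetic KPI trace with injected spikes, standing in for labeled history data.
rng = np.random.default_rng(0)
kpi = rng.normal(100.0, 5.0, 2000)
anomaly_idx = rng.choice(len(kpi), 40, replace=False)
kpi[anomaly_idx] += 50.0                      # injected anomalies
labels = np.zeros(len(kpi), dtype=int)
labels[anomaly_idx] = 1

X = kpi_windows(kpi)
y = labels[10:]                               # align labels with windowed features

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X[:1500], y[:1500])                   # "history data" accumulated by operators
print("flagged anomalies:", clf.predict(X[1500:]).sum())
```

The operator-in-the-loop part of Opprentice corresponds to periodically refreshing the labeled windows and refitting the forest.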
2.1.2 Maintenance without sufficient history data

On the other hand, when there is not enough data to train a suitable model for the practical scenario, transfer learning becomes a considerable choice. The SMART solution Botezatu et al. (2016), proposed by researchers at IBM, focuses on disk replacement in the data center and trains a classification model to determine whether a disk should be replaced. Considering that the data come from different scenarios with different distributions, they take advantage of transfer learning to eliminate sample selection bias Zadrozny (2004). In this way, the data from different scenarios are transferred to train the ML model and further help the operator with disk replacement.

2.2 Resource management and job scheduling

Resource management and job scheduling have always been hot topics in the data center, especially as the scale grows large and the communication becomes more complicated Chowdhury and Stoica (2015), Ma et al. (2016). Usually the problem can be formulated as an NP-hard problem, and past non-AI works adopt simple heuristic methods to solve it Ballani et al. (2011), Li et al. (2016), Xie et al. (2012), Zhu et al. (2012). Nowadays there is an emerging trend to utilize classic intelligent algorithms to find better solutions efficiently. Generally speaking, a proper resource management solution focuses on two aspects of demand, i.e. utilization-driven and energy-saving. On one hand, the solution is expected to improve resource utilization and accelerate progress. On the other hand, energy consumption should be reduced in the data center, especially in dynamic scenarios that involve VM (or container) migration. Besides utilization-driven and energy-saving objectives, there are also some works pursuing a hybrid objective to reach a better trade-off among various performance metrics.

2.2.1 Utilization-driven solution

Libra Ma et al. (2016) is a representative work focusing on the first aspect; it aims to maximize the isolation guarantee with an optimal placement for parallel applications in the data center. In order to maximize the isolation guarantee, Tabu search is adopted in Libra to help find an optimal solution for container placement at an affordable computation cost. Besides traditional intelligent algorithms such as Tabu search, novel techniques including deep learning and reinforcement learning are also applied in resource management and job scheduling. DeepRM Mao et al. (2016) is a pioneering work in adopting deep reinforcement learning technology to manage resources in network systems. It defines the average job slowdown to quantify the normalized progress rate of jobs and aims to minimize this metric with a standard policy gradient algorithm. DeepRM proves the feasibility of deep RL techniques in resource management problems and motivates fresh ideas for subsequent research.

2.2.2 Energy-saving solution

Compared with Libra and DeepRM, MadVM Zheng et al. (2013) places its emphasis on the second aspect; it aims to reduce the energy consumption during VM management. The main idea of MadVM is to approximate the practical scenario of VM migration with a Markov Decision Process (MDP). Under the framework of the MDP, the objectives are quantified with reference to both power consumption and resource shortage. Some approximation tricks are integrated to reduce the high dimensionality, and the MDP is able to provide a near-optimal solution with energy efficiency.

2.2.3 Hybrid objective based solution

Instead of focusing on one objective, some solutions cover more aspects in the management. SmartYarn Xu et al. (2016) starts from the point that different resource configurations may lead to similar performance; therefore, an optimal configuration of multiple resources is expected to provide the desired service quality as well as save much cost. To attain this, reinforcement learning is adopted in SmartYarn with consideration of both the service-level agreement and cost efficiency. Based on the usage-based pay-for-resources model, the cost efficiency is quantified subject to the constraints of the performance requirement. Then SmartYarn adopts the popular reinforcement learning algorithm, Q-learning, to solve the problem and achieve the approximated optimization efficiently. A similar work is also implemented with Q-learning to balance QoS revenue and power consumption in geo-distributed data centers Zhou et al. (2016). The objective function quantifies QoS revenue and power consumption with a simple weighted sum; besides, optimization techniques are integrated to accelerate the solving computation. By tuning the hyper-parameter in the weighted sum, adaptive policies can be generated from the RL model to cater to various services in geo-distributed data centers.
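The Q-learning loop behind such controllers can be sketched in a few lines. The state space, reward and transitions below are toy stand-ins (a discretized load level and a weighted sum of QoS revenue and power cost) chosen only to show the tabular update; they are not taken from the SmartYarn or Zhou et al. papers.

```python
import numpy as np

n_states, n_actions = 5, 3        # discretized load levels x resource configs (toy)
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.9, 0.1
rng = np.random.default_rng(0)

def step(state, action):
    """Toy environment: reward = QoS revenue minus weighted power cost."""
    qos = 10.0 - abs(state - action * 2)       # served demand vs provisioned capacity
    power = 2.0 * action                       # more resources, more energy
    reward = qos - 0.5 * power                 # the weight 0.5 tunes the trade-off
    next_state = rng.integers(n_states)        # demand fluctuates randomly
    return reward, next_state

state = 0
for _ in range(10000):
    action = rng.integers(n_actions) if rng.random() < eps else int(Q[state].argmax())
    reward, nxt = step(state, action)
    # Standard Q-learning update
    Q[state, action] += alpha * (reward + gamma * Q[nxt].max() - Q[state, action])
    state = nxt

print("learned config per load level:", Q.argmax(axis=1))
```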
Moreover, there are also some research works focusing on predicting the future situation to determine the current resource management solution. For example, user demand can be modeled with neural network classifiers, and then adaptive solutions are generated to determine the resource configuration and job scheduling in the data center Bao et al. (2016). Recent works, such as Tan et al. (2017), Zhang et al. (2017), He et al. (2013), adopt online learning techniques to predict future workload and reduce the cost of time and resources.

2.3 Service performance optimization

AI & ML can be effective tools to optimize the performance of various applications for better network services, such as video streaming services, web search services, content delivery services and so on.

2.3.1 Video service optimization

Video quality can be affected by many factors during delivery Jiang et al. (2016), and there have been many prior works targeting a better user experience. Since network quality can fluctuate, adaptive bitrate (ABR) algorithms receive lots of attention because they aim to select the proper bitrate based on changing network conditions. CS2P Sun et al. (2016) improves bitrate adaptation based on network throughput. Inspired by the similarity of throughput patterns among different video sessions, CS2P establishes a Hidden Markov Model (HMM) to predict the throughput and further execute adaptation decisions. The HMM proves to be effective in throughput prediction and contributes to a better quality of experience (QoE). However, most ABR algorithms can hardly adapt to a broad range of network conditions and objectives due to their fixed control rules and simplified models. Therefore, Pensieve Mao et al. (2017) resorts to reinforcement learning for bitrate adaptation. Instead of fixed control policies, Pensieve establishes a neural network and learns the bitrate control policies automatically. In this way, Pensieve outperforms state-of-the-art ABR algorithms under a variety of network conditions and QoE objectives.

2.3.2 Web search optimization

Apart from the quality management of video streaming, web search is another hot field in QoS optimization. From the perspective of user experience, response time is a key attribute to consider during QoS optimization for web services. Optimizing against high search response time (HSRT) is challenging under heavy workloads. Such a situation motivates the combination of machine learning techniques and HSRT optimization. FOCUS Liu et al. (2016) is the first work in this direction; it adopts a decision tree model to learn domain knowledge automatically from search logs and identify the major factors responsible for HSRT conditions.
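The FOCUS workflow (learn a decision tree over search-log attributes, then read off the conditions that lead to high search response time) looks roughly like the sketch below. The log fields and the 200 ms threshold are illustrative assumptions, not values from the paper.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(2)
n = 5000
# Hypothetical per-query log attributes.
cache_hit = rng.integers(0, 2, n)
backend_ms = rng.gamma(2.0, 30.0, n)
img_bytes = rng.gamma(2.0, 40_000, n)
response_ms = backend_ms + img_bytes / 5000 + np.where(cache_hit, 0, 80)

X = np.column_stack([cache_hit, backend_ms, img_bytes])
y = (response_ms > 200).astype(int)            # HSRT label, threshold assumed

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
# The learned split rules expose which factors drive HSRT.
print(export_text(tree, feature_names=["cache_hit", "backend_ms", "img_bytes"]))
```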
2.3.3 CDN service optimization

PACL (Privacy-Aware Contextual Localizer) Das et al. (2014) adopts a similar technique to FOCUS, but it aims to learn users' contextual locations and further improve the quality of content delivery. PACL models the mobile traffic with a decision tree, together with pruning techniques. The most significant attributes are then identified to imply user location contexts. With the predicted context information, PACL is able to choose the nearest CDN node for content delivery, thus reducing the waiting time in data transfer.

2.3.4 Congestion control optimization

Machine learning can also be integrated into the TCP congestion control (CC) mechanism to improve network performance. For example, it has been used to classify congestive and non-congestive loss Jayaraj et al. (2008), forecast TCP throughput Mirza et al. (2007), and provide better RTT estimation Nunes et al. (2011). Remy Winstein et al. (2013) formalizes the multi-user congestion control problem as an MDP and learns the optimum policy offline. It needs intense offline computation, and the performance of the RemyCCs depends on the accuracy of the network and traffic models. PCC Dong et al. (2015) adaptively adjusts its sending rate based on continuous profiling, but it is entirely rate-based and its performance depends on the accuracy of the clocking. The learnability of TCP CC was examined in Sivaraman et al. (2014), where RemyCC was used to understand what kinds of imperfect knowledge of the network model would hurt the learnability of TCP CC more than others. Q-learning based TCP Li et al. (2016) is the first attempt (that we know of) that uses Q-learning to design TCP CC.

2.4 Traffic analysis

Traditional clustering and classification methods are widely applied in earlier works that aim to find valuable information from the large amount of data packets Santiago et al. (2012), Baralis et al. (2013), Franc et al. (2015), Bartos et al. (2016), Xu et al. (2015), Antonakakis et al. (2012). Via clustering and classification, similar patterns are mined out among data packets, which are helpful for applications such as security analysis and user profiling. The combination of classification and traffic analysis remains a hot topic in recent years; nevertheless, some other machine learning algorithms have also come into use for traffic analysis.

2.4.1 Natural language processing for traffic analysis

Proword Zhang et al. (2014) leverages natural language processing techniques in protocol analysis. First, Proword designs the Voting Experts (VE) algorithm to select the most probable boundary positions for word partitioning. Based on the candidate feature words extracted by the VE algorithm, Proword tries to mine out the protocol features from these words. The candidate words are ranked with pre-defined scoring rules and the top k of them serve as the feature words. The combination of NLP and protocol analysis demonstrates a higher accuracy than traditional protocol analysis methods.

2.4.2 Exploratory factor analysis for traffic analysis

Exploratory factor analysis (EFA) emerges as a novel factor analysis technique, and it is believed to be more effective than traditional principal component analysis (PCA) techniques for multivariate analysis. The recent work Furno et al. (2017) applies the EFA technique to mobile traffic data and bridges temporal and spatial structure analysis. This work fills the gap in this joint area with better or equal results compared to state-of-the-art solutions.

2.4.3 Transfer learning for traffic analysis

Transfer learning also contributes to traffic analysis and security threat detection in Bartos et al. (2016), which proposes a classification system to detect both known and previously unseen security threats. Since there may be biases between the training domain and the target domain, the knowledge acquired from a traditionally trained model cannot be directly applied to the target cases. Through the transfer learning technique, the feature values are transformed into an invariant representation for domain adaptation. Such an invariant representation is integrated into the classification system, which helps to categorize traffic into malicious or legitimate classes.

Traffic analysis is perhaps the area most closely related to machine learning techniques, and it has involved lots of effort from computer network researchers. The broad applications of traffic analysis lie in security research, such as user profiling and anomaly detection. We will further discuss these aspects in the following section.

3 Machine learning based network security and privacy solutions

Network security and privacy are a broad area covering a range of issues, including anomaly detection, user authentication, attack defense, privacy protection and other aspects. With the explosion of network data in recent years, traditional methodologies are confronted with more challenges in detecting and defending against emerging attacks. Inspired by successes in traditional areas, researchers try to use classic ML methodologies (e.g. Naive Bayes Nandi et al. (2016), KNN Wang et al. (2014), regression Nandi et al. (2016), decision trees Zheng et al. (2014), Soska and Christin (2014), SVM Franc et al. (2015), random forests Hayes and Danezis (2016)) to solve a series of complicated security problems.

3.1 Anomaly detection on cloud platform

Anomalies such as misconfigurations and vulnerability attacks have a great impact on system security and stability, and their detection can be very challenging without sufficient knowledge of the system conditions. In most cases, detection requires analyzing a large amount of system logs, which is not feasible for human operators. Machine learning techniques thus provide alternatives to manual work.

Since the cloud platform is more and more popular, many attackers are targeting the rich resources of cloud platforms; thus, how to detect anomalous behavior on a cloud platform is arousing wide concern among researchers. FraudStormML Neuvirth et al. (2015) adopts supervised regression methods to detect fraudulent use of cloud resources in Microsoft Azure. It can detect fraud storms with reference to their utilization of resources like bandwidth and computation units, thus raising early alerts to prevent the illegal consumption of cloud computing resources. APFC Zhang et al. (2014) is another system, developed with a hierarchical clustering algorithm, to automatically profile the maximum capacities of various cross-VM covert channels on different cloud platforms. Covert channels may cause information leakage and be utilized by hackers to threaten system security.

The maturity of NLP techniques brings new ideas to anomaly detection, and it proves to be an efficient way to detect anomalous issues based on execution logs. Nandi et al. (2016) follow this idea and construct a graph abstraction from the execution logs: executions in the distributed applications are abstracted as nodes and workflows are abstracted as edges to connect them. Naive Bayes and linear regression models are used to detect whether there is any anomaly hidden in the graph. Such a mapping from anomaly detection to graph mining gains good performance in efficiency and scalability.

In most of these works Neuvirth et al. (2015), Zhang et al. (2014), classic algorithms are trained on system logs and other KPIs, so the key to more precise results is finding a well-organized set of features, which tends to generalize poorly. More precise and more generalizable results can be obtained from another view Nandi et al. (2016).

3.2 Authentication and privacy protection

Authentication guarantees user privacy and prevents information leakage. The user identity is supposed to be authenticated with a reliable mechanism, and then the user is granted authorization for access.

The popularity of mobile devices causes more problems for user authentication and privacy protection. On one hand, numerous applications on the phone require collecting users' behavior data for better service, but the collected information may be exploited by adversaries, further incurring leakage and threatening user privacy. On the other hand, mobile users may be unaware of information protection, so their credentials may be stolen by others easily. Complex passwords and/or secondary verification mechanisms can reduce the risk of authentication attacks, but they also bring significant inconvenience to users.

Observing the behavioral differences between users, Zheng et al. (2014) designed a novel authentication mechanism that improves privacy by profiling a user's behavior according to their smartphone usage habits. More specifically, they model the nearest-neighbor distance and a decision score based on features (e.g. acceleration, pressure, size and time) to judge whether the behaviors belong to the same user and whether authentication should be granted.
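The behavioral check in Zheng et al. (2014) boils down to comparing a new interaction against a stored profile of the user's past touches. A minimal nearest-neighbor version, with made-up feature scales and an assumed decision threshold, could look like this:

```python
import numpy as np

# Stored profile: past touch events as (acceleration, pressure, size, time) rows.
rng = np.random.default_rng(3)
profile = rng.normal([1.0, 0.6, 0.30, 120.0], [0.1, 0.05, 0.03, 15.0], (200, 4))
scale = profile.std(axis=0)                    # normalize features to comparable units

def authenticate(event, profile, scale, k=5, threshold=2.5):
    """Grant access if the event's mean distance to its k nearest
    profile touches falls below a decision threshold (assumed value)."""
    d = np.linalg.norm((profile - event) / scale, axis=1)
    score = np.sort(d)[:k].mean()
    return score < threshold, score

ok, s = authenticate(np.array([1.02, 0.58, 0.31, 118.0]), profile, scale)
print("legitimate user:", ok, round(s, 2))
ok, s = authenticate(np.array([1.6, 0.9, 0.5, 60.0]), profile, scale)  # impostor-like
print("impostor:", ok, round(s, 2))
```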
Besides analysis on the user side, analysis on the adversary's side is also worthwhile. It has been observed that the behaviors of users and adversaries keep changing dynamically Wang and Zhang (2014). The changes of the user's behavior and the adversary's actions can be modeled with a two-state Markov chain for inferring the user's behavior. The interaction between user and adversary is generalized as a zero-sum game: the user tries to change his behavior to make the adversary fail to predict his next state, whereas the adversary tries to predict the user's behavior by adjusting its strategies. The zero-sum game can be solved by a minimax learning algorithm with provable convergence to obtain the optimal strategies for users to protect their privacy against malicious adversaries.

Fingerprinting is regarded as an effective technique for behavior identification even under circumstances with encryption, and it can be utilized by both defenders and attackers.


AppScanner Taylor et al. (2016) is a typical work which uses automatic fingerprinting techniques to identify Android apps based on the analysis of the network traffic they generate. It adopts an SVM classifier and a random forest to establish its identification model. By using flow characteristics such as packet length and traffic direction, AppScanner is able to identify sensitive apps, which may become the target of attacks. Since AppScanner focuses on statistical characteristics, it works well even against encryption. In addition, fingerprinting attacks are seen as a serious threat to online privacy.
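AppScanner's pipeline, per-flow statistical features fed to SVM and random forest classifiers, can be approximated as below. The feature set and the synthetic flows are our own stand-ins; the point is that only packet sizes and directions are needed, which is why the method survives payload encryption.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def flow_features(pkt_lens, directions):
    """Statistics over packet lengths, split by direction (+1 out, -1 in)."""
    out = pkt_lens[directions > 0]
    inc = pkt_lens[directions < 0]
    stats = lambda v: [v.mean(), v.std(), v.max(), len(v)] if len(v) else [0, 0, 0, 0]
    return stats(out) + stats(inc)

rng = np.random.default_rng(4)
def synth_flow(app):
    n = rng.integers(20, 60)
    base = 300 if app == 0 else 900            # two apps with different length profiles
    lens = rng.normal(base, 80, n).clip(60, 1500)
    dirs = rng.choice([-1, 1], n, p=[0.7, 0.3] if app == 0 else [0.4, 0.6])
    return flow_features(lens, dirs)

X = np.array([synth_flow(a) for a in (list(range(2)) * 300)])
y = np.array([0, 1] * 300)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X[:500], y[:500])
print("held-out accuracy:", clf.score(X[500:], y[500:]))
```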
Privacy is a big concern with the popularity of mobile devices: more knowledge about user behavior helps the service provider offer a better user experience, but it can also be exploited by malicious attackers to put the owner in danger. So it is more like an adversarial game, and many works mentioned before bring novel ideas to such scenarios. In general, how to figure out malicious behaviors while keeping a high-quality user experience should be taken into consideration.

3.3 Web security and attack detection

Web security is another rigorous issue today, and many websites suffer attacks caused by vulnerabilities and other factors. The recent k-fingerprinting Hayes and Danezis (2016) attack employs website fingerprinting techniques to launch an attack even when confronted with a large amount of noisy data and encrypted traffic. It adopts a random decision forest to construct the website fingerprint, trains it on a set of features (such as bursts or packet lengths instead of plain text), and proves to be an efficient methodology for identifying which websites a victim is visiting based on historical data, which can be used in further attack behaviors.

How to identify malicious websites is arousing great interest from academia. In general, traffic inspection serves as a common and effective method for malicious behavior identification. The recent Soska and Christin (2014) proposes a general approach for predicting a website's propensity to become malicious in the future. It adopts a C4.5 decision tree trained on relevant features (e.g. derived from traffic statistics, file system and webpage structure and contents) to identify whether a page is going to turn malicious.

However, the increasing variety of network applications and protocols, as well as the widely used encryption techniques, makes it more challenging to identify malicious behavior from the huge mass of traffic. Meanwhile, many relevant works have brought novel ideas on how to train models with limited datasets and make ML more powerful for web security issues.

The recent work Franc et al. (2015) focuses on how to learn a traffic detector from weak labels; it adopts a novel Neyman-Pearson model to identify malicious domains by learning from a publicly available blacklist of malicious domains. Different from traditional methods, it can learn efficiently from a weakly labeled dataset via a MIL (multiple instance learning) framework. The main idea is that it first extracts features from the proxy log, then analyzes the correlation between a huge amount of unlabeled data and a small fraction of labeled data, further deriving weak labels for them. This gives an inspiration on how to learn from weakly labelled datasets.

Apart from the deficiency of labeled data for training, another problem in malicious behavior detection lies in the difficulty of understanding the data representation, since attackers may hide their behaviors with traffic obfuscation to escape being tracked. Confronted with such a problem, a robust representation suite is proposed in Bartos et al. (2016) for classifying evolving malicious behaviors from obfuscated traffic. It groups sets of network flows into bags and represents them with a combination of feature values and feature differences. The representation is designed to be resilient to feature shifting and scaling, and oblivious to bag permutation and size changes. The proposed optimization method learns the parameters of the representation automatically from the training data (SVM-based learning of the number and size of the histogram bins), allowing the classifiers to create robust models of malicious behaviors capable of detecting previously unseen malicious variants and behavior changes.

Unlike malicious obfuscation, traffic encryption, which is usually applied for privacy protection, also causes great challenges for identifying malicious flow streams. To analyze malicious traffic flows with encrypted payloads, meta-information such as packet length and time interval is used in Comar et al. (2013) to train the model. A two-level framework is constructed, combining an existing IDS and a self-developed SVM algorithm, which proves to be effective in identifying malicious traffic from tremendous flow data.

The methods mentioned above may not work as well as raw traffic analysis, limited by obfuscation and encryption, and more advanced methodologies may be proposed to cater to those scenarios; on the other hand, many works try to remedy the limited conditions from other views.

SpiderWeb Stringhini et al. (2013) offers a new approach to detect malicious web pages by using redirection graphs. It collects HTTP redirection data from a large and diverse collection of web users, aggregates the different redirection chains that lead to a specific web page, and then analyzes the characteristics of the redirection graph, extracting 28 features to represent it. By inputting these features into an SVM classifier, SpiderWeb is able to identify malicious web pages more accurately than previous methods.

Going further, MEERKAT Borgolte et al. (2015) brings a novel approach based on the "look and feel" of a website to identify whether the website has been defaced. Different from previous works, MEERKAT leverages recent computer vision techniques and directly takes a snapshot of the page as input. It can automatically learn high-level features from the data and does not rely on additional information supplied by the website's operator. MEERKAT employs a stacked autoencoder neural network to "feel" the high-level features, which are extracted automatically and fed into a feedforward neural network to identify defaced websites.

Web security is rigorous and harder to solve with ML-based methodology when confronted with obfuscation and encryption technologies. On the one hand, researchers can develop more advanced models catering to such harsh conditions; meanwhile, novel angles from other areas may bring new chances.

3.4 Barriers in ML-based security research

Researchers have focused on applying classic ML methodology to solve security issues and have achieved great success. However, more barriers remain in this area, and here we summarize three main aspects as follows.

Challenge of model design Network security issues are mainly analyzed on the basis of traffic traces and system logs. Current research simply borrows models from other areas (such as computer vision and pattern recognition) into the security scenario. However, how to design more effective models to mine the complex relationships hidden in these data remains a key concern. Wang (2015) brings a novel idea on how to map network traces to other areas: inspired by CV, it maps traffic to a bitmap and adopts an autoencoder to distinguish different kinds of traffic. It works well on unencrypted raw network traffic but is not suitable for encrypted traffic. Other work [Nandi et al. (2016), which maps system logs to DAG problems] also gives inspiration on how to map network security issues onto traditional areas. How to design effective models catering to network security scenarios deserves to be deeply explored in the future.

Lack of training dataset In traditional areas (e.g. CV, NLP and speech recognition), many datasets (e.g. ImageNet, MNIST, SQuAD, Billion Words, bAbI, TED-LIUM, LibriSpeech) are public to academia and industry, and researchers can access those resources to develop more advanced models. In the security area, many factors (such as political and commercial concerns) constrain access to ground-truth network datasets, making it even harder to apply machine learning technologies to network security issues.

Adversarial and game theory The security problem is more like a competitive game with many factors involved. On the one hand, defenders are trying to get more precise results by tuning parameters and advancing their models; on the other hand, attackers are trying to obfuscate their malicious behavior within normal data. This adversarial relationship between defender and attacker makes ML-based security issues more complicated, since ML techniques can be leveraged by both attackers and defenders.

In summary, ML possesses great potential power in security research. However, more factors need to be considered and many open problems remain in this area.

4 Network for distributed machine learning platform

The rapid development of the Internet has led to an explosion in the amount of business data as well as the increasing complexity of training models. The time-consuming training process and heavy workload make it practically impossible to undertake these tasks on one single machine; therefore, distributed computing becomes an alternative to consider. In recent years, there have been some representative works on distributed machine learning platforms, such as Hadoop hadoop (2009), Spark Zaharia et al. (2010), GraphLab Low et al. (2014), DistBelief Dean et al. (2012), Tensorflow Abadi et al. (2016), MXNet Chen et al. (2015), etc. Generally speaking, there are a couple of major issues to consider during the construction of an efficient distributed machine learning platform: (1) network topology, (2) parallelism and synchronization, and (3) communication and scalability.

4.1 Network topology for distributed machine learning

The architecture design of the distributed platform can impose significant impacts on the execution efficiency and overall performance; meanwhile, it has a close relationship to other issues, such as fault tolerance and scalability. So far, two types of architectural prototypes have been proposed, i.e. the Parameter Server-based (PS-based) architecture and the Ring-based architecture.

4.1.1 PS-based architecture

The PS-based architecture Chilimbi et al. (2014) is illustrated in Fig. 1. In the PS-based design, the machines (or nodes; in this paper, we use node and machine as synonyms) are organized in a centralized way and there is a functional difference among them. Some nodes work as parameter servers (PSs) whereas the others work as workers.


Fig. 1 Parameter server architecture

PSs are responsible for managing the parameters of the model and coordinating the execution of the workers. Workers undertake the training tasks and submit their updated parameters to the PSs periodically.

Take the gradient descent (GD) algorithm as an example, which is a common iterative method adopted for training neural networks. Under the framework of the PS-based architecture, each worker holds a part of the training data. In the beginning, workers pull the model parameters from the PSs and then train their models independently with their local data. After a certain number of computation iterations, workers push their newly calculated gradients to the PSs, which aggregate these gradients to update the whole model. Workers then again pull the updated model and continue their training processes.
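The pull, compute and push loop described above condenses into a single-process sketch. Everything here (the quadratic toy loss, one logical server, synchronous aggregation) is a simplification for illustration; real PS systems shard parameters across many servers and communicate over the network.

```python
import numpy as np

class ParameterServer:
    def __init__(self, dim):
        self.w = np.zeros(dim)                 # global model parameters
    def pull(self):
        return self.w.copy()
    def push(self, grads, lr=0.1):
        self.w -= lr * np.mean(grads, axis=0)  # aggregate workers' gradients

rng = np.random.default_rng(5)
data = [rng.normal(3.0, 1.0, 100) for _ in range(4)]   # each worker's local shard

def worker_grad(w, shard):
    return w - shard.mean(keepdims=True)       # gradient of 0.5*(w - mean)^2 (toy loss)

ps = ParameterServer(dim=1)
for step in range(50):
    w = ps.pull()                              # workers pull the current model
    grads = [worker_grad(w, shard) for shard in data]
    ps.push(grads)                             # server aggregates and updates
print("converged parameter:", ps.w)            # approaches the global data mean
```

Replacing the synchronous `push` with independent per-worker pushes is exactly where the consistency trade-offs of Sect. 4.2.3 appear.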
4.1.2 Ring-based architecture

Unlike the PS-based architecture, which follows a centralized principle and keeps functional differences among nodes, the Ring-based architecture ringspshpc (2017) regards each node as logically equal and works in a completely decentralized way.

Fig. 2 Ring-based architecture

As illustrated in Fig. 2, the nodes are organized as a ring and each node in the ring works in the same way. Compared with the PS-based architecture, a node in the Ring-based architecture does not need to pull/push its parameters from/to a central server; instead, it just communicates with its two directly connected neighbors. Under the framework of the Ring-based architecture, each node works iteratively and the execution can be decomposed into two steps, scatter and gather.

For simplicity, we use a 6-node ring to illustrate the execution of scatter and gather (Fig. 3). Similar to the workers in the PS-based architecture, each node in the ring holds a part of the training data. During each iteration, the 6 nodes in the ring run independently and compute the parameters for the training model. Then they execute the scatter and gather procedures to synchronize the parameters. After these steps, each of them holds the same parameters and continues to execute the next iteration.

Fig. 3 Steps in scatter stage

Scatter As shown in Fig. 2, all six workers hold one copy of the training model and compute the model parameters in parallel. After each of them has computed its local parameters, the scatter procedure is triggered for each node to exchange its parameters with its neighbors. Each node evenly splits its local parameters into several shares, where the number of shares equals the number of nodes. In this scenario, each node divides its local parameters into 6 parts and transfers one part in each scatter step.

In this case, it can be implied that after 5 scatter steps, each node will possess one part of the global parameters, which sums up the corresponding original parameters from every node. The scatter procedure then completes and it comes to the gather procedure to synchronize the parameters for each node.


Fig. 4 Steps in gather stage

Gather Since each node possesses only one part of the global parameters (1/6 of the parameters for each node in this scenario) after the scatter procedure, the gather procedure synchronizes the parameters. Similar to scatter, each node sends its global parameters to its right neighbour and receives global parameters from its left neighbour. As shown in Fig. 4, during the first gather step, node0 receives the global parameters from node5 and passes its own global parameters to node1. During the second gather step, node1 receives the global parameters of node4 indirectly via node5; meanwhile, node0 also passes on to node1 the fresh global parameters it received from node5 in the last gather step.

Similar to the scatter procedure, after 5 gather steps the parameters on each node are synchronized and each node holds the same global parameters for the next iteration.
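The 6-node scatter/gather procedure is exactly a ring all-reduce: a reduce-scatter followed by an all-gather. The sketch below simulates the chunk bookkeeping with plain arrays, under the assumption of equal-length parameter vectors; real implementations overlap these transfers with computation.

```python
import numpy as np

def ring_allreduce(params):
    """Average equal-length parameter vectors across n 'nodes' by
    simulating the n-1 scatter steps and n-1 gather steps."""
    n = len(params)
    chunks = [np.array_split(p.astype(float), n) for p in params]
    for step in range(n - 1):                  # scatter: each node reduces one chunk
        for node in range(n):
            c = (node - step - 1) % n          # chunk index travelling around the ring
            chunks[node][c] += chunks[(node - 1) % n][c]
    for step in range(n - 1):                  # gather: circulate the reduced chunks
        for node in range(n):
            c = (node - step) % n
            chunks[node][c] = chunks[(node - 1) % n][c]
    return [np.concatenate(ch) / n for ch in chunks]

nodes = [np.full(12, float(i)) for i in range(6)]   # 6 nodes, local parameters
out = ring_allreduce(nodes)
print(out[0])                                  # every node now holds the mean (2.5)
```

Each node sends and receives only 1/n of the parameters per step, which is what gives the ring layout its balanced bandwidth usage.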
4.1.3 Comparison between PS-based and Ring-based architecture

The PS-based architecture decouples the execution into model training and parameter synchronizing, which are undertaken by workers and PSs correspondingly. Such a design strengthens robustness: when failures occur in several workers, the system maintains a graceful performance degradation Guo et al. (2009). However, in the Ring-based architecture, if even one single node fails, the entire ring collapses and the performance degrades sharply.

On the other hand, since the PS-based architecture requires centralized servers to synchronize the parameters for the workers, the communication between servers and workers can become a potential bottleneck. Especially when there are many more workers than servers, the high concurrency caused by workers brings much pressure to the servers and further affects overall performance. On the contrary, the Ring-based architecture follows a decentralized principle and each node shares the workload (both computation and communication) evenly. Compared with the PS-based architecture, there is no significant communication bottleneck as the number of nodes reaches a large scale.

Besides, comparing the two architectures on scalability yields contradictory conclusions. As for the PS-based architecture, fresh workers can join the current system without much effect on other workers. On the other hand, in order to deploy more nodes into the Ring-based architecture, the original ring has to be changed and the logic of scatter and gather on existing nodes must be rearranged. From this perspective, the scalability of the Ring-based architecture is not as good as that of the PS-based architecture. However, compared with the Ring-based architecture, the PS-based architecture consumes more machines and more links, which can restrict its scalability at large scale. The Ring-based architecture requires no centralized nodes and makes full utilization of each machine. From this perspective, it scales better than the PS-based architecture. In short, the comparison on scalability should refer to the specific scenarios as well as the key constraints; either architecture can be the better option.

4.2 Parallelism and synchronization

Parallelism is a key factor that helps to accelerate the training process for distributed machine learning platforms. Generally speaking, there are two main types of parallelism modes to consider, i.e. data parallelism and model parallelism. Confronted with different business models, each platform has its own emphasis on the parallelism modes.

4.2.1 Data parallelism

Data parallelism is one common parallelism mode for distributed systems, where each node works with the same training model; in other words, each node holds a complete copy of the model parameters. The differences between nodes mainly lie in the training data. Confronted with a large amount of training data, each node can only hold one part and conducts the training process with the training data stored on its own machine. After a certain period, the nodes communicate with each other to synchronize their model parameters for further training. As mentioned, different architectures (PS-based or Ring-based) use different synchronization mechanisms. Data parallelism has been applied in a wide range of settings, and most distributed machine learning platforms support it, such as Tensorflow Abadi et al. (2016), MXNet Chen et al. (2015), Li et al. (2014a), Petuum Xing et al. (2015), etc.

4.2.2 Model parallelism

Some practical businesses may require a huge training model containing billions of parameters, which is too difficult for a single machine to handle. Model parallelism splits the model parameters into several sections and reduces the intersection between the parts. In this way, some training steps in the training models can be executed independently with their own parameters, and the overall efficiency can be much improved by parallelizing these steps. GraphLab Low et al. (2014) and Tux2 Xiao et al. (2017) are considered to be two typical works in this direction, both integrated with novel splitting techniques to support model parallelism.


They both adopt innovative model decomposition strategies: GraphLab takes the vertex as the minimum granularity and distributes the vertexes onto different nodes. The edges reflect the dependency between vertexes, and multiple copies of edges are stored to avoid the loss of dependency information. Tux2, on the other hand, cuts vertexes and replicates them into several copies stored on several nodes. Such a design proves to be effective in handling power-law graphs and better matches the PS-based architecture.

4.2.3 Synchronization and asynchronization

Synchronization versus asynchronization is a concerning issue in either parallelism mode: synchronization can sometimes cause serious communication costs; asynchronization, on the other hand, can lead to frustrating results and incur more iterations. Since neither mechanism is perfect, some strategies have been proposed to combine the benefits of the two mechanisms.

K-bounded delay Li et al. (2014a) can be regarded as a trade-off between synchronization and asynchronization in model updates. It relaxes the synchronization constraints and allows the fastest worker node to surpass the slowest one by no more than K rounds. Only when the gap goes beyond K rounds is the fastest node blocked for synchronization. K is a user-defined hyperparameter and varies across models. In particular, when K is set to zero, the K-bounded delay mechanism turns into the synchronous one; when K is infinite, it turns into the asynchronous one.
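A minimal way to express the K-bounded delay rule: a worker may proceed only while its iteration count stays within K of the slowest worker. The sketch below checks the gate with plain counters; systems like the parameter server of Li et al. (2014a) enforce the same bound with version clocks.

```python
clocks = {w: 0 for w in ("w0", "w1", "w2")}   # per-worker iteration counters
K = 2                                          # user-defined staleness bound

def may_advance(worker):
    """Fastest worker may lead the slowest by at most K rounds.
    K = 0 degenerates to bulk-synchronous; K = infinity to fully async."""
    return clocks[worker] - min(clocks.values()) < K

# Toy schedule: w0 is twice as fast and eventually blocks on the bound.
for _ in range(6):
    for w in clocks:
        speed = {"w0": 2, "w1": 1, "w2": 1}[w]
        for _ in range(speed):
            if may_advance(w):
                clocks[w] += 1                 # run one training iteration
print(clocks)                                  # gap never exceeds K
```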
4.3 Communication optimization

The poor performance of a distributed system has much to do with communication latency. This is even more distinctive with the popularity of GPU-based computing such as NVIDIA nvidia et al. (2017) and AMD amd (2017). GPUs are efficient for parallel computing with a lot of computing cores, and they are in great need of an efficient communication mechanism. The increasing training scale can incur expensive communication costs and cause severe bottlenecks for the platform. To mitigate the communication bottleneck and achieve satisfactory performance, there are some tricks worth considering.

4.3.1 Efficient communication protocols

As one of the most typical communication protocols, TCP is widely applied to distributed machine learning systems. However, its drawbacks, such as slow start, naive congestion control and high latency, seriously damage system performance. Inspired by this, many researchers try to introduce more efficient communication protocols to improve scalability and distribution, including RDMA, GPUDirect RDMA, NVLink, etc.

Remote Direct Memory Access (RDMA) Archer and Blocksome (2012) is another high-performance communication protocol, which aims to access memory on another machine directly. RDMA can minimize packet-processing overhead and latency with the assistance of a dependable protocol implemented in hardware, zero-copy and kernel-bypass technologies. With those features, RDMA can achieve 100 Gbps throughput and less than 5 μs latency. RDMA has great advantages over TCP and has been applied to distributed machine learning systems such as Tensorflow Jia et al. (2017), Abadi et al. (2016). To further release the potential of RDMA, GPUDirect RDMA gpudirect (2018) enables a direct path for data exchange between the GPU and a third-party peer device using standard features of PCIe (e.g. network interfaces), instead of relying on the assistance of the CPU, which incurs extra copying and latency. Related work Yi et al. (2017) is trying to introduce GPUDirect RDMA to improve the performance of distributed machine learning systems.

Recently, systems with multiple GPUs and CPUs are becoming common in AI computing. These GPUs and CPUs communicate with each other via PCIe. However, as GPUs gain more and more computation ability, the traditional PCIe bandwidth is increasingly becoming the bottleneck at the multi-GPU system level, driving the need for a faster and more scalable multiprocessor interconnect. The NVIDIA NVLink nvlink et al. (2018) technology addresses this interconnection issue by providing higher bandwidth, more links, and improved scalability for multi-GPU and multi-GPU/CPU system configurations. A single NVIDIA Tesla V100 GPU supports up to six NVLink connections and a total bandwidth of 300 GB/s, 10X the bandwidth of PCIe Gen 3. Servers like the new NVIDIA DGX-1 dgx (2017) take advantage of these technologies to gain greater scalability for ultrafast deep learning training.

4.3.2 Data compression and communication filter

Usually, the parameters are stored and transferred in key-value format, which may cause redundancy, because the values can be small (floats or integers) and the same keys are transferred during each interaction between server and worker. To mitigate this, several tricky strategies are adopted:

Transferring the updated portion of parameters Li et al. (2014a), Hsieh et al. (2017): Since model parameters are represented as structured mathematical objects, such as vectors, matrices, or tensors, and typically only a part of the object is updated at each iteration, only the partial rather than the full matrix is transferred, thus greatly reducing the communication cost.

Transferring the values instead of key-value pairs: Due to the range-based push and pull, a range of key-value pairs is communicated at each iteration.


When the same range is chosen again, it is likely that only the values have changed while the keys are unmodified. If both the sender and receiver have cached these keys, only the values, together with a signature of the keys, need to be transferred between them. Therefore, the network bandwidth is effectively doubled Li et al. (2014b).

Compressing the transferred data: Since the transferred values are compressible numbers, such as zeros, small integers, and 32-bit floats with an excessive level of precision, communication cost can be reduced by using lossless or lossy data compression algorithms. Li et al. (2014b) compress the sparse matrix by eliminating most zero values; gRPC Abadi et al. (2016) eliminates redundancy to decrease the transferred data with novel compression algorithms; Wei et al. (2015) use a 16-bit float to replace each 32-bit float value to improve bandwidth utilization; and Chilimbi et al. (2014), Zhang et al. (2017) decompose the gradient of a fully-connected layer into two vectors to decrease the transferred data.

Besides, Li et al. (2014b), Hsieh et al. (2017) have observed that many updates in one iteration are negligible in changing the parameters. To balance computation efficiency and communication, Li et al. (2014b), Chen et al. (2015) adopt a KKT (Karush-Kuhn-Tucker) threshold to filter out the most insignificant updates and transfer only those updates which can dramatically affect the parameters while preserving convergence; Gaia Hsieh et al. (2017) has also adopted a dynamic filter threshold to extract the significant updates, making it efficient to train a model over a WAN.
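Two of the tricks above, dropping near-zero updates and shrinking the wire format to 16-bit floats, compose naturally. The fixed magnitude cutoff here is for illustration only; KKT-based and dynamic thresholds, as in Li et al. (2014b) and Gaia, pick the significant updates adaptively.

```python
import numpy as np

def compress(grad, threshold=0.01):
    """Keep only significant entries, and ship them as float16."""
    idx = np.flatnonzero(np.abs(grad) > threshold)
    return idx.astype(np.int32), grad[idx].astype(np.float16)

def apply_update(params, idx, vals, lr=0.1):
    params[idx] -= lr * vals.astype(np.float32)   # decompress on arrival

rng = np.random.default_rng(6)
grad = rng.normal(0, 1e-3, 1_000_000)          # mostly negligible updates
grad[rng.choice(grad.size, 5000, replace=False)] += 0.5

idx, vals = compress(grad)
raw_bytes = grad.nbytes
sent_bytes = idx.nbytes + vals.nbytes
print(f"sent {sent_bytes / raw_bytes:.1%} of the original {raw_bytes >> 20} MiB")

params = np.zeros_like(grad, dtype=np.float32)
apply_update(params, idx, vals)
```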

4.3.3 Batch computation

Gradient descent (GD) has been widely used in various kinds of distributed machine learning scenarios, and it requires considerable communication between GPUs because each GPU must exchange both gradients and parameter values on every update step. To reduce the communication cost, the batch computation idea has been applied to the optimization of GD and has become a prevalent method in distributed machine learning. Mini-batch gradient descent (MBGD) Cotter et al. (2011) divides the training data into several parts (each part is called one batch) and uses one batch per parameter update. Although large mini-batches reduce more communication cost, they may slow down the convergence rate in practice Byrd et al. (2012). Thus, the size of the mini-batch should not be set too large. With a moderate batch size, parameters can be updated frequently. Meanwhile, compared with stochastic gradient descent (SGD), which uses one sample at a time to update the model, MBGD retains good convergence properties Li et al. (2014c) and enjoys better robustness to noise, since the batched data can smooth the biases caused by noisy points. D-PSGD Lian et al. (2017) was recently proposed to utilize a ring-based topology for improving distributed machine learning performance. During each iteration of D-PSGD, each node calculates gradients based on the parameter values from the last iteration and its local training dataset; it then collects the previous parameters from its neighbors (there are two neighbors in the ring-based topology) and averages the three copies of parameters (i.e. the two copies from the neighbors as well as the local parameters on itself) to replace its local parameters. Finally, it uses the calculated gradients to update the fresh parameters. D-PSGD is proven to have the same computational complexity as well as smoother bandwidth consumption compared with traditional topologies such as the PS-based topology. Compared to AllReduce, it gains better performance by reducing the number of communications in high-latency networks. Besides, it follows a parallel workflow and overlaps the time for parameter collection and gradient updates, thus gaining better performance.
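One D-PSGD iteration per node (average the two ring neighbors' parameters with the local copy, then apply a locally computed gradient) fits in a few lines. The quadratic objective and step size are toy choices made to show the update order, not the paper's setup.

```python
import numpy as np

n, dim, lr = 6, 4, 0.05
rng = np.random.default_rng(7)
x = [rng.normal(size=dim) for _ in range(n)]          # per-node parameters
targets = [rng.normal(size=dim) for _ in range(n)]    # per-node local data

def local_grad(w, t):
    return w - t                                  # gradient of 0.5*||w - t||^2

for it in range(200):
    grads = [local_grad(x[i], targets[i]) for i in range(n)]
    x = [(x[(i - 1) % n] + x[i] + x[(i + 1) % n]) / 3 - lr * grads[i]
         for i in range(n)]                       # neighbor averaging + local step

consensus = np.mean(targets, axis=0)
print(np.allclose(x[0], consensus, atol=1e-2))    # nodes agree on the global optimum
```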
5 Future direction for network & ML

The advent of the ML revolution brings fresh vitality to computer network research, whereas the improvement of network performance also provides better support for ML computations. The combination of computer network and ML technology is a frontier area and many open issues remain to be explored. Generally speaking, future research will focus on two main dimensions.

5.1 Network by ML

Lots of network-related challenges are expected to be solved or mitigated with the integration of ML technologies. As introduced in this paper, QoS optimization and traffic analysis gain much benefit from machine learning techniques. "Network by AI" will remain a hot topic, which aims to adopt ML technologies to solve network problems. Towards this direction, some major points should be considered.

1. Data Data collection is a key step for most ML techniques. The quality and quantity of data can significantly affect the subsequent modeling and training process. However, network-related data may touch on individual privacy and is usually unavailable. For instance, encryption raises the barrier for data access and defeats many analytical methods. Further, when the data is accessible, the preprocessing of network data also requires special consideration. Noisy data and irrelevant features may damage the accuracy of the training models. The filtering and cleaning of network data are expected to involve much effort and skill. The lack of labeled network data is also a big challenge.


choices that match the scenario. In the prior works, some eter in training, so how to design a system with slack
classic models and methods are employed to solve the fault tolerance to improve the efficiency of ML system
network problems, such as basic SVM, linear regression, is an open issue.
etc. To gain a better performance, more advanced mod-
els are applied to better fit the practical cases. No doubt Network has always played a fundamental role in com-
deep learning and reinforcement learning provide more puter engineering. The recent development of ML tech-
powerful tools for complex network problems. However, nology brings lots of novel ideas and methods for network
the modeling of a training process should be conducted research. It is believed that the combination of network and
with a full understanding of the practical problems. The ML will generate more innovations and create more values
abuse of deep learning and reinforcement learning may in the near future.
not gain much benefit.
Acknowledgements  This work is supported by the National Natural
Science Foundation of China under Grants No. 61772305.
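As a toy illustration of the cleaning step mentioned in point 1, the sketch below assumes flow records held in a pandas DataFrame with hypothetical columns (duration, bytes, proto); the rules themselves are illustrative, not a prescribed pipeline.

```python
import pandas as pd

def clean_flow_records(df: pd.DataFrame) -> pd.DataFrame:
    """Toy cleaning pass over flow records before model training.

    Drops incomplete records, removes obvious measurement noise
    (non-positive durations or byte counts), and discards constant
    columns that carry no information for the model.
    """
    df = df.dropna(subset=["duration", "bytes", "proto"])
    df = df[(df["duration"] > 0) & (df["bytes"] > 0)]
    constant = [c for c in df.columns if df[c].nunique() <= 1]
    return df.drop(columns=constant)
```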
5.2 Network for ML

With the popularity of deep learning and reinforcement learning, both of which impose an increasing demand for computation capacity, one direction is to improve the computational ability of a single machine with advanced processors (e.g., TPU, DGX-1 and other HPC accelerators). Distributed machine learning can be another competitive solution, where ML can benefit a lot from high-performance network techniques.

Although there are some novel works on improving the efficiency of ML platforms with high-performance networks, it still remains an active research area in both academia and industry, which has involved much effort but still leaves many open issues:
1. Network topology There are two main topologies used today: centralized (PS-based) topology and decentralized (ring-based) topology, and both still have drawbacks that hamper scalability and performance. For instance, the bandwidth of the server node becomes the bottleneck of the whole system under the centralized topology; on the other hand, the ring-based topology lacks fault tolerance, which is infeasible in practice. Since these drawbacks hamper performance, an ideal topology combining the advantages of the centralized and decentralized designs could benefit the performance of distributed machine learning (a back-of-envelope comparison is sketched after this list).
2. Network protocol The reduction of communication cost is also a key concern in ML platforms. Recent works (such as MPI, RDMA, GPUDirect RDMA, etc.) greatly mitigate the communication bottleneck. However, drawbacks like naive flow control are inefficient in large-scale networks and downgrade the throughput in reality, so the communication pattern can still be optimized to improve the performance of distributed machine learning.
3. Fault tolerance Fault tolerance is a long-term concern for network infrastructure, and it will also play a significant role in ML platform construction. Different from other applications, ML training is less sensitive to imprecision in parameter updates, so how to design a system with slack fault tolerance to improve the efficiency of ML systems is an open issue.
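A back-of-envelope comparison of the two topologies in point 1 illustrates why neither is ideal on its own. The cost model below is the standard textbook estimate (one full-model synchronization per step), not a measurement; the function name and parameters are illustrative.

```python
def traffic_per_step(num_workers: int, model_bytes: float):
    """Toy traffic estimate for one synchronization of a full model.

    Parameter server: every worker pushes an update and pulls the fresh
    model, so the server side moves 2 * W * model_bytes per step.
    Ring all-reduce: each node sends about 2 * (W - 1) / W * model_bytes,
    essentially constant as W grows, but one failed node breaks the ring.
    """
    ps_server = 2 * num_workers * model_bytes
    ring_node = 2 * (num_workers - 1) / num_workers * model_bytes
    return ps_server, ring_node

# Example: 32 workers sharing a 500 MB model.
print(traffic_per_step(32, 500e6))  # ~32 GB at the server vs <1 GB per ring node
```

The server-side total grows linearly with the number of workers under the PS design, while the ring keeps per-node traffic nearly constant at the price of fragile, failure-sensitive communication.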

Network has always played a fundamental role in computer engineering. The recent development of ML technology brings many novel ideas and methods to network research. It is believed that the combination of network and ML will generate more innovations and create more value in the near future.

Acknowledgements This work is supported by the National Natural Science Foundation of China under Grant No. 61772305.

References

AMD: Accelerators for high performance compute. http://www.amd.com/en-us/solutions/professional/hpc (2017)
Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D.G., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., Zheng, X.: TensorFlow: a system for large-scale machine learning. In: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283. USENIX Association, GA (2016)
Antonakakis, M., Perdisci, R., Nadji, Y., Vasiloglou, N., Abu-Nimeh, S., Lee, W., Dagon, D.: From throw-away traffic to bots: detecting the rise of DGA-based malware. In: 21st USENIX Security Symposium (USENIX Security 12), pp. 491–506. USENIX, Bellevue, WA (2012)
Archer, C., Blocksome, M.: Remote direct memory access. https://www.google.com/patents/US8325633. US Patent 8,325,633 (2012)
Ashfaq, A.B., Javed, M., Khayam, S.A., Radha, H.: An information-theoretic combining method for multi-classifier anomaly detection systems. In: 2010 IEEE International Conference on Communications, pp. 1–5 (2010)
Ballani, H., Costa, P., Karagiannis, T., Rowstron, A.: Towards predictable datacenter networks. In: SIGCOMM '11, pp. 242–253. ACM, New York, NY, USA (2011)
Bao, Y., Wu, H., Liu, X.: From prediction to action: a closed-loop approach for data-guided network resource allocation. In: Proceedings of the SIGKDD '16 Conference, pp. 1425–1434 (2016)
Baralis, E.M., Mellia, M., Grimaudo, L.: Self-learning classifier for internet traffic (2013)
Bartos, K., Sofka, M., Franc, V.: Optimized invariant representation of network traffic for detecting unseen malware variants. In: 25th USENIX Security Symposium (USENIX Security 16), pp. 807–822. USENIX Association, Austin, TX (2016)
Borgolte, K., Kruegel, C., Vigna, G.: Meerkat: detecting website defacements through image-based object recognition. In: 24th USENIX Security Symposium (USENIX Security 15), pp. 595–610. USENIX Association, Washington, DC (2015)
Botezatu, M.M., Giurgiu, I., Bogojeska, J., Wiesmann, D.: Predicting disk replacement towards reliable data centers. In: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 39–48 (2016)
Byrd, R.H., Chin, G.M., Nocedal, J., Wu, Y.: Sample size selection in optimization methods for machine learning. Math. Program. 134(1), 127–155 (2012)
Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., Zhang, Z.: MXNet: a flexible and efficient machine learning library for heterogeneous distributed systems. CoRR abs/1512.01274 (2015)
Chilimbi, T., Suzue, Y., Apacible, J., Kalyanaraman, K.: Project Adam: building an efficient and scalable deep learning training system. In: 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pp. 571–582. USENIX Association, Broomfield, CO (2014)
Chowdhury, M., Stoica, I.: Efficient coflow scheduling without prior knowledge, pp. 393–406 (2015)
Comar, P.M., Liu, L., Saha, S., Tan, P.N., Nucci, A.: Combining supervised and unsupervised learning for zero-day malware detection. In: 2013 Proceedings IEEE INFOCOM, pp. 2022–2030 (2013)
Cotter, A., Shamir, O., Srebro, N., Sridharan, K.: Better mini-batch algorithms via accelerated gradient methods. In: Shawe-Taylor, J., Zemel, R.S., Bartlett, P.L., Pereira, F., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 24, pp. 1647–1655. Curran Associates, Inc. (2011)
Das, A.K., Pathak, P.H., Chuah, C.N., Mohapatra, P.: Contextual localization through network traffic analysis. In: INFOCOM, 2014 Proceedings IEEE, pp. 925–933. IEEE (2014)
Dean, J., Corrado, G.S., Monga, R., Chen, K., Devin, M., Le, Q.V., Mao, M.Z., Ranzato, M., Senior, A., Tucker, P., Yang, K., Ng, A.Y.: Large scale distributed deep networks. In: NIPS'12, pp. 1223–1231. Curran Associates Inc., USA (2012)
Dong, M., Li, Q., Zarchy, D., Godfrey, P.B., Schapira, M.: PCC: re-architecting congestion control for consistent high performance. In: NSDI, vol. 1, p. 2 (2015)
Fontugne, R., Borgnat, P., Abry, P., Fukuda, K.: MAWILab: combining diverse anomaly detectors for automated anomaly labeling and performance benchmarking. In: International Conference on CoNEXT, pp. 1–12 (2010)
Foundation, T.A.S.: Hadoop project. http://hadoop.apache.org/core/ (2009)
Franc, V., Sofka, M., Bartos, K.: Learning detector of malicious network traffic from weak labels. In: Proceedings, Part III, of the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD 2015), vol. 9286, pp. 85–99. Springer, New York, NY, USA (2015)
Furno, A., Fiore, M., Stanica, R.: Joint spatial and temporal classification of mobile traffic demands. In: INFOCOM 2017—36th Annual IEEE International Conference on Computer Communications (2017)
Guo, C., Lu, G., Li, D., Wu, H., Zhang, X., Shi, Y., Tian, C., Zhang, Y., Lu, S.: BCube: a high performance, server-centric network architecture for modular data centers, pp. 63–74 (2009)
Hayes, J., Danezis, G.: k-fingerprinting: a robust scalable website fingerprinting technique. In: 25th USENIX Security Symposium (USENIX Security 16), pp. 1187–1203. USENIX Association, Austin, TX (2016)
He, T., Goeckel, D., Raghavendra, R., Towsley, D.: Endhost-based shortest path routing in dynamic networks: an online learning approach. In: INFOCOM, 2013 Proceedings IEEE, pp. 2202–2210 (2013)
Hsieh, K., Harlap, A., Vijaykumar, N., Konomis, D., Ganger, G.R., Gibbons, P.B., Mutlu, O.: Gaia: geo-distributed machine learning approaching LAN speeds. In: 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17), pp. 629–647. USENIX Association, Boston, MA (2017)
Jayaraj, A., Venkatesh, T., Murthy, C.S.R.: Loss classification in optical burst switching networks using machine learning techniques: improving the performance of TCP. IEEE J. Sel. Areas Commun. 26(6), 45–54 (2008)
Jia, C., Liu, J., Jin, X., Lin, H., An, H., Han, W., Wu, Z., Chi, M.: Improving the performance of distributed TensorFlow with RDMA. Int. J. Parallel Program. 3, 1–12 (2017)
Jiang, J., Sekar, V., Milner, H., Shepherd, D., Stoica, I., Zhang, H.: CFA: a practical prediction system for video QoE optimization. In: NSDI, pp. 137–150 (2016)
Li, D., Chen, C., Guan, J., Zhang, Y., Zhu, J., Yu, R.: DCloud: deadline-aware resource allocation for cloud computing jobs. IEEE Trans. Parallel Distrib. Syst. 27(8), 2248–2260 (2016)
Li, M., Andersen, D.G., Park, J.W., Smola, A.J., Ahmed, A., Josifovski, V., Long, J., Shekita, E.J., Su, B.Y.: Scaling distributed machine learning with the parameter server. In: Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (OSDI'14), pp. 583–598. USENIX Association, Berkeley, CA, USA (2014a)
Li, M., Andersen, D.G., Smola, A.J., Yu, K.: Communication efficient distributed machine learning with the parameter server. In: International Conference on Neural Information Processing Systems, pp. 19–27. MIT Press, Cambridge (2014b)
Li, M., Zhang, T., Chen, Y., Smola, A.J.: Efficient mini-batch training for stochastic optimization. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 661–670. ACM (2014c)
Li, W., Zhou, F., Meleis, W., Chowdhury, K.: Learning-based and data-driven TCP design for memory-constrained IoT. In: Distributed Computing in Sensor Systems, pp. 199–205. IEEE (2016)
Li, X., Bian, F., Crovella, M., Diot, C., Govindan, R., Iannaccone, G., Lakhina, A.: Detection and identification of network anomalies using sketch subspaces. In: ACM SIGCOMM Conference on Internet Measurement, pp. 147–152 (2006)
Lian, X., Zhang, C., Zhang, H., Hsieh, C.J., Zhang, W., Liu, J.: Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent (2017)
Liu, D., Zhao, Y., Sui, K., Zou, L., Pei, D., Tao, Q., Chen, X., Tan, D.: FOCUS: shedding light on the high search response time in the wild. In: IEEE INFOCOM 2016—The IEEE International Conference on Computer Communications, pp. 1–9 (2016)
Liu, D., Zhao, Y., Xu, H., Sun, Y., Pei, D., Luo, J., Jing, X., Feng, M.: Opprentice: towards practical and automatic anomaly detection through machine learning. Tokyo, Japan (2015)
Low, Y., Gonzalez, J.E., Kyrola, A., Bickson, D., Guestrin, C., Hellerstein, J.M.: GraphLab: a new framework for parallel machine learning. CoRR abs/1408.2041 (2014)
Lutu, A., Bagnulo, M., Cid-Sueiro, J., Maennel, O.: Separating wheat from chaff: winnowing unintended prefixes using machine learning. In: IEEE INFOCOM 2014—IEEE Conference on Computer Communications, pp. 943–951 (2014)
Ma, S., Jiang, J., Li, B., Li, B.: Maximizing container-based network isolation in parallel computing clusters. In: IEEE International Conference on Network Protocols, pp. 1–10 (2016)
Mao, H., Alizadeh, M., Menache, I., Kandula, S.: Resource management with deep reinforcement learning. In: HotNets '16, pp. 50–56. ACM, New York, NY, USA (2016)
Mao, H., Netravali, R., Alizadeh, M.: Neural adaptive video streaming with Pensieve, pp. 197–210. ACM (2017)
Mirza, M., Sommers, J., Barford, P., Zhu, X.: A machine learning approach to TCP throughput prediction. In: ACM SIGMETRICS Performance Evaluation Review, vol. 35, pp. 97–108. ACM (2007)


NVIDIA: GPU applications: transforming computational research and engineering. http://www.nvidia.com/object/machine-learning.html (2017)
NVIDIA: Developing a Linux kernel module using GPUDirect RDMA. https://docs.nvidia.com/cuda/gpudirect-rdma/index.html (2018)
NVIDIA: NVLink fabric. https://www.nvidia.com/en-us/data-center/nvlink/ (2018)
NVIDIA: NVIDIA DGX-1: the fastest deep learning system. https://devblogs.nvidia.com/parallelforall/dgx-1-fastest-deep-learning-system/ (2017)
Nandi, A., Mandal, A., Atreja, S., Dasgupta, G.B., Bhattacharya, S.: Anomaly detection using program control flow graph mining from execution logs. In: KDD '16, pp. 215–224. ACM, New York, NY, USA (2016)
Neuvirth, H., Finkelstein, Y., Hilbuch, A., Nahum, S., Alon, D., Yom-Tov, E.: Early detection of fraud storms in the cloud. In: Proceedings, Part III, of the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD 2015), vol. 9286, pp. 53–67. Springer-Verlag New York, Inc., New York, NY, USA (2015)
Nunes, B.A., Veenstra, K., Ballenthin, W., Lukin, S., Obraczka, K.: A machine learning approach to end-to-end RTT estimation and its application to TCP, pp. 1–6. IEEE (2011)
Research, B.: Bringing HPC techniques to deep learning. http://research.baidu.com/bringing-hpc-techniques-deep-learning/ (2017)
Santiago del Rio, P.M., Rossi, D., Gringoli, F., Nava, L., Salgarelli, L., Aracil, J.: Wire-speed statistical classification of network traffic on commodity hardware, pp. 65–72. ACM (2012)
Sivaraman, A., Winstein, K., Thaker, P., Balakrishnan, H.: An experimental study of the learnability of congestion control. In: ACM SIGCOMM Computer Communication Review, vol. 44, pp. 479–490. ACM (2014)
Soska, K., Christin, N.: Automatically detecting vulnerable websites before they turn malicious. In: 23rd USENIX Security Symposium (USENIX Security 14), pp. 625–640. USENIX Association, San Diego, CA (2014)
Soule, A., Taft, N.: Combining filtering and statistical methods for anomaly detection. In: Conference on Internet Measurement 2005, Berkeley, California, USA, pp. 31–31 (2005)
Stringhini, G., Kruegel, C., Vigna, G.: Shady paths: leveraging surfing crowds to detect malicious web pages. In: CCS '13, pp. 133–144. ACM, New York, NY, USA (2013)
Sun, Y., Yin, X., Jiang, J., Sekar, V., Lin, F., Wang, N., Liu, T., Sinopoli, B.: CS2P: improving video bitrate selection and adaptation with data-driven throughput prediction. In: Proceedings of the 2016 ACM SIGCOMM Conference, pp. 272–285. ACM (2016)
Tan, H., Han, Z., Li, X., Lau, F.C.M.: Online job dispatching and scheduling in edge-clouds (2017)
Taylor, V.F., Spolaor, R., Conti, M., Martinovic, I.: AppScanner: automatic fingerprinting of smartphone apps from encrypted network traffic. In: 2016 IEEE European Symposium on Security and Privacy (EuroS&P), pp. 439–454 (2016)
Wang, G., Wang, T., Zheng, H., Zhao, B.Y.: Man vs. machine: practical adversarial detection of malicious crowdsourcing workers. In: 23rd USENIX Security Symposium (USENIX Security 14), pp. 239–254. USENIX Association, San Diego, CA (2014)
Wang, W., Zhang, Q.: A stochastic game for privacy preserving context sensing on mobile phone. In: IEEE INFOCOM 2014—IEEE Conference on Computer Communications, pp. 2328–2336 (2014)
Wang, Z.: The applications of deep learning on traffic identification. BlackHat USA (2015)
Wei, J., Dai, W., Qiao, A., Ho, Q., Cui, H., Ganger, G.R., Gibbons, P.B., Gibson, G.A., Xing, E.P.: Managed communication and consistency for fast data-parallel iterative analytics, pp. 381–394. ACM (2015)
Winstein, K., Balakrishnan, H.: TCP ex machina: computer-generated congestion control. In: ACM SIGCOMM Computer Communication Review, vol. 43, pp. 123–134. ACM (2013)
Xiao, W., Xue, J., Miao, Y., Li, Z., Chen, C., Wu, M., Li, W., Zhou, L.: Tux2: distributed graph computation for machine learning. In: 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17), pp. 669–682. USENIX Association, Boston, MA (2017)
Xie, D., Ding, N., Hu, Y.C., Kompella, R.: The only constant is change: incorporating time-varying network reservations in data centers. ACM SIGCOMM Comput. Commun. Rev. 42(4), 199–210 (2012)
Xing, E.P., Ho, Q., Dai, W., Kim, J.K., Wei, J., Lee, S., Zheng, X., Xie, P., Kumar, A., Yu, Y.: Petuum: a new platform for distributed machine learning on big data. IEEE Trans. Big Data 1(2), 49–67 (2015)
Xu, Q., Liao, Y., Miskovic, S., Mao, Z.M., Baldi, M., Nucci, A., Andrews, T.: Automatic generation of mobile app signatures from traffic observations. In: 2015 IEEE Conference on Computer Communications (INFOCOM), pp. 1481–1489 (2015)
Xu, Y., Yao, J., Jacobsen, H.A., Guan, H.: Cost-efficient negotiation over multiple resources with reinforcement learning. Barcelona, Spain (2016)
Yamada, M., Kimura, A., Naya, F., Sawada, H.: Change-point detection with feature selection in high-dimensional time-series data. J. Catalysis 111(1), 50–58 (2013)
Yi, B., Xia, J., Chen, L., Chen, K.: Towards zero copy dataflows using RDMA. In: Proceedings of the SIGCOMM Posters and Demos 2017. ACM (2017)
Zadrozny, B.: Learning and evaluating classifiers under sample selection bias. In: ICML '04, p. 114. ACM, New York, NY, USA (2004)
Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. In: Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing (HotCloud'10), pp. 10–10. USENIX Association, Berkeley, CA, USA (2010)
Zhang, H., Zheng, Z., Xu, S., Dai, W., Ho, Q., Liang, X., Hu, Z., Wei, J., Xie, P., Xing, E.P.: Poseidon: an efficient communication architecture for distributed deep learning on GPU clusters. In: 2017 USENIX Annual Technical Conference (USENIX ATC 17), pp. 181–193. USENIX Association, Santa Clara, CA (2017)
Zhang, R., Qi, W., Wang, J.: Cross-VM covert channel risk assessment for cloud computing: an automated capacity profiler. In: 2014 IEEE 22nd International Conference on Network Protocols, pp. 25–36 (2014)
Zhang, X., Wu, C., Li, Z., Lau, F.C.M.: Proactive VNF provisioning with multi-timescale cloud resources: fusing online learning and online optimization. In: IEEE INFOCOM 2017—IEEE Conference on Computer Communications (INFOCOM), pp. 1–9. IEEE (2017)
Zhang, Z., Zhang, Z., Lee, P.P., Liu, Y., Xie, G.: ProWord: an unsupervised approach to protocol feature word extraction. In: INFOCOM, 2014 Proceedings IEEE, pp. 1393–1401. IEEE (2014)
Zheng, R., Le, T., Han, Z.: Approximate online learning for passive monitoring of multi-channel wireless networks. Proc. IEEE INFOCOM 12(11), 3111–3119 (2013)
Zheng, N., Bai, K., Huang, H., Wang, H.: You are how you touch: user verification on smartphones via tapping behaviors. In: 2014 IEEE 22nd International Conference on Network Protocols, pp. 221–232 (2014)
Zhou, X., Wang, K., Jia, W., Guo, M.: Reinforcement learning-based adaptive resource management of differentiated services in geo-distributed data centers. Barcelona, Spain (2016)
Zhu, J., Li, D., Wu, J., Liu, H., Zhang, Y., Zhang, J.: Towards bandwidth guarantee in multi-tenancy cloud computing networks. In: IEEE International Conference on Network Protocols, pp. 1–10 (2012)


Yang Cheng is currently a Ph.D. student in the Department of Computer Science, Tsinghua University. His research interests include networking systems and distributed machine learning systems.

Jinkun Geng is currently a master student in the Department of Computer Science, Tsinghua University. His research interests include high-performance networking, data center networking, and large-scale distributed machine learning.

Yanshu Wang is currently a Ph.D. student in the Department of Computer Science, Tsinghua University. His research interests include network congestion control and machine learning.

Junfeng Li received the B.E. degree in information engineering from Xi'an Jiaotong University, Xi'an, China, in 2015. He is currently pursuing the Ph.D. degree with the Department of Computer Science and Technology, Tsinghua University, Beijing, China. His research interests include data center networking, software defined networking and cloud computing.

Dan Li received the Ph.D. degree in computer science from Tsinghua University, Beijing, China, in 2007. He is an Associate Professor with the Computer Science Department, Tsinghua University. His research interests include Future Internet architecture and data center networking.

Jianping Wu received the master and doctoral degrees in computer science from Tsinghua University, Beijing, China, in 1997. He is now a Full Professor with the Computer Science Department, Tsinghua University. In the research areas of network architecture, high performance routing and switching, protocol testing, and formal methods, he has published more than 200 technical papers in academic journals and proceedings of international conferences.

