Machine Learning For Cyber Security, 1st Edition, by Preeti Malik, Lata Nautiyal, Mangey Ram (ISBN 3110766736, 978-3110766738)
Machine Learning for Cyber Security
De Gruyter Series on the
Applications of Mathematics
in Engineering and
Information Sciences
Edited by
Mangey Ram
Volume 15
Machine Learning
for Cyber Security
Edited by
Preeti Malik, Lata Nautiyal and Mangey Ram
Editors
Dr. Preeti Malik
Graphic Era University
CSIT Block
Bell Road, Clement Town
Dehradun 248001
Uttarakhand
India
[email protected]
[email protected]
ISBN 978-3-11-076673-8
e-ISBN (PDF) 978-3-11-076674-5
e-ISBN (EPUB) 978-3-11-076676-9
ISSN 2626-5427
www.degruyter.com
Preface
Cyber threats are today among the most expensive losses that an organization can
face. It is now impossible to deploy effective cybersecurity technology without
relying heavily on advanced techniques such as machine learning and deep learning.
Cybersecurity is a growing challenge in the era of the Internet. This book addresses questions
of how machine learning methods can be used to advance cybersecurity objectives,
including detection, modeling, monitoring, and analysis of as well as defense
against various threats to sensitive data and security systems. Filling an important
gap between machine learning and cybersecurity communities, it discusses topics
covering a wide range of modern and practical machine learning techniques, frame-
works, and development tools to enable readers to engage with the cutting-edge re-
search across various aspects of cybersecurity. The book focuses on mature and
proven techniques, and provides ample examples to help readers grasp the key
points. This cybersecurity book presents and demonstrates popular and successful
artificial intelligence approaches and models that you can adapt to detect potential
attacks and protect your corporate systems.
This book will help readers put intelligent solutions to current cybersecurity
concerns into practice and create cutting-edge implementations that meet the
demands of ever more complex organizational structures. By the time you finish
reading this book, you will be able to create and employ machine learning
algorithms to mitigate cybersecurity risks.
https://doi.org/10.1515/9783110766745-202
Contents
Preface V
List of contributors IX
Editor’s biography XI
Sangeeta Mittal
A review of machine learning techniques in cybersecurity and research
opportunities 91
Vasu Thakur, Vikas Kumar Roy, Nikhil Baliyan, Nupur Goyal, Rahul Nijhawan
A framework for seborrheic keratosis skin disease identification using Vision
Transformer 117
Index 145
List of contributors
1. Preeti Malik
Graphic Era Deemed to be University, Dehradun, India
Email: [email protected]

2. Varsha Mittal
Graphic Era Deemed to be University, Dehradun, India
Email: [email protected]

3. Mohit Mittal
INRIA Labs, France
Email: [email protected]

4. Kamika Chaudhary
MB Govt. PG College, Haldwani, India
Email: [email protected]

5. Abdul Rahman
Middlesex University Dubai, Dubai, UAE

6. Krishnadas Nanath
Middlesex University Dubai, Dubai, UAE
Email: [email protected]

7. Kiran Aswal
Gurukul Kangri Viswavidyalaya, Dehradun Campus, Uttarakhand, India
Email: [email protected]

10. Samuel Wedaj Kibret
Indian Institute of Technology Delhi, New Delhi, India
Email: [email protected], [email protected]

11. Sangeeta Mittal
Jaypee Institute of Information Technology, Noida, Uttar Pradesh, India
Email: [email protected]

12. Vasu Thakur
Department of Computer Science and Engineering, Roorkee Institute of Technology, Roorkee, India

13. Vikas Kumar Roy
Department of Computer Science and Engineering, Roorkee Institute of Technology, Roorkee, India
Email: [email protected]

14. Nikhil Baliyan
Department of Computer Science and Engineering, Roorkee Institute of Technology, Roorkee, Uttarakhand, India
https://doi.org/10.1515/9783110766745-204
Editor's biography
Prof. Dr. Mangey Ram received his Ph.D. major in mathematics and minor
in computer science from G. B. Pant University of Agriculture and
Technology, Pantnagar, India. He has been a faculty member for around 12
years and has taught several core courses in pure and applied mathematics
at undergraduate, postgraduate, and doctorate levels. He is currently a
research professor at Graphic Era (Deemed to be University), Dehradun,
India. Before joining the Graphic Era, he was a deputy manager
(probationary officer) with Syndicate Bank for a short period. He is editor in
chief of International Journal of Mathematical, Engineering and Management Sciences, book series
editor with Elsevier, CRC Press (a Taylor & Francis Group), De Gruyter (Germany), and River
Publisher (USA), and a guest editor and editorial board member of various journals. He has
published more than 225 research publications in IEEE, Taylor & Francis, Springer, Elsevier, Emerald,
World Scientific, and many other national and international journals and conferences. His fields of
research are reliability theory and applied mathematics. Dr. Ram is a senior member of the IEEE,
life member of Operational Research Society of India, Society for Reliability Engineering, Quality
and Operations Management in India, and Indian Society of Industrial and Applied Mathematics.
He has been a member of the organizing committee of a number of international and national
conferences, seminars, and workshops. He was conferred the Young Scientist Award by the
Uttarakhand State Council for Science and Technology, Dehradun, in 2009. He has been awarded
the Best Faculty Award in 2011, Research Excellence Award in 2015, and recently Outstanding
Researcher Award in 2018 for his significant contributions in academics and research at Graphic
Era Deemed to be University, Dehradun, India.
https://doi.org/10.1515/9783110766745-205
Preeti Malik✶, Varsha Mittal, Mohit Mittal, Kamika
Differential privacy: a solution to privacy
issue in social networks
Abstract: The privacy of social network data is becoming increasingly important,
threatening to limit access to this lucrative data source. The topological structure of
social networks can provide useful information for revenue generation and social
science research, but it is challenging to ensure that this analysis does not breach
individual privacy. Differential privacy is a prominent privacy paradigm in data
mining over tabular data that employs noise to disguise individuals’ contributions
to aggregate findings and provides a strong analytical guarantee that individuals'
presence in the dataset is hidden. Because social network analysis has
multiple applications, it opens up a new field for differential privacy applications.
This article provides a thorough examination of the fundamental principles of dif-
ferential privacy and their applications in computing.
✶Corresponding author: Preeti Malik, Graphic Era Deemed to be University, Dehradun, India,
e-mail: [email protected]
Varsha Mittal, Graphic Era Deemed to be University, Dehradun, India
Mohit Mittal, INRIA Labs, France
Kamika, MB Govt. PG College, Haldwani, India
https://doi.org/10.1515/9783110766745-001
In July 2020, DataReportal published a special report examining changes in social
media activity at the start of the COVID-19 lockdown period, in addition to its
usual surveys. The amount of Internet and digital activity has increased
dramatically (see Figure 1).
1.2.1 Pros
[Source: https://raisingchildren.net.au/teens/entertainment-technology/digital-life/social-media, accessed on 25-2-2022.]
– Mental health and well-being: Interacting with people and friends on social
media gives your kid a sense of belonging and connection.
1.2.2 Cons
Social media may sometimes be dangerous. The risks for your kids include:
– Encountering aggressive or distressing content, such as harsh, offensive, violent,
or sexual remarks or pictures.
– Sharing inappropriate content, for instance, pictures or videos that are
embarrassing or suggestive.
– Sharing personal information on social media with strangers, for example, a
contact number, birth date, or address. Privacy settings can limit who can view
information about your kids, such as their name, age, and where they live;
otherwise, anyone can misuse this information.
– Becoming a victim of cyberbullying.
anonymity [10]. All these models protect against a certain form of attack but are
unable to defend against newly invented attacks: the security of each model rests
on a hypothesis about the attacker's specific background knowledge, which is the
primary source of this flaw. Nonetheless, enumerating all conceivable sorts of an
attacker's background knowledge is very hard. As a result, a model that preserves
privacy regardless of background knowledge is highly desirable.
Definition (Mapping query). In a set of individual profiles (P) in a social network G, find which
profile p maps to a particular individual i. Return p.
Definition (Existence query). For a particular individual i, find if this individual has a profile p in
the network G. Return true or false.
Definition (Co-reference resolution query). For two individual profiles pi and pj, find if they refer
to the same individual p. Return true or false.
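The three queries above can be sketched as functions over a set of profiles. The `Profile` record and the rule for matching an individual to a profile are illustrative assumptions, not part of the chapter:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Profile:
    """A profile with a few observable attributes (illustrative)."""
    name: str
    city: str

def mapping_query(profiles, individual):
    """Mapping query: find the profile p in P that maps to individual i."""
    for p in profiles:
        if p.name == individual.name and p.city == individual.city:
            return p
    return None

def existence_query(profiles, individual):
    """Existence query: does individual i have a profile in the network?"""
    return mapping_query(profiles, individual) is not None

def coreference_query(p_i, p_j):
    """Co-reference resolution query: do p_i and p_j refer to the same
    individual?"""
    return p_i.name == p_j.name and p_i.city == p_j.city

P = {Profile("Alice", "Dehradun"), Profile("Bob", "Noida")}
alice = Profile("Alice", "Dehradun")
print(existence_query(P, alice))  # True
```

An attacker who can answer the mapping query can trivially answer the other two, which is why identity disclosure is treated as the strongest of the three.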
To put it another way, identity disclosure means that the attacker can correctly and
confidently answer the mapping query. This becomes possible when the attacker
knows unique properties of individual p that can be matched against the observable
attributes of profiles in P. One technique to formalize identity disclosure for an
individual p is to assign a random variable v_p that spans all of the network
profiles. We suppose that the attacker knows how to compute:
Pr(v_p = p_i)
There are three sorts of personal attributes, according to a prevalent theory in the
privacy literature:
– Identifying attributes – qualities that uniquely identify a person, such as a social
security number (SSN).
– Quasi-identifying attributes – a set of traits that may be used to uniquely iden-
tify a person, for example, person’s name and his/her address.
– Sensitive attributes – characteristics that an individual would like to keep pri-
vate, like political affiliation.
When an adversary discovers the presence of a sensitive association between two users
that they would prefer to keep secret from the community, this is known as social
link disclosure. We suppose that a random variable e_{i,j} is associated with the
presence of a connection between two nodes n_i and n_j, and that the attacker has a
method for assigning a probability to e_{i,j}, Pr(e_{i,j} = true): e_{i,j} → R,
similar to the earlier types of leaks.
Social networks, communication data, medical data, and other data sources all
contain examples of sensitive interactions. Based on a person’s friendship links and
the public likings of their friends, it may be feasible to deduce the person’s own
[Figure 2 (sensitive links): nodes Alice, Unknown, Barbie, Ken, and Duke, with links labeled "father of" and "is diabetic".]
preferences from social network data. In mobile phone communication data, dis-
covering that an anonymous person has made phone calls to a cell phone number
of a recognized organization might compromise the unknown person’s identity.
In hereditary disease data, knowing the familial links between persons diagnosed
with genetic illnesses and those who are healthy can help extrapolate the
likelihood of the healthy persons developing these disorders.
Researchers have looked into social network attacks that disclose sensitive link-
ages [28–31]. Figure 2 shows some examples of sensitive links. Recent research has
also focused on sensitive edge features such as link strength [32, 33].
One more type of relational data privacy infringement is affiliation link disclosure,
which reveals whether an individual belongs to a certain affiliation group. It may
also be sensitive to learn that two users are members of the same group. This type of
disclosure can lead to the other three types of disclosure. As a result, preserving
one's privacy requires concealing one's affiliations.
Again, let us suppose that there is a random variable e_{p,h} associated with the
presence of an affiliation link between a profile p and a group h, and that an
attacker has a method to compute the probability of e_{p,h}, Pr(e_{p,h} = true): e_{p,h} → R.
It is possible that one form of disclosure will give a hint of another. Wondracek et al.
[19], for example, demonstrate a de-identification attack in which the revelation of
an affiliation connection can lead to the identity of a seemingly unidentified Internet
client. An attacker begins the assault by scanning a social networking site
and gathering facts and data about its users' affiliations in online social groups.
The identity of the social network users is considered to be known. Based on the
information gathered, each user who belongs to at least one group has a group
signature, which is the list of the groups to which he belongs. The adversary then
performs a history-stealing attack (for additional information on the assault, see
[19]), which captures the target Internet user's online surfing history.
Search data is an illustration of affiliation link disclosure that can lead to
identity exposure. If we consider users who submit search queries to a search
engine to be members of the social network, and the search queries they submit
to represent affiliation groups, then revealing the relationships between
submitted queries and users can aid the attacker in identifying members of the
network. Users engage with search engines in an unrestricted manner, disclosing a
great deal of personal data in the content of their requests. In 2006, an Internet ser-
vice provider, AOL, provided an “anonymized” sample of nearly half a million cus-
tomers and their queries to the AOL search engine, causing a scandal. The release
was well-meant, with the goal of augmenting search ranking studies with real-
world data.
One of the issues with the provided data was that, despite being in table format,
the items were not self-contained. Shortly after the data was released, reporters
from the New York Times connected 454 search requests made by the same person,
which revealed enough personal information to identify that person – Thelma Ar-
nold, a 62-year-old widow from Lilburn, Georgia [34]. Her inquiries included infor-
mation on others with the same last name as hers, retirement, and her location.
As shown in a guilt-by-association assault [35], affiliation link revelation can
also lead to attribute disclosure. This attack implies that there exist groups of users
with the same sensitive attribute values; therefore, retrieving one user’s sensitive
value and the affiliation of another user to the group can assist in recovering the
sensitive value of the second user. This exploit was used to learn about users’
downloading patterns on the BitTorrent file-sharing network [36]. Communities
were discovered through social connections, and watching only one person in each
group was enough to deduce the interests of the others. The sensitive attribute that
consumers would wish to keep hidden in this scenario is whether or not they are
violating copyrights. This technique has also been used in a phone network to iden-
tify fake callers [35]. Data anonymization was used by Cormode et al. [37] to prevent
affiliation connection revelation.
4.2 k-Anonymity
When there is minimal variance in the sensitive characteristics, the adversary can
detect the value of the sensitive attribute for that collection of k-records using the
homogeneity attack. For example, a politician seeking election to a position in state
government uses his/her opponent’s medical background to show the public that
his/her opponent is unable to fulfill his/her responsibilities as an agent of the state
owing to his/her medical issues. He/she searches the hospital's disclosed,
three-anonymized table for the opponent's medical information. Even though the
table is three-anonymized, he/she can detect which disease the opponent has,
because he/she has some background knowledge about the opponent and there is
little contrast (low variation) in the sensitive attribute. For example, if he/she
knows that the patient is a 25-year-old American who resides in postal division
11003, he/she can deduce from this information that the competitor has heart disease.
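The homogeneity attack can be sketched on a toy three-anonymized table; all records, quasi-identifier encodings, and diseases here are invented for illustration:

```python
from collections import defaultdict

# A toy three-anonymized table: (age range, postal prefix) are the
# quasi-identifiers, disease is the sensitive attribute.
records = [
    ("2*", "110**", "heart disease"),
    ("2*", "110**", "heart disease"),
    ("2*", "110**", "heart disease"),   # homogeneous class
    ("4*", "148**", "flu"),
    ("4*", "148**", "cancer"),
    ("4*", "148**", "heart disease"),   # diverse class
]

def homogeneity_attack(records, quasi_id):
    """If every record matching the victim's quasi-identifiers carries
    the same sensitive value, k-anonymity leaks that value."""
    classes = defaultdict(set)
    for age, postal, disease in records:
        classes[(age, postal)].add(disease)
    values = classes.get(quasi_id, set())
    return next(iter(values)) if len(values) == 1 else None

# The attacker knows the victim is in his/her twenties in postal area 110**:
print(homogeneity_attack(records, ("2*", "110**")))  # heart disease
```

Note that the attack fails on the second equivalence class, whose sensitive values are diverse; this is exactly the deficiency that l-diversity and t-closeness address.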
In this attack the adversary uses background knowledge, and we will show that
k-anonymity does not ensure privacy against background knowledge attacks. For
example, a woman whose colleague's father is ill wants to discover the nature of
the illness. She knows that her coworker's father is elderly and Mexican, so she
may deduce that he is suffering from either vitamin D deficiency or Alzheimer's
disease. Nonetheless, it is recognized that, for the most part, Mexicans are unaffected
4.5 l-Diversity
4.6 t-Closeness
An equivalence class is said to have t-closeness if the distance between the distribution of a sen-
sitive attribute in this class and the distribution of the attribute in the whole table is no more
than a threshold t. A table is said to have t-closeness if all equivalence classes have t-closeness.
An equivalence class is a set of data that have the same values for their quasi-
identifiers.
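The t-closeness condition can be checked directly. This sketch uses the total variation distance as the distance measure (an assumption for simplicity; the original t-closeness proposal uses the Earth Mover's Distance) and invented sensitive values:

```python
from collections import Counter

def distribution(values):
    """Empirical distribution of a list of sensitive values."""
    counts = Counter(values)
    n = len(values)
    return {v: counts[v] / n for v in counts}

def tv_distance(p, q):
    """Total variation distance between two distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def has_t_closeness(classes, t):
    """classes: one list of sensitive values per equivalence class.
    The table has t-closeness if every class distribution lies within
    t of the distribution over the whole table."""
    whole = distribution([v for cls in classes for v in cls])
    return all(tv_distance(distribution(cls), whole) <= t for cls in classes)

classes = [["flu", "flu", "cancer"], ["cancer", "flu", "flu"]]
print(has_t_closeness(classes, 0.2))  # True: both classes match the table
```

Because each class's distribution here equals the whole-table distribution, an attacker who locates the victim's equivalence class learns nothing beyond what the table as a whole reveals.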
5 Differential privacy
Cynthia Dwork of Microsoft Research Labs invented differential privacy [42]. It is a
mathematical promise of privacy that sufficiently well-privatized queries may meet,
rather than a specific approach or procedure. Consider the following scenario in social
science research: Individual data from the surveys are combined into a dataset and
some analysis is done over it; the analysis may be privatized by injecting random
noise; and the final privatized result is published to the wider public. Differentially
private inquiries provide survey participants with a mathematical guarantee that the
results will not expose their involvement in the survey.
The aim behind differential privacy is to incorporate a controlled amount of sta-
tistical noise into results of the query to disguise the impact of a single individual
being added to or removed from a dataset. That is, when an attacker queries two
nearly identical datasets (differing by only one record, for instance), the outcomes
are differentially privatized so that the attacker is unable, with high probability,
to discover any new information about an individual.
Let f stand for a query function that will be evaluated on the dataset D. We
want an algorithm A to run on a dataset D and output A(D), with A(D) being f(D)
with a regulated amount of random noise added. The purpose of differential privacy
is to get A(D) as near to f(D) as feasible to maintain data usefulness while also pro-
tecting the privacy of the dataset’s entities.
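The mechanism just described, A(D) = f(D) + z, can be sketched for a simple counting query. The Laplace mechanism shown here is the classic way to calibrate the noise z; the dataset and query are invented for illustration:

```python
import random

def laplace_noise(scale):
    """Laplace(0, scale) noise, sampled as the difference of two
    exponential variables with mean `scale`."""
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def private_count(dataset, predicate, epsilon):
    """A(D) = f(D) + z for a counting query f. Adding or removing one
    record changes a count by at most 1 (sensitivity 1), so Laplace
    noise with scale 1/epsilon yields epsilon-differential privacy."""
    true_count = sum(1 for row in dataset if predicate(row))
    return true_count + laplace_noise(1.0 / epsilon)

# Hypothetical survey data: ages of respondents.
ages = [23, 35, 41, 29, 52, 61, 37]
noisy = private_count(ages, lambda a: a >= 40, epsilon=0.5)
```

A smaller ε means a larger noise scale 1/ε, which is the privacy/utility trade-off discussed in the next paragraphs.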
Differential privacy is primarily concerned with adversarial attacks that query
databases differing by only a few elements. Differential privacy is divided into two
types, unbounded and bounded, as defined by the concept of neighboring datasets [43].
Unbounded means that for two datasets D and D′, D′ may be produced by adding
or removing a tuple from D. It is said to be bounded if D′ may be produced by
altering the value of a tuple in D; that is, bounded neighboring datasets have the
same size, while unbounded neighboring datasets differ in size by one. Although
the presentation of query results for unbounded and bounded neighboring datasets
differs slightly, the concepts of constructing and assessing differential privacy
methods remain the same. As a result, we use both types of neighboring datasets in
this chapter to demonstrate the introduced differential privacy techniques.
Pr[A(D) ∈ S] ≤ e^ϵ · Pr[A(D′) ∈ S]
The more privacy is preserved, the lower the value of ϵ, because more noise must be
added; a smaller value gives more privacy preservation at the cost of reduced data
accuracy. When ϵ = 0, the level of privacy protection is at its highest, that is,
"complete" protection. In this situation the approach produces two outcomes with
indistinguishable distributions; however, the resulting findings provide no useful
information about the dataset. As a result, the value of ϵ should strike a balance
between privacy and data utility. In real applications, ϵ usually takes very small
values such as 0.01, 0.1, or ln 2, ln 3 [42]. In some cases, computing ϵ-differential
privacy can be difficult, and an extended notion of differential privacy has been
proposed to aid approximation.
[Figure: the differential privacy mechanism; the user issues a query f to the database, and the mechanism returns the perturbed answer A(D) = f(D) + z.]
For a privatized query PQ, node privacy is preserved when differential privacy is
satisfied for each pair of graphs, and can be explained as follows:
Let G1 and G2 be two graphs with (V1, E1) and (V2, E2) as their sets of vertices
and edges, respectively, such that
|(V1 ∪ V2) \ (V1 ∩ V2)| = 1
and
(E1 ∪ E2) \ (E1 ∩ E2) = {(u, v) | u = y ∨ v = y}
Here y is the node in (V1 ∪ V2) \ (V1 ∩ V2), and the edge between nodes u and v is
represented by (u, v). Node privacy assures full protection to both participants
and subjects: an attacker with R will be unable to determine whether or not a
person y exists in the population. The queries we can compute are severely limited
as a result of this.
In this type of differential privacy, a neighboring graph G′ of a given social network
G is derived by removing or introducing a node and all edges incident to that node.
The goal of node differential privacy is to prevent an attacker from identifying
whether a specific node x exists in the graph. It ensures privacy for individuals and
relationships at the same time, instead of a single relationship, at the expense of
rigid query constraints and lower-accuracy outcomes. Under node privacy, a
differentially private algorithm must hide the worst-case disparity between
neighboring graphs, which can be significant. For instance, consider a star graph
where one node is connected to all other nodes. Such a graph has high sensitivity,
since adding or removing the central node changes a large part of the graph and
correspondingly more noise must be added. Because of this high sensitivity, node
privacy cannot provide accurate network analysis, but the required privacy
protection can be obtained [45].
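The star-graph example can be made concrete: under node privacy, a neighboring graph is obtained by deleting a node together with its incident edges, and deleting the hub of a star changes the edge count by n − 1. This sketch uses a plain adjacency-set representation, which is an assumption (the chapter prescribes no data structure):

```python
def remove_node(graph, y):
    """Node-privacy neighbor of G: drop node y and every edge incident
    to it. graph: dict mapping node -> set of adjacent nodes (undirected)."""
    return {u: {v for v in nbrs if v != y}
            for u, nbrs in graph.items() if u != y}

def edge_count(graph):
    """Each undirected edge is stored twice, so halve the total."""
    return sum(len(nbrs) for nbrs in graph.values()) // 2

# Star graph: node 0 (the hub) is connected to the n - 1 other nodes.
# Removing the hub changes the edge count by n - 1 -- the worst-case
# disparity between neighboring graphs that node privacy must hide.
n = 6
star = {0: set(range(1, n))}
star.update({i: {0} for i in range(1, n)})
print(edge_count(star) - edge_count(remove_node(star, 0)))  # 5 = n - 1
```

Since the sensitivity of even a simple edge-count query grows with the graph size under node privacy, the noise required to hide it grows as well.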
To preserve edge privacy in a decentralized query, every pair of graphs G1 and G2
with (V1, E1) and (V2, E2) as their sets of vertices and edges, respectively, should
satisfy differential privacy, and both G1 and G2 must also fulfill the following
property: V1 = V2 and |(E1 ∪ E2) \ (E1 ∩ E2)| = 1.
To achieve edge privacy, a neighboring graph G′ is obtained by removing or adding
one edge of a social network graph G. This can be generalized to changing up to k
edges. Edge privacy prevents an attacker from learning about specific user
relationships, and in particular from identifying with high probability whether two
individuals are friends. It also hides whether a single node has k friendships with
various nodes of the graph. In comparison to node privacy, this type of privacy can
only protect information about user relationships [46]. Even though the associations
between nodes are secured, nodes with higher degrees still have a greater impact on
query results. Nevertheless, this is sufficient for many applications and allows for
the privatization of more types of queries than the severely restricted node
privacy. For example, edge privacy has been used by several researchers to preserve
email relationships [47].
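Under edge privacy, by contrast, neighboring graphs differ in a single edge, so a simple query such as the edge count has sensitivity 1 and needs only a small, size-independent amount of noise. A sketch (the graph and ε are invented):

```python
import random

def laplace_noise(scale):
    """Laplace(0, scale) as the difference of two exponentials."""
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def remove_edge(edges, e):
    """Edge-privacy neighbor of G: same vertex set, one edge removed."""
    return edges - {e}

def private_edge_count(edges, epsilon):
    """Neighboring graphs differ in one edge, so the edge-count query has
    sensitivity 1 and Laplace(1/epsilon) noise suffices."""
    return len(edges) + laplace_noise(1.0 / epsilon)

# A toy graph given as a set of undirected edges.
E = {(1, 2), (2, 3), (3, 4), (1, 4), (2, 4)}
E_neighbor = remove_edge(E, (2, 4))   # differs from E in exactly one edge
noisy = private_edge_count(E, epsilon=1.0)
```

Compare this with the node-privacy case, where the same query's sensitivity equals the maximum degree and the noise must be scaled up accordingly.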
To preserve out-link privacy in a decentralized query, every pair of graphs G1 and
G2 with (V1, E1) and (V2, E2) as their sets of vertices and edges, respectively,
should satisfy differential privacy, and both G1 and G2 must also fulfill the
following property: V1 = V2 and a node y exists such that
As the number of people using social networking sites continues to rise at a rapid
rate, privacy and security concerns are becoming increasingly prevalent. Users'
private attributes can be deduced from their public activity on social media, even
if they do not intend to reveal them. This type of attack is called private
attribute inference. Its goal is to uncover a concealed value of an attribute that
the user or service provider has purposely hidden. In this type of attack, an
attacker tries to predict the missing or incomplete values of attributes using
publicly revealed attribute information from the social network. Any party (e.g.,
a malicious user, an OSN provider, an endorser, a data broker, or a monitoring
agency) with an interest in users' confidential data could be the attacker.
The attacker merely has to obtain publicly available information from OSNs to carry
out such privacy attacks. Aside from privacy issues, the inferred user attributes
can be exploited (by the attacker or anyone who acquires the inferred information
from the attacker) for a variety of security-sensitive activities, such as phishing
[59, 60] and attempting to compromise personal-information-based backup
authentication [61]. Furthermore, an attacker can utilize the inferred attribute
information to link online users across numerous sites [62–65] or with offline
information (such as publicly accessible voter registration records) [66, 67],
resulting in even greater security and privacy problems.
Friend-based and behavior-based attribute inference attacks are the two types of
attacks now in use [50, 68–70]. Friend-based attacks are predicated on the premise
that you are whom you know: they aim to infer attributes of a user from features
extracted from the user's friends and the social structure among them. Homophily is
the cornerstone of friend-based attacks, meaning that two linked users have
comparable characteristics. For example, if more than 50% of a user's friends major
in IT engineering at a particular university, the user is likely to major in IT
engineering at that university as well. Behavior-based attacks infer attributes of
a user based on the public traits of similar users, where behavioral data is used
to identify the similarities between users [71–73]. These attacks are based on the
concept that you are what you do: users with similar traits have comparable
interests, characteristics, and cultures, resulting in similar actions. For
example, if a user liked music tracks, apps, and books on Google Play that were
comparable to those liked by Indian users, the individual might belong to India.
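The homophily rule in the example (more than 50% of friends sharing a major) can be sketched as a majority-vote inference. The threshold and attribute values are illustrative assumptions:

```python
from collections import Counter

def infer_attribute(friends_attrs, threshold=0.5):
    """Friend-based (homophily) inference: if more than `threshold` of a
    user's friends with a known value share the same attribute value,
    predict that value for the user; otherwise abstain (None)."""
    known = [a for a in friends_attrs if a is not None]
    if not known:
        return None
    value, count = Counter(known).most_common(1)[0]
    return value if count / len(known) > threshold else None

# 4 of the 5 friends with a public major list "IT engineering":
friends = ["IT engineering", "IT engineering", "IT engineering",
           "IT engineering", "physics", None]
print(infer_attribute(friends))  # IT engineering
```

A behavior-based variant would apply the same vote over the attributes of behaviorally similar users rather than friends.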
In user de-anonymization, an anonymized graph and a reference graph with legitimate
user identities are taken as inputs, and the nodes of the two are mapped so that
the users' identities can be recovered in the anonymized graph [18, 19]. Different
anonymization methods, like clustering, pseudonyms, graph modification, and
generalization, are used to conceal personally identifiable information, after
which a service provider usually releases an anonymized social network graph to
various parties, such as researchers, application developers, advertisers, and
government agencies [53–57]. A reference graph can readily be created using
information collected from various sources, for example, a distinct social network
whose participants overlap with a publicly available social graph. In comparison to
an anonymized social network graph, a reference graph typically has fewer node
properties [58].
Backstrom et al. [74] describe several types of active attacks on the edge privacy
of anonymized social networks. These active attacks presume that the attacker can
modify the network before it is released. A malicious user selects a random set of
users whose private information he intends to breach, and creates a small number of
new user accounts with edges to these targeted users. Then a pattern of links is
established among the new accounts, with the aim of making it recognizable in the
anonymized graph structure. Both attacks rely on the creation of O(log N) new
"sybil" nodes (where N is the number of nodes), whose outgoing edges help
reidentify quadratically many existing nodes.
The de-anonymization attacks are challenging to conduct on a large scale for the
following reasons. To begin with, they are limited to OSNs; constructing thousands
of phony nodes in a phone-call or real-world network is either too expensive or
impractical. Even with OSNs, many operators (e.g., Facebook) check the authenticity
of email addresses and use other ways to verify the accuracy of supplied data, so
generating thousands of dummy nodes is a challenge.
Subsequently, the adversary has minimal rheostat over the edges that flow into
the nodes he/she builds. A subgraph that does not have any incoming edges but
have many outgoing edges will stick out as most authorized users will have no in-
ducement to associate back with the sybil nodes. This could help the network oper-
ator figure out if the network has been hacked by a sybil attack. Other strategies for
Differential privacy: a solution to privacy issue in social networks 17
detecting sybil attacks in social networks exist [75], such as the spammer detection methods implemented by OSNs with unidirectional edges [76].
Another constraint of active attacks is that typical social networks require a mutual relationship before any information is made available. Assuming that actual users do not link back to dummy users, the network does not expose links from false nodes to real ones.
We believe that large-scale active attacks that necessitate the establishment of
tens of thousands of sybil nodes are implausible. Active attacks can nevertheless be
effective for discovering or manufacturing a small number of “seeds” that can be
used to launch large-scale, passive privacy breaches.
Another type of de-anonymization attack is the passive attack, in which a small group of users exploits their knowledge of the network topology surrounding them to figure out where they are in the anonymized graph [76]. This attack is plausible, but it only works on a small scale: the colluding users can only violate the privacy of some of their friends.
A graph's degree distribution is a histogram that partitions the nodes of the graph according to their degree; it is frequently used to characterize the primary structure of social networks, for building graph models and comparing graphs. It reflects graph structure statistics and may affect the entire graph analysis process.
Although degree distributions are represented as histograms, they have high sensitivity under node privacy, since a single node affects numerous counts in the distribution. If a node is deleted from the graph, the degree of every connected node is reduced. A careful analysis shows that a node of degree d affects no more than 2d + 1 values in the histogram. In the worst case, adding or removing a node of maximal degree can modify up to 2n + 1 values, indicating that the global
18 Preeti Malik et al.
sensitivity depends on the number of nodes n in the graph. The degree histogram query is therefore impractical to protect with differential privacy under node privacy, because n is unbounded.
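The bound above can be checked on a toy graph. The following sketch (illustrative only; the function names are our own, not from the chapter) counts how many histogram bins change when a single node is removed:

```python
from collections import Counter

def degree_histogram(adj):
    """Map degree -> number of nodes with that degree."""
    return Counter(len(neighbors) for neighbors in adj.values())

def remove_node(adj, v):
    """Copy of the adjacency dict with node v (and its edges) deleted."""
    return {u: nbrs - {v} for u, nbrs in adj.items() if u != v}

# Star graph: hub 0 connected to nodes 1..4, so d(0) = 4.
adj = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
before = degree_histogram(adj)
after = degree_histogram(remove_node(adj, 0))

# Bins whose counts differ; a node of degree d touches at most 2d + 1 bins.
changed = sorted(d for d in set(before) | set(after) if before[d] != after[d])
print(changed)  # [0, 1, 4] -- well within the 2*4 + 1 bound
```

Removing the hub changes three bins here (degrees 0, 1, and 4), which is well under the 2d + 1 = 9 worst case.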
It is possible to safeguard degree histogram queries with differential privacy under edge privacy. Deleting one edge from the graph modifies the degree of two nodes and thus affects at most four counts. The sensitivity under k-edge privacy is therefore 4k. For a suitably large graph, this amounts to a negligible quantity of noise, preserving data utility.
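As an illustration of this calibration (a sketch with assumed parameter names, not code from the chapter), the Laplace mechanism for a degree histogram under k-edge privacy uses noise scale 4k/ε:

```python
import math
import random

def laplace_noise(scale):
    """Draw one Laplace(0, scale) sample via inverse-CDF sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def noisy_degree_histogram(degrees, epsilon, k=1):
    """Degree histogram with Laplace noise calibrated to k-edge privacy.

    Removing k edges changes at most 4k counts by 1 each, so the
    global sensitivity is 4k and the noise scale is 4k / epsilon."""
    hist = [0] * (max(degrees) + 1)
    for d in degrees:
        hist[d] += 1
    scale = 4 * k / epsilon
    return [h + laplace_noise(scale) for h in hist]

random.seed(0)
print(noisy_degree_histogram([1, 1, 2, 2, 2, 3], epsilon=1.0))
```

With larger ε (weaker privacy) the noisy counts approach the true histogram [0, 2, 3, 1]; with small ε they are heavily perturbed.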
For a degree histogram query, out-link privacy necessitates even less noise. When only out-degrees are considered, eliminating one node's out-links from a graph changes a single value in the histogram [48]. Under this privacy criterion, a node with a large degree may still leave traces of its presence in the dataset through its friends' out-degrees. There are, however, various possible explanations for nodes having higher-than-expected degree: the excess could indicate new connections among the nodes, or friendships with people who were not survey participants. To exploit this vulnerability to infer the presence of a high-degree node with any accuracy, an attacker would need near-complete knowledge of the real social network [77].
Consider an input graph G and a query graph H; a subgraph counting query returns the number of subgraphs of G that are isomorphic to H. Common examples of subgraphs are triangles, k-stars, k-triangles, and k-cliques. A k-star consists of a center node connected to k other nodes, a k-triangle consists of k triangles that share one common edge, and a clique on k vertices is called a k-clique.
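These subgraph patterns can be counted directly on small graphs. A brute-force sketch (illustrative only, under our own naming) for triangles and k-stars:

```python
import math
from itertools import combinations

def triangle_count(adj):
    """Count unordered node triples that are mutually adjacent."""
    return sum(1 for a, b, c in combinations(sorted(adj), 3)
               if b in adj[a] and c in adj[a] and c in adj[b])

def k_star_count(adj, k):
    """Each node of degree d is the center of C(d, k) k-stars."""
    return sum(math.comb(len(nbrs), k) for nbrs in adj.values())

# Triangle 0-1-2 plus a pendant node 3 attached to 0.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
print(triangle_count(adj))   # 1
print(k_star_count(adj, 2))  # 3 + 1 + 1 + 0 = 5
```

The high sensitivity discussed below is visible even here: deleting node 0 removes the triangle and most of the 2-stars at once.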
Subgraph counting queries involve varying levels of privacy and high global sensitivities. To attain differential privacy, a considerable quantity of noise must be added, which may seriously distort query results. As a result, the noise magnitude is usually determined by a smooth upper bound on the local sensitivity. Additionally, in the literature [78–80], truncation, ladder functions, and Lipschitz extension have been used to establish differential privacy while improving the accuracy of the counts. In this section, we look at triangle, k-star, and k-triangle counting problems.
The number of triangles in the projected graph can be a good estimate of the true query response. Indeed, networks with only a few high-degree nodes can use this method to achieve node privacy for triangle counting.
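A simple truncation-based projection (a sketch; the chapter does not specify the exact procedure) drops every node whose degree exceeds a threshold D, so that any single node can influence only a bounded number of triangles:

```python
def truncate(adj, D):
    """Keep only nodes of degree <= D, removing edges to dropped nodes."""
    keep = {v for v, nbrs in adj.items() if len(nbrs) <= D}
    return {v: adj[v] & keep for v in keep}

# One high-degree hub (node 0) and a low-degree remainder.
adj = {0: {1, 2, 3, 4}, 1: {0, 2}, 2: {0, 1}, 3: {0}, 4: {0}}
projected = truncate(adj, D=2)
print(max(len(nbrs) for nbrs in projected.values()))  # 1, bounded by D
```

After truncation, the sensitivity of a triangle count can be bounded in terms of D rather than the graph size, at the cost of losing the triangles incident to dropped hubs.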
Zhang et al. [80] presented a method that uses degree ordering for edge removal. It addresses the problem that the node degree distribution has high sensitivity under node differential privacy. Under node differential privacy, two histogram techniques for the degree distribution are provided: the SER-cumulative histogram and the SER histogram. For the privacy of social network graphs, two types of uncertain-graph privacy protection algorithms are proposed by Wu et al. [84]. For privacy protection using the uncertain graph technique, the deterministic graph is converted into a probability graph. In social network scenarios where privacy protection is the utmost priority, the uncertain edge probability assignment algorithm is suitable; however, its data availability must be improved. There are two histograms of triangle counting distributions under node differential privacy: the cumulative distribution histogram and the triangle counting distribution histogram. Although the projection method minimizes query sensitivity, the data processing results in a significant loss of existing graph information, and data availability is remarkably low.
Triangle counting protection for nodes has been explored, but triangle counting protection for edges remains unstudied. Furthermore, the projection algorithm for node triangle counting results in a significant loss of information about the existing graph and limited data availability. Improving the availability of published data while ensuring differential privacy protection is a significant challenge.
A k-triangle consists of k triangles that all share the same edge. Its count is denoted by f_{kΔ}(G), where G is the input graph. When triangle counting is extended to k-triangle counting, the problem becomes more difficult, because computing the smooth sensitivity of k-triangle counting is NP-hard. As a result, current approaches primarily focus on small values of k, though even then the counting query for f_{kΔ} is difficult.
The main idea of [78] is to achieve (ε, δ)-differential privacy (edge privacy) by adding noise proportional to a second-order local sensitivity instead of a "smooth" upper bound. LS_{kΔ} denotes the local sensitivity, which cannot be used directly with the Laplace mechanism. It was demonstrated that LS' is a deterministic function with global sensitivity equal to 1, which permits publishing the query with less noise. Zhang et al. [80] presented another approach based on a so-called ladder function, which is used for counting k-triangles under edge privacy.
Edge weights in social networks may reflect communication frequency, the cost of doing business, the closeness of a relationship, and other factors associated with sensitive information. An intelligence network is a common example, in which edge weights represent the frequency with which two institutions communicate; excessive communication may indicate a problem. A commercial trade network is another example, where edge weights represent the price of a transaction between two businesses.
Liu et al. [85] investigated the problem of preserving the utility of shortest-path statistics among nodes while protecting the privacy of edge weights. They proposed two approaches for preserving edge privacy: Gaussian randomization multiplication and greedy perturbation. Greedy perturbation is concerned with preserving the lengths of the perturbed shortest paths, whereas Gaussian randomization is concerned with retaining the same shortest paths before and after perturbation.
Another algorithm, edge weight anonymization for social network analysis, was proposed by Das et al. [32]. They created a linear programming model to safeguard graph properties such as k-nearest neighbors, shortest paths, and minimum spanning trees, which can be formulated as linear functions of edge weights. Costea et al. [86] used Dijkstra's algorithm to evaluate shortest paths when assessing protection quality. Under the assumption that the graph structure is publicly accessible, they postulated differential privacy algorithms that protect the weights of edges: users can access the graph structure unchanged, but the edge weights are kept private. To enhance the utility and accuracy of the published data, Li et al. [1] proposed an algorithm that adds Laplace noise to each edge weight.
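A minimal sketch of this idea (illustrative; the published algorithms involve more than plain noise addition) adds Laplace noise to each undirected edge weight once, then runs Dijkstra on the perturbed graph:

```python
import heapq
import math
import random

def laplace_noise(scale):
    """One Laplace(0, scale) draw via inverse-CDF sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def perturb(graph, epsilon):
    """Add Laplace(1/epsilon) noise to each undirected edge weight once,
    clamping at a small positive floor so weights stay valid."""
    noise, out = {}, {}
    for v, nbrs in graph.items():
        out[v] = {}
        for u, w in nbrs.items():
            key = frozenset((u, v))
            if key not in noise:
                noise[key] = laplace_noise(1.0 / epsilon)
            out[v][u] = max(1e-3, w + noise[key])
    return out

def dijkstra(graph, src):
    """Shortest-path distances from src over positive edge weights."""
    dist = {src: 0.0}
    pq = [(0.0, src)]
    while pq:
        d, v = heapq.heappop(pq)
        if d > dist.get(v, math.inf):
            continue
        for u, w in graph[v].items():
            nd = d + w
            if nd < dist.get(u, math.inf):
                dist[u] = nd
                heapq.heappush(pq, (nd, u))
    return dist

random.seed(1)
g = {"a": {"b": 1.0, "c": 4.0},
     "b": {"a": 1.0, "c": 1.0},
     "c": {"a": 4.0, "b": 1.0}}
# With a large epsilon the noise is tiny and the path a->b->c survives.
print(dijkstra(perturb(g, epsilon=100.0), "a"))
```

The trade-off described in the text is visible here: a smaller ε adds more noise to every weight, which protects individual weights but distorts shortest-path lengths and can even change which path is shortest.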
The majority of differentially private algorithms currently in use must make a significant trade-off in utility in order to preserve privacy when analyzing large and multifaceted graph structures. Indeed, several of those techniques strive to improve utility as their primary contribution. Furthermore, the intricacy of computing (smooth) sensitivities increases the complexity of differentially private algorithms and can even be NP-hard, as it is for k-triangle counting queries.
8 Summary
Since social networks are growing rapidly these days, privacy breaches in social networks are a major concern. This chapter discussed a solution to the privacy issues of social networks, namely differential privacy. Identity disclosure, attribute disclosure, and link disclosure are the major types of disclosure that occur in social network privacy breaches; solutions to these concerns are also explained in the chapter. The application of differential privacy to social network analysis is also covered.
References
[1] Xiaoye, L., Yang, J., Sun, Z., & Zhang, J. (2017). Differential privacy for edge weights in social
networks. Security and Communication Network, 2017, 1–10. doi:https://doi.org/10.1155/
2017/4267921
[2] Hsu, T.-S., Liau, C.-J., & Wang, D.-W. (2014). A logical framework for privacy-preserving social
network publication. Journal of Applied Logic, 12(2), 151–174.
[3] Kulkarni, A. R., & Yogish, H. K. (2014). Advanced unsupervised anonymization technique in
social networks for privacy preservation. International Journal of Science and Research, 3(4),
118–125.
[4] Tripathy, B. K., Sishodia, M. S., Jain, S., & Mitra, A. (2014). Privacy and anonymization in
social networks. Intelligent Systems Reference Library, 65, 243–270.
[5] Jiang, H., Pei, J., Yu, D., Yu, J., Gong, B., & Cheng, X. Applications of Differential Privacy in
Social Network Analysis: A Survey. IEEE Transactions on Knowledge and Data Engineering,
pre-print available at: https://www.computer.org/csdl/journal/tk/5555/01/09403974/
1sLH8K2Abp6.
[6] Dalenius, T. (1977). Towards a methodology for statistical disclosure control. Statistisk Tidskrift, 15, 429–444.
[7] Sweeney, L. (2002). k-Anonymity: A model for protecting privacy. International Journal of
Uncertainty, Fuzziness and Knowledge-Based Systems, 10(05), 557–570.
[8] Machanavajjhala, A., Kifer, D., Gehrke, J., & Venkitasubramaniam, M. (2007). L-diversity:
Privacy beyond k-anonymity. ACM Transactions on Knowledge Discovery from Data (TKDD),
1(1), 3–es.
[9] Li, N., Li, T., & Venkatasubramanian, S. (2007). t-Closeness: Privacy beyond k-anonymity and
l-diversity. In 2007 IEEE 23rd International Conference on Data Engineering (pp. 106–115).
IEEE.
[10] Wong, R. C.-W., Li, J., Fu, A. W.-C., & Wang, K. (2006). (α, k)-Anonymity: An enhanced k-
anonymity model for privacy preserving data publishing. In Proceedings of the 12th ACM
SIGKDD international conference on Knowledge discovery and data mining (pp. 754–759).
[11] Zheleva, E., & Getoor, L. (2011). Privacy in Social Networks: A Survey. In Aggarwal, C. eds.
Social Network Data Analytics. Springer, Boston, MA. https://doi.org/10.1007/978-1-4419-
8462-3_10.
[12] Backstrom, L., Dwork, C., & Kleinberg, J. (2007). Wherefore art thou r3579x: Anonymized
social networks, hidden patterns, and struct. steganography. In Proceedings of International
World Wide Web Conference. pp. 1–10.
[13] Campan, A., & Truta, T. M. (2009). A Clustering Approach for Data and Structural Anonymity. In
Proceedings of the 2nd ACM SIGKDD International Workshop on Privacy, Security, and Trust
in KDD (PinKDD’08), in Conjunction with KDD’08, Las Vegas, Nevada, USA, pp 33–54.
[14] Hay, M., Miklau, G., Jensen, D., & Towsley, D. (August 2008). Resisting structural identification in anonymized social networks. In Proceedings of the VLDB Endowment, 1(1), 102–114.
[15] Hay, M., Miklau, G., Jensen, D., Weis, P., & Srivastava, S. Anonymizing social networks.
Technical report, University of Massachusetts, Amherst, March 2007.
[16] Korolova, A., Kenthapadi, K., Mishra, N., & Ntoulas, A. (2009). Releasing search queries and
clicks privately. In International World Wide Web Conference Committee (IW3C2), pp 171–180.
[17] Liu, K., & Terzi, E. (2008). Towards identity anonymization on graphs. In Proceedings of the
2008 ACM SIGMOD international conference on Management of data, Pages 93–106.
[18] Narayanan, A., & Shmatikov, V. (2009). De-anonymizing social networks. In 30th IEEE
Symposium on Security and Privacy, 2009, pp. 173–187.
[19] Wondracek, G., Holz, T., Kirda, E., & Kruegel, C. (2010). A practical attack to de-anonymize
social network users. In IEEE Symposium on Security and Privacy, pp. 223–238.
[20] Ying, X., & Wu, X. (2008). Randomizing social networks: A spectrum preserving approach. In
Proceedings of the SIAM International Conference on Data Mining, pp.739–750.
[21] Zhou, B., & Pei, J. (2008). Preserving privacy in social networks against neighbourhood attacks.
In Proceedings of IEEE 24th International Conference on Data Engineering, pp. 506–515.
[22] Zou, L., Chen, L., & Ozsu, M. T. (2008). K-Automorphism: A general framework for privacy
preserving network publication. In Proceedings of the VLDB Endowment. Vol. 2 No.1, pp 946–957.
[23] Sweeney, L. (2002). Achieving k-anonymity privacy protection using generalization and
suppression. International Journal of Uncertainty, 10(5), 571–588.
[24] Narayanan, A., & Shmatikov, V. (2008). Robust de-anonymization of large sparse datasets.
Security and Privacy (pp. 111–125).
[25] Lindamood, J., Heatherly, R., Kantarcioglu, M., & Thuraisingham, B. (2009). Inferring private
information using social network data. In Proceedings of the 18th international conference on
World wide web, Pages 1145–1146.
[26] Zheleva, E., & Getoor, L. (2009). To join or not to join: The illusion of privacy in social
networks with mixed public and private user profiles. In Proceedings of the 18th international
conference on World wide web, Pages 531–540.
[27] Sihag, V. K. (2012). A clustering approach for structural k-anonymity in social networks using
genetic algorithm. In Proceedings of the CUBE International Information Technology
Conference. pp. 701–706.
[28] Backstrom, L., Dwork, C., & Kleinberg, J. (2011). Wherefore art thou r3579x: Anonymized
social networks, hidden patterns, and struct. steganography. In Communications of the ACM,
Vol. 54, Issue 12, pp 133–141.
[29] Bhagat, S., Cormode, G., Krishnamurthy, B., & Srivastava, D. (2009). Class-based graph
anonymization for social network data. In Proceedings of the VLDB Endowment, Volume 2
Issue 1, pp 766–777.
[30] Korolova, A., Motwani, R., Nabar, S. U., & Xu, Y. (2008). Link privacy in social networks. In
Proceedings of the 17th ACM conference on Information and knowledge management, Pages
289–298.
[31] Zheleva, E., & Getoor, L. (2007). Preserving the privacy of sensitive relationships in graph
data. PinKDD (pp. 153–171).
[32] Das, S., Egecioglu, E., & Abbadi, A. E. (2010). Anonymizing weighted social network graphs.
In IEEE 26th International Conference on Data Engineering (ICDE 2010), pp. 904–907.
[33] Liu, L., Wang, J., Liu, J., & Zhang, J. (2009). Privacy preservation in social networks with
sensitive edge weights. In Proceedings of the SIAM International Conference on Data Mining,
pp. 954–965.
[34] Barbaro, M., & Zeller, T. (2006, August). A face is exposed for AOL searcher no. 4417749. New York Times.
[35] Cortes, C., Pregibon, D., & Volinsky, C. (2002). Communities of interest. In Intelligent Data
Analysis. vol. 6, no. 3, pp. 211–219.
[36] Choffnes, D. R., Duch, J., Malmgren, D., Guimera, R., Bustamante, F. E., & Amaral,
L. (2009 Jun). Swarmscreen: Privacy through plausible deniability in p2p systems tech.
Technical Report NWU-EECS-09-04, Department of EECS, Northwestern University.
[37] Cormode, G., Srivastava, D., Yu, T., & Zhang, Q. (2008). Anonymizing bipartite graph data
using safe groupings. In Proceedings of the VLDB Endowment, Volume 1 Issue 1, pp 833–844.
[38] Barth-Jones, D. C. (2012 Jul). The ’Re-Identification’ of Governor William Weld’s Medical
Information: A Critical Re-Examination of Health Data Identification Risks and Privacy
Protections, Then and Now. Tech. rep. Columbia University – Mailman School of Public
Health, Department of Epidemiology.
[39] Samarati, P., & Sweeney, L. (1998). Protecting Privacy when Disclosing Information:
K-Anonymity and Its Enforcement through Generalization and Suppression. Tech. rep.
[40] Machanavajjhala, A., et al (2006). l-Diversity: Privacy Beyond k-Anonymity. In Proceedings of
the 22nd International Conference on Data Engineering, ICDE 2006 3–8 April 2006 (p. 24).
Atlanta, GA, USA.
[41] Li, N., Li, T., & Venkatasubramanian, S. (2007). t-Closeness: Privacy Beyond k-Anonymity and
l-Diversity. In: 2007 IEEE 23rd International Conference on Data Engineering (pp. 106–115).
[42] Dwork, C. (2008). Differential privacy: A survey of results. In Agrawal, M., Du, D., Duan, Z.,
& Li, A. Eds. Theory and applications of models of computation, ser. lecture notes in computer
science (pp. 1–19). Springer, Berlin/ Heidelberg.
[43] Kifer, D., & Machanavajjhala, A. (2011). No free lunch in data privacy. In Proceedings of the
2011 ACM SIGMOD International Conference on Management of data (pp. 193–204).
[44] Dwork, C., McSherry, F., Nissim, K., & Smith, A. (2006). Calibrating noise to sensitivity in
private data analysis. In Theory of cryptography conference (pp. 265–284). Springer.
[45] Hay, M., Li, C., Miklau, G., & Jensen, D. (2009). Accurate estimation of the degree distribution
of private networks,. In 2009 Ninth IEEE International Conference on Data Mining
(pp. 169–178). IEEE.
[46] Task, C., & Clifton, C. (2012). A guide to differential privacy theory in social network analysis.
In 2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and
Mining (pp. 411–417). IEEE.
[47] Kossinets, G., & Watts, D. J. (2006). Empirical analysis of an evolving social network. science,
311(5757), 88–90.
[48] Task, C., & Clifton, C. (2014). What should we protect? defining differential privacy for social
network analysis. In State of the art applications of social network analysis (pp. 139–161).
Springer.
[49] Abdulhamid, S. M., Ahmad, S., Waziri, V. O., & Jibril, F. N. (2014). Privacy and national
security issues in social networks: The challenges. arXiv preprint arXiv:1402.3301.
[50] Dey, R., Tang, C., Ross, K., & Saxena, N. (2012) Estimating age privacy leakage in online
social networks. In 2012 proceedings IEEE infocom (pp. 2836–2840). IEEE.
[51] Kosinski, M., Stillwell, D., & Graepel, T. (2013). Private traits and attributes are predictable
from digital records of human behavior. Proceedings of the National Academy of Sciences,
110(15), 5802–5805.
[52] Gong, N. Z., Talwalkar, A., Mackey, L., Huang, L., Shin, E. C. R., Stefanov, E., Shi, E., & Song,
D. (2014). Joint link prediction and attribute inference using a social-attribute network. ACM
Transactions on Intelligent Systems and Technology (TIST), 5(2), 1–20.
[53] Ji, S., Li, W., Gong, N. Z., Mittal, P., & Beyah, R. A. (2015). On your social network de-
anonymizablity: Quantification and large scale evaluation with seed knowledge. In NDSS.
[54] Ji, S., Li, W., Srivatsa, M., & Beyah, R. (2014). Structural data deanonymization:
Quantification, practice, and implications. In Proceedings of the 2014 ACM SIGSAC
Conference on Computer and Communications Security (pp. 1040–1053). ACM.
[55] Qian, J., Li, X.-Y., Zhang, C., & Chen, L. (2016). De-anonymizing social networks and inferring
private attributes using knowledge graphs. In IEEE INFOCOM 2016-The 35th Annual IEEE
International Conference on Computer Communications (pp. 1–9). IEEE.
[56] Ji, S., Wang, T., Chen, J., Li, W., Mittal, P., & Beyah, R. (2017). De-sag: On the de-
anonymization of structure-attribute graph data. IEEE Transactions on Dependable and
Secure Computing 16(4), pp. 594–607.
[57] Shirani, F., Garg, S., & Erkip, E. (2018). Optimal active social network de-anonymization using
information thresholds. In 2018 IEEE International Symposium on Information Theory (ISIT)
(pp. 1445–1449). IEEE.
[58] Shao, Y., Liu, J., Shi, S., Zhang, Y., & Cui, B. (2019). Fast deanonymization of social networks
with structural information. Data Science and Engineering, 4(1), 76–92.
[59] Jakobsson, M. (2005). Modeling and preventing phishing attacks. In Financial Cryptography
and Data Security. FC 2005 Patrick, A. S., & Yung, M. Eds. Lecture Notes in Computer
Science vol. 3570. Springer, Berlin, Heidelberg.
[60] Spear Phishing Attacks. 2017. Retrieved from http://www.microsoft.com/protect/yourself/phishing/spear.mspx.
[61] Gupta, P., Gottipati, S., Jiang, J., & Gao, D. (2013). Your love is public now: Questioning the
use of personal information in authentication. In Proceedings of the 8th ACM SIGSAC
symposium on Information, computer and communications security, pp. 49–60.
[62] Afroz, S., Caliskan-Islam, A., Stolerman, A., Greenstadt, R., & McCoy, D. (2014).
Doppelganger finder: Taking stylometry to the underground. In IEEE Symposium on Security
and Privacy (pp. 212–226). San Jose, CA.
[63] Bartunov, S., Korshunov, A., Park, S.-T., Ryu, W., & Lee, H. (2012). Joint link-attribute user
identity resolution in online social networks. In Proceedings of the 6th International
Conference on Knowledge Discovery and Data Mining, Workshop on Social Network Mining
and Analysis. ACM pp. 1–9.
[64] Goga, O., Lei, H., Parthasarathi, S. H. K., Friedland, G., Sommer, R., & Teixeira, R. (2013).
Exploiting innocuous activity for correlating users across sites. In Proceedings of the 22nd
international conference on World Wide Web, Pages 447–458.
[65] Goga, O., Perito, D., Lei, H., Teixeira, R., & Sommer, R. (2013). Large-scale Correlation of
Accounts Across Social Networks. Technical report. International Computer Science Institute.
Technical Report TR-13-002, Berkeley, California.
[66] Minkus, T., Ding, Y., Dey, R., & Ross, K. W. (2015). The city privacy attack: Combining social
media and public records for detailed profiles of adults and children. In Proceedings of the
2015 ACM on Conference on Online Social Networks, Pages 71–81.
[67] Sweeney, L. (2002). k-Anonymity: A model for protecting privacy. International Journal of
Uncertainty, Fuzziness and Knowledge-Based Systems, 10 5(2002), 557–570.
[68] Kosinski, M., Stillwell, D., & Graepel, T. (2013). Private traits and attributes are predictable
from digital records of human behavior. Proceedings of the National Academy of Sciences,
110 15(2013), 5802–5805.
[69] Gong, N. Z., Talwalkar, A., Mackey, L., Huang, L., Shin, E. C. R., Stefanov, E., Shi, E., & Song,
D. (2014). Joint link prediction and attribute inference using a social-attribute network. ACM
Transactions on Intelligent Systems and Technology (TIST), 5(2), 1–20.
[70] Labitzke, S., Werling, F., Mittag, J., & Hartenstein, H. (2013). Do online social network friends
still threaten my privacy? In Proceedings of the third ACM conference on Data and application
security and privacy (pp. 13–24).
[71] Weinsberg, U., Bhagat, S., Ioannidis, S., & Taft, N. (2012). Blurme: Inferring and obfuscating
user gender based on ratings. In Proceedings of the sixth ACM conference on Recommender
systems (pp. 195–202).
[72] Chaabane, A., Acs, G., Kaafar, M. A. et al. (2012). You are what you like! information leakage
through users’ interests. In Proceedings of the 19th Annual Network & Distributed System
Security Symposium (NDSS). Citeseer.
[73] Kosinski, M., Stillwell, D., & Graepel, T. (2013). Private traits and attributes are predictable
from digital records of human behavior. Proceedings of the National Academy of Sciences,
110(15), 5802–5805.
[74] Backstrom, L., Dwork, C., & Kleinberg, J. (2007). Wherefore art thou R3579X? Anonymized
social networks, hidden patterns, and structural steganography. 16th International World
Wide Web Conference. Banff, Alberta, Canada.
[75] Yu, H., Gibbons, P., Kaminsky, M., & Xiao, F. (2008). SybilLimit: A near-optimal social network defense against sybil attacks. In IEEE Symposium on Security and Privacy (pp. 3–17).
[76] Schonfeld, E. (2008). Techcrunch: Twitter starts blacklisting spammers. http://www.techcrunch.com/2008/05/07/twitter-starts-blacklisting-spammers/.
[77] Raskhodnikova, S., & Smith, A. (2015). Efficient Lipschitz extensions for high dimensional
graph statistics and node private degree distributions. arXiv preprint arXiv:1504.07912.
[78] Karwa, V., Raskhodnikova, S., Smith, A., & Yaroslavtsev, G. (2011). Private analysis of graph
structure. Proceedings of the VLDB Endowment, 4(11), 1146–1157.
[79] Kasiviswanathan, S. P., Nissim, K., Raskhodnikova, S., & Smith, A. (2013). Analyzing graphs
with node differential privacy. In Theory of Cryptography Conference (pp. 457–476). Springer.
[80] Zhang, J., Cormode, G., Procopiuc, C. M., Srivastava, D., & Xiao, X. (2015). Private release of
graph statistics using ladder functions. In Proceedings of the 2015 ACM SIGMOD
international conference on management of data (pp. 731–745).
[81] Nissim, K., Raskhodnikova, S., & Smith, A. (2007). Smooth sensitivity and sampling in private
data analysis. In Proceedings of the thirty-ninth annual ACM symposium on Theory of
computing (pp. 75–84).
[82] Sun, H., Xiao, X., Khalil, I., Yang, Y., Qin, Z., Wang, H., & Yu, T. (2019). Analyzing subgraph
statistics from extended local views with decentralized differential privacy. In Proceedings
of the 2019 ACM SIGSAC Conference on Computer and Communications Security
(pp. 703–717).
[83] Qin, Z., Yu, T., Yang, Y., Khalil, I., Xiao, X., & Ren, K. (2017). Generating synthetic
decentralized social graphs with local differential privacy. In Proceedings of the 2017 ACM
SIGSAC Conference on Computer and Communications Security (pp. 425–438).
[84] Wu, D., Zhang, B., Jing, T., Tang, Y., & Cheng, X. (2016). Robust compressive data gathering in
wireless sensor networks. IEEE Transactions on Wireless Communications, 12(6), 2754–2761.
[85] Liu, L., Wang, J., Liu, J., & Zhang, J. (2008). Privacy preserving in social networks against
sensitive edge disclosure. Technical report, Technical Report CMIDA-HiPSCCS (pp. 006–08).
[86] Costea, S., Barbu, M., & Rughinis, R. (2013). Qualitative analysis of differential privacy
applied over graph structures. In 2013 11th RoEduNet International Conference (pp. 1–4).
IEEE.
Abdul Rahman, Krishnadas Nanath✶
Cracking Captcha using machine learning
algorithms: an intersection of Captcha
categories and ML algorithms
Abstract: Captcha (Completely Automated Public Turing test to tell Computers and Humans Apart) is a challenge-based response test designed to differentiate between a human and a bot. Captcha came into use with the advent of spambots taking up space while posing as humans. Captcha took the Internet by storm, as it had multiple uses and capabilities, like averting comment spam in blogs, safeguarding website registrations, shielding e-mail addresses from scrapers, preventing dictionary attacks, and counteracting search engine bots. This chapter categorizes text Captcha into various types based on inputs from the literature and visual appearance. It then uses a series of machine learning (ML) algorithms to crack the actual Captcha content using training data. The research investigates the cross section of Captcha type and ML algorithm. A dataset of 1,024 Captcha images was considered for conducting this experiment. The study identifies which ML algorithm is most effective in cracking Captcha across various categories and, in turn, helps identify the loopholes that allow easy identification of Captcha by automated algorithms. This will enable Captcha users to choose the category of Captcha that is least vulnerable to ML-based cracking.
1 Introduction
Nowadays, machine learning (ML) is used in several domains, like image recognition, speech recognition, medical diagnosis, and learning associations. The capabilities of ML are vast and versatile, which makes it a double-edged sword: black-hat hackers and people with malicious intent can use it in harmful ways. One common misuse of ML is to crack the security mechanisms that protect consumer data in this digital world.
Captcha (Completely Automated Public Turing test to tell Computers and Humans Apart) is a challenge-based response test to distinguish between a human and a bot. Captcha came into use with the advent of spambots occupying space in organizational databases while posing as humans. It overwhelmed the web as it could
✶
Corresponding author: Krishnadas Nanath, Middlesex University, Dubai,
e-mail: [email protected]
Abdul Rahman, Middlesex University, Dubai
https://doi.org/10.1515/9783110766745-002
solve several problems, like averting comment spam in blogs, safeguarding website registrations, protecting e-mail addresses from scrapers, preventing dictionary attacks, and counteracting search engine bots.
With the growth of Captcha and its applications, reCaptcha was introduced. It is an evolved Captcha-based system that combines advanced Turing tests with browser data testing (cookies). The applications of reCaptcha are similar to those of Captcha, and its core purpose is to differentiate between a human and a bot. Captcha faced some criticism, as it was deemed easy to defeat using software developed by agencies. Thus, reCaptcha was introduced with higher security requirements in an attempt to make it more difficult to crack.
With the growth of advanced technologies and storage becoming cheaper, it has become easier for Captcha to be defeated. The time required to break it has fallen over the years due to major improvements in ML capabilities. Artificial intelligence (AI), the reduced cost of the cloud, the reduced cost of hardware, and the outsourcing of ML engines are several factors that have contributed to this time reduction. This chapter is an attempt to understand the power of ML algorithms in breaking down Captcha and to understand which types of Captchas are more prone to breaking down under ML algorithms. Further, it also highlights the loopholes, privacy, and security issues with ReCaptcha. Secondary data consisting of Captcha images are used for this research; the data are under a Creative Commons license that permits reviewing and sharing the information.
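To make the idea of ML-based Captcha cracking concrete, here is a toy illustration (entirely hypothetical; this is not the pipeline or dataset used in this chapter) of classifying a segmented Captcha character from flattened pixel features with a nearest-neighbour vote:

```python
from collections import Counter

def knn_predict(train, sample, k=3):
    """Toy nearest-neighbour classifier over flattened pixel vectors.

    train: list of (pixels, label) pairs; sample: one pixel vector."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(px, sample)), label)
        for px, label in train
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Tiny synthetic 2x2 "glyphs" standing in for segmented Captcha characters.
train = [
    ((1, 0, 0, 1), "A"), ((1, 0, 1, 1), "A"),
    ((0, 1, 1, 0), "B"), ((0, 1, 0, 0), "B"),
]
print(knn_predict(train, (1, 0, 0, 0)))  # A
```

Real attacks replace the 2x2 glyphs with segmented character images and the nearest-neighbour rule with stronger models, but the training-data-driven workflow is the same.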
2 Literature review
Captcha is a challenge-based response test designed to differentiate between a human and a bot. Captcha came into use with the advent of spambots taking up space and posing as humans. ReCaptcha was later introduced as an evolved Captcha mechanism with better synchronization and advanced features. Captcha and ReCaptcha have faced a lot of criticism over the years, for reasons ranging from privacy to redundancy and other issues.
This section aims to highlight and document the past efforts in the context of
Captcha and ReCaptcha. It also documents how advancement in ML capabilities im-
pacted this domain. The literature review proceeds as follows: the methodology of
the literature review is presented first. It highlights the process of article collection
and its specifics, including types of articles taken into consideration. Further, the
research timeline is discussed, and it aims to showcase how research on Captcha
has evolved over time based on the advancement of related technologies. The core
review is then presented, followed by the identification of gaps. A summary of the
literature review process is presented in Figure 1.
Cracking Captcha using machine learning algorithms 29
[Figure 1: Stages of the literature review — introduction, methodology, timeline, review, and gap of the review]
The articles, books, and journals examined throughout the literature review
were collected from various scholarly databases and search engines, including
Google Scholar, the Institute of Electrical and Electronics Engineers (IEEE) digital
library, and other databases. The collection also includes a few peer-reviewed blogs
and magazines with how-to articles elaborating on Captcha issues for people with
disabilities. Diverse search methods were adopted in order to present a more
unified research output while diminishing any chance of institutional bias toward
the subject. To further reduce bias, all aspects, including the benefits and issues
of Captcha and ReCaptcha, were taken into consideration.
Content analysis of the initial articles yielded keywords that could be used for
further searches. The following keywords were used: Captcha vulnerabilities, Captcha,
benefits of Captcha, usage of ReCaptcha, convolutional neural networks (CNN), Re-
Captcha, neural networks and Captcha, issues with Captcha, and Completely Automated
Public Turing test. Searching with these keywords returned around 67 related articles
and papers. These were filtered by disregarding results that were too basic in nature
(basic introductions and simple case studies). This was followed by removing reports
that were unrelated to the field and turned up only because of keyword similarity.
The final set comprised 54 articles relevant to the literature review. A split of
publication types is provided in Figure 2.
[Figure 2: Article classification — split across journals, magazines, and conference proceedings (33%, 40%, and 27%)]
These articles were then analyzed over the years to understand the importance
given to this field of research. A summary of this timeline is presented in Figure 3. It
can be observed that the trend has been increasing over the years, particularly after
2015. With the growth of ML capabilities and computational power, the issues re-
lated to Captcha cracking became an interesting area of research. Many articles
post-2015 use research and computing techniques to solve the issues arising in this
domain.
[Figure 3: Number of relevant articles published per year (2000–2025)]
Relevant research papers started appearing a few years after the introduction of Captcha,
as its adoption became widespread and the number of users grew (Robinson, 2002a).
Researchers began documenting issues while finding ways to break Captcha using
text-based synthetic analysis techniques and various other methodologies. Some re-
searchers highlighted the issues with Captcha from a community perspective,
which led to a rise in research articles trying to resolve these problems. The wide-
spread use of the Internet and the increasing number of websites (Yahoo, Microsoft,
and others) further added to the scope for research. Captcha was used to
filter out bots [1], but companies and agents found a commercialized way to break
Captcha (a lengthy process), and it became widespread. This was followed by the re-
lease of ReCaptcha, after which the volume of research declined, with students and re-
searchers trying out new approaches without any particular breakthrough.
The year 2016 saw an increase in research on Captcha and Re-
Captcha due to evolving technologies in the AI and ML sector. In 2016, major
cloud ML platforms (Amazon Web Services and Google Cloud) took the Internet by
storm. As a result, people with limited resources were able to access advanced
technologies over the Internet, which in turn led to a rise in research articles.
To review the articles, all the papers were examined using a breakdown
of the core concepts they discussed. This resulted in four thematic categories
of papers in the context of Captcha, ReCaptcha, and ML: theory-driven articles,
articles critical of Captcha, articles supportive of Captcha, and other related
articles.
Paper Description
[] This documented research is about the analysis of various ways machine learning can
be used to perform optical character recognition. One of the techniques discussed was
how an algorithm can be used to break down Captcha.
[] This paper analyzes Captcha and how cracking technology has evolved over the years
to counter normal Captcha. It also introduces a new type of Captcha that uses not just
characters but also numbers to counter Captcha cracking.
[] This research paper analyzes text-based Captcha, and describes the pros and cons of
using a text-based Captcha for both the designers/attackers of text-based Captcha.
[] This research paper highlights the status of Captchas and how their value has changed
over the years. It also describes how the technology impacted the Internet.
[] In this documented research, the given content talks about the analysis conducted on
the real-world deployed image Captcha. It also analyzes the strengths and
weaknesses. The evaluation of security and attacks is also presented.
[] This paper discusses Internet security and the part played by Captcha in keeping it safe.
It argues that Captcha improves the user experience by keeping spambots
away and defending against different types of Internet attacks. The paper also
showcases different types of Captcha used online.
[] This is a generic study on how Captcha is used and the way it impacts the Internet.
Robinson This article traces the history of Captcha and how it evolved from a test of artificial
(b) intelligence into a mechanism protecting the Internet from spambots and scammers.
Table 1 (continued)
Paper Description
[] This article highlights the harms caused by spambots and how Captcha helps with this
issue.
[] Other than the importance of Captcha, this paper also highlights the criticism given to
Captcha and how it might not be useful for all stakeholders.
[] The authors examine different types of Captchas to explore whether the use of different
colors negatively affects the usability or security of the Captchas.
Paper Description
[] This paper analyzes Captcha and the ease of cracking it using an optical
character recognition program (with almost % success rate).
[] This research highlights the attacks that were carried out on the Asirra Captcha [decom]
using machine learning and how it defended against those attacks. The paper also reviews
Asirra Captcha.
[] This paper introduces a new character segmentation technique of general value that can be
used to attack a wide number of text-based Captcha. It demonstrates how easy it is to
trounce the text recognition task given by Microsoft Captcha.
[] This research is based on how to crack down text-based Captcha using sparse
convolutional neural networks. Since many web service providers still use text-based
Captcha, this research exposes the loophole with AI algorithms.
[] This article illustrates the common method of breaking down Captchas using segmentation.
[] This article highlights the ease of cracking down text-based Captchas using automatic
segmentation. It also uses the recognition of Captchas with variable orientation and
random collapse of overlapped characters.
Table 2 (continued)
Paper Description
[] This paper discusses the robustness of text-based Captchas and how easily they can be
cracked. It also recommends alternatives to text-based Captcha.
[] This paper discusses machine learning techniques for cracking text-based and image-based
Captchas and presents an analysis of both types.
[] In this paper, the authors suggest cost-effective techniques to find loopholes in the
Captcha system.
Paper Description
[] This research describes the way one can use Captcha to safeguard from the dangers
of the Internet (spambots, data security, and others).
[] This paper outlines various ways to stop the illegal cracking of software on mobile
phones through the use of Captcha. Since phones cannot easily run optical
character recognition programs, Captcha is harder to crack on them.
Robinson This article does a study on how Captcha helped in distinguishing between humans and
(a) computers. It also describes how Yahoo kept out rogue spammers from its database.
[] This article argues that no method can guarantee complete spam protection:
attackers will always find new ways to spam, no matter how strong the defense
mechanism is. Captcha falls into the same category and has its positives and
negatives.
Table 3 (continued)
Paper Description
[] This article illustrates how e-commerce sites use Captcha to defend against
spambot attacks, and it also analyzes different types of
Captchas.
[] This article highlights how Captcha has been effective in getting rid of spammers.
Paper Description
[] This article describes how Captcha fails those who are blind or visually
impaired, raising questions of inclusivity.
Salvatore This article discusses the various types of Captchas and notes how easy it is
() to break Captcha using deep learning, even for a rookie.
The review of these articles across all categories illustrates how both Captcha
and ML have evolved over the years. It reveals a need for an alternative
Turing-test mechanism. Current technologies (like ReCaptcha) can be considered
safe to use but come with many privacy and ethical issues. Therefore, a better
alternative is needed to make this domain sustainable for further applications.
The review also suggests that Captcha and ReCaptcha have not failed on their own;
rather, the advancement of ML technologies is the reason for their decline. The
decline of Captcha and ReCaptcha has tracked the advancement of ML capabilities
and the ease of access to them. Beyond the growth of computational power and ML,
Captcha also faces inclusivity issues. Thus, a better alternative can be crafted
if Captcha proves obsolete.
7 Research method
The core idea of the experiment was to propose categories of Captcha that
have not been explored in previous research papers. The dataset of Captcha images
was divided into the proposed categories, which were then evaluated against
various ML algorithms to identify the categories most vulnerable to being cracked
by these techniques. This could provide design guidelines for Captcha designers
to resist cracking by ML algorithms.
A public dataset was used to build the research design. This provided a large
set of Captcha images to learn from and develop categories, and ample training
data for the successful working of the ML model. While versions of the dataset
are available on Kaggle, this research used a modified and cleaned dataset by
Rodrigo Wilhelmy available on ResearchGate [27]. The dataset consists of 1,040
images along with 1,040 labels. It is based on the most commonly used Captchas,
and the images are five-character strings containing a combination of numbers
and letters. The fixed five-character length provides consistency in the
application of ML algorithms for cracking and predicting the actual letters and
numbers; keeping the length constant removes length as a source of bias. A sample
Captcha and its correct classification in
text is provided in Figure 4.
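As a sketch of how such a dataset might be validated before training, the snippet below assumes the common layout in which each image file is named after its label (for example, 2b827.png); the folder path and naming scheme are illustrative assumptions, not details given in this chapter:

```python
from pathlib import Path

def load_labels(folder):
    """Collect labels from files named '<label>.png' and check the
    fixed five-character length that the dataset relies on."""
    labels = sorted(p.stem for p in Path(folder).glob("*.png"))
    if any(len(lbl) != 5 for lbl in labels):
        raise ValueError("expected five-character Captcha labels")
    return labels
```

A check like this guards the consistency assumption (fixed label length) that the rest of the pipeline depends on.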
The dataset was further categorized into interesting categories that have not been
explored in the literature. Since the images were located in one folder with one
Captcha per image, Python scripts were developed to classify the images into
multiple categories, which were then appended to the dataset. The categories are
based on the following parameters: crossline, blur percentage, and Captcha
boundary. The crossline categories describe the way a crossover line is drawn
across the Captcha image; this line can be angular or perpendicular. Blur
percentage indicates the proportion of each Captcha image that is blurred rather
than rendered in bold characters. The Captcha boundary refers to the first and
last characters of the image: these can be numbers on both ends, letters on both
ends, or a mix of a number and a letter. A summary of these categories is
presented in Table 5.
Crossline category – Angular: the crossover line on the Captcha image is at an angle (not parallel) to the direction of the text.
Blur percentage – Majority: more than % of the image includes blurred characters and numbers.
Blur percentage – Minority: less than % of the image includes blurred characters and numbers.
Captcha boundary – Numbers: the first and the last characters of the Captcha image are numbers.
Captcha boundary – Characters: the first and the last characters of the Captcha image are letters.
Captcha boundary – Number_Character: the first and the last characters of the Captcha image include both a number and a letter.
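The boundary classification, for instance, can be derived directly from each image's label. The function below is a minimal sketch of that rule; the category names follow Table 5, and the sample label in the comment is illustrative:

```python
def classify_boundary(label: str) -> str:
    """Assign a Captcha label to a boundary category based on whether
    its first and last characters are digits or letters."""
    first, last = label[0].isdigit(), label[-1].isdigit()
    if first and last:
        return "Numbers"
    if not first and not last:
        return "Characters"
    return "Number_Character"

# e.g. classify_boundary("2b827") -> "Numbers"
```

The crossline and blur categories would need image analysis rather than label inspection, so they are not sketched here.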
The code for the entire process was developed in Python, using the follow-
ing libraries:
a. Matplotlib: a cross-platform data analysis and interactive plotting library
written in Python for use with NumPy's numerical extension. It is a feasible
open-source alternative to MATLAB, and developers can use Matplotlib's
application programming interfaces to embed plots in graphical user interface
software.
b. NumPy: a Python module that adds support for large, multidimensional ar-
rays and matrices, as well as a large set of high-level mathematical functions
for working with these arrays.
c. Keras: a robust and simple-to-use Python library for designing and testing
deep learning models. It wraps the fast numerical computing libraries Theano
and TensorFlow and enables the definition and training of neural network mod-
els in a few lines of code. It reduces the number of user actions needed to com-
plete basic tasks and delivers simple, actionable error messages.
d. TensorFlow: a software library for numerical computation of mathematical
expressions based on data flow graphs. The graph's nodes represent mathematical
operations, while the edges correspond to the multidimensional data arrays
(tensors) that pass between them.
To implement the pipeline of ML algorithms for cracking the Captcha
images in the dataset, the following steps were used:
1. Identify the unique letters and numbers represented in the images.
2. Create an array of indices and shuffle it if necessary.
3. Determine the sample size for training.
4. Divide the data into training and validation sets. For each image:
a. Read the image file.
b. Decode it and convert it to grayscale.
c. Convert to float32 in the range [0, 1].
d. Resize the image to the target dimensions.
e. Transpose the image so that the time dimension corresponds to the width
of the image.
f. Convert the characters in the label to numerical values.
5. Create a dictionary, since the model needs two inputs (the image and its label).
6. Calculate the loss value at training time and use self.add_loss() to attach it to
the layer.
7. Return the computed predictions after evaluation.
8. Add the first convolutional block, then the second convo-
lutional block.
9. Use two max-pooling layers, each with its own pool size and strides.
10. Before forwarding the output to the RNN part of the model, reshape
it appropriately.
11. Feed the images to the model and add an output layer, followed by a
connectionist temporal classification (CTC) layer that calculates the CTC loss at
each step for improved analysis.
12. Set the number of epochs: an epoch is an ML concept referring to one full
pass of the learning algorithm over the entire training dataset. Datasets are
typically organized into batches (especially when the amount of data is very
large); some practitioners loosely use the word iteration for the process of
running one batch through the model. Each epoch updates the model's internal
parameters. When a single batch contains the whole dataset, the learning
algorithm is called batch gradient descent. The number of epochs is an integer,
typically one or more. For example, with the 1,040 images in this dataset and a
batch size of 16 (an illustrative value), one epoch corresponds to 65 iterations.
Determining how many epochs a model should run depends on a number of parameters
relating to both the data and the model's objective; although attempts have been
made to automate this choice, a thorough interpretation of the data is often
required.
13. Perform the training and testing procedures.
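Steps 1 and 4c–f above can be sketched in NumPy as follows; the image dimensions and sample labels are illustrative assumptions rather than values taken from the chapter:

```python
import numpy as np

# Step 1: build the character vocabulary from the labels (sample labels assumed).
labels = ["2b827", "3bdfn", "bnwcd"]
characters = sorted(set("".join(labels)))
char_to_num = {c: i for i, c in enumerate(characters)}

def encode_label(label):
    # Step 4f: convert each character of the label to a numeric index.
    return [char_to_num[c] for c in label]

def preprocess(image):
    # image: (height, width) uint8 grayscale array.
    img = image.astype(np.float32) / 255.0    # step 4c: float32 in [0, 1]
    img = img.T[:, :, np.newaxis]             # step 4e: transpose so that
    return img                                # width becomes the time axis

img = np.random.randint(0, 256, size=(50, 200), dtype=np.uint8)
x = preprocess(img)
# x.shape -> (200, 50, 1)
```

The transpose in step 4e matters because the downstream RNN/CTC layers consume the image column by column, treating the width as a sequence of time steps.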
While the overall accuracy was 94%, the insights on category-level prediction
provide additional value. The last column in Table 6 indicates the percentage of
cases that could not be cracked by the best ML algorithm. Captcha images in the
majority blur category were the toughest to crack, which could inform future
design guidelines for Captcha images that resist ML-based cracking. In the
crossline category, the perpendicular category resisted cracking more often than
the angular category. When the boundary of the Captcha (the first and last
characters) is considered, a mix of numbers and letters provided the maximum
resistance to ML cracking. Therefore, combining all the categories, the best
Captcha design would have a perpendicular crossover line, majority blur, and a
mix of a number and a letter as the first and last characters of the text.
While this is a preliminary experiment in understanding the design space, it
opens the conversation around the best designs for defending against ML cracking.
The categories were designed based on the Captcha images available in the dataset;
with a greater variety of test images, more meaningful categories could be
designed. Further, the sample size can be increased in future work to test the
robustness of the model with various algorithms in place.