bitnode HEATMAP
Computing
Hyun-Min Chang
Supervisors:
Dr. Lucianna Kiffer, Lioba Heimbach
Prof. Dr. Roger Wattenhofer
Abstract
In this Bachelor’s thesis, I aimed to gain a broad overview of the node activity
for multiple cryptocurrencies by gathering data from various node explorers and
consolidating the data for analysis. The objective of this study is to understand
the distribution of nodes across different cryptocurrencies and identify any pat-
terns or trends that may exist. The results show that when looking at the overall
volume of cryptocurrencies, Bitcoin is leading by a large margin. This comes as
no surprise as Bitcoin represents the most well-known cryptocurrency today and
consistently has the largest market capitalization. I observed that the majority
of nodes for almost all of the observed cryptocurrencies are located in either the
USA or Germany and also noticed a general downward trend in node activity
for various cryptocurrencies during the data collection period. It is interesting
to note, however, that the distribution of the total number of nodes and that of
the number of active nodes show clear discrepancies in the data on Bitcoin. Sur-
prisingly, I also observe that two explorers that report on Bitcoin node data, and
therefore should theoretically report similar information, have large discrepan-
cies on the nodes they report. Reasons for this could lie in the methodology used
by the node explorer to gather the data, or that the data is received differently
depending on the location where the data is gathered.
In this thesis, I present my findings, discuss the limitations of the study,
and conclude with recommendations for future research.
Contents
Acknowledgements
Abstract
1 Introduction
1.1 Background and Context on the Peer-to-Peer Network Architecture
1.2 Research Objectives
1.3 Related Work
2 Data Collection
2.1 Overview of Data Sources and Methods
2.2 Data Collecting
2.3 Data Cleaning
2.3.1 Identification and Removal of Missing or Duplicate Values
2.3.2 Handling of Inconsistencies
3 Data Processing
3.1 Data Pre-Processing
3.2 Data Visualization
4 Results
4.1 Total Node Count
4.2 Total Country Data
4.3 Daily Country Data
4.4 Duration of Node Activity
4.5 IPs Appearing in Multiple Cryptocurrency Networks
Bibliography
Introduction
There has been extensive research on the peer-to-peer network and block prop-
agation mechanisms of various cryptocurrencies. For example, Decker and Wat-
tenhofer [4] study the spread of information in the Bitcoin network, while Kiffer
et al. [5] examine the inner workings of the Ethereum network.
Other research has focused on the security of these systems. For example,
Heilman et al. [6] present and analyze an attack on the peer-to-peer network
of Bitcoin, while Gervais et al. [7] study the security of ’Proof of Work’
blockchains. These studies provide insight into the vulnerabilities and
potential weaknesses of these decentralized systems and are important for
understanding the overall security of cryptocurrencies.
While these studies provide valuable insights into the functioning of individual
cryptocurrency networks, this research aims to take a broader perspective and
gain an overview of node activity across multiple cryptocurrencies.
Chapter 2
Data Collection
For this study, data on the node activities of multiple cryptocurrencies were
collected from publicly available node explorers and API sources. The node
explorers used in this research were Bitnodes [8] for Bitcoin, Ethernodes [9]
for Ethereum, and Etcnodes [10] for Ethereum Classic. Data from Bitnodes was
acquired through its API, while data from Etcnodes and Ethernodes was scraped
with web scrapers. Additionally, data from Blockchair [11] was obtained through
its API, which provides access to the node activities of multiple
cryptocurrencies, including Bitcoin, Bitcoin Cash, Dogecoin, Dash, Zcash,
Litecoin, and Groestlcoin. All the code for data collection, data analysis,
and data plotting was written in Python.
As both Bitnodes and Blockchair collect data on the Bitcoin network, one
would expect that their data would be consistent. Any discrepancies observed
may indicate the influence of additional factors.
Out of the four data sources used (Bitnodes, Ethernodes, Etcnodes, and
Blockchair), only Bitnodes included information on nodes utilizing the TOR
network. This is a valuable piece of information as it can provide insight into
potential disparities in activity between the TOR and non-TOR networks.
The process of obtaining data from the Bitnodes and Blockchair API was
straightforward, as it allowed for direct access to the current status of the respec-
tive networks in the form of JSON files. These files were subsequently converted
and saved as CSV files.
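The JSON-to-CSV conversion described above can be illustrated with a minimal sketch. It assumes a snapshot of the form {"nodes": {address: [values...]}}; the function name, the column list, and the choice to add the creation date at this step are illustrative assumptions, not the code used in this thesis.

```python
import csv
import io
import json

# Illustrative column names (assumption): the first few fields of a node entry.
FIELDS = ["Protocol Version", "User agent", "Connected since", "Services", "Height"]


def snapshot_to_csv(snapshot_json: str, creation_date: str) -> str:
    """Flatten a JSON snapshot {"nodes": {address: [values...]}} into CSV text."""
    snapshot = json.loads(snapshot_json)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["IP", *FIELDS, "Creation Date"])
    for address, values in snapshot["nodes"].items():
        # Keep only the known fields and pad short entries with empty cells.
        row = list(values[:len(FIELDS)])
        row += [""] * (len(FIELDS) - len(row))
        writer.writerow([address, *row, creation_date])
    return buffer.getvalue()
```

Writing to an in-memory buffer keeps the sketch free of side effects; in practice the returned text would be written to a file in the directory of the respective source.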
Obtaining data from Ethernodes and Etcnodes required the use of web scraping
techniques, as these sources did not offer an API. Initially, an attempt was
made to extract the data by fetching the HTML of the websites with the Python
requests library and searching it for ’</table>’ entries. However, this
approach was unsuccessful, as the tables were generated by JavaScript in the
browser and were therefore not present in the HTML returned to a Python script.
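The problem can be reproduced with a small check on the fetched HTML. The following sketch counts table rows in a static HTML string using only the standard library; the class and function names are hypothetical. A page whose table is built by JavaScript yields zero rows, because the rows only exist after a browser has executed the script.

```python
from html.parser import HTMLParser


class TableRowCounter(HTMLParser):
    """Counts <tr> start tags found in static HTML."""

    def __init__(self):
        super().__init__()
        self.rows = 0

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows += 1


def count_table_rows(html: str) -> int:
    """Return the number of table rows present in the given HTML text."""
    parser = TableRowCounter()
    parser.feed(html)
    parser.close()
    return parser.rows
```

If this count is zero for a page that clearly shows a table in the browser, the data has to be obtained in another way, for example by rendering the page or by querying the endpoint the page itself loads its data from.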
To overcome this challenge, a tutorial provided by Zoltan Bettenbuk ‘Build
To build a comprehensive dataset for this study, data was collected from
multiple sources covering different cryptocurrency networks. Four
custom-written scrapers were developed, one for each of the sources:
Ethernodes, Etcnodes, Bitnodes, and Blockchair. The Blockchair scraper was
designed to go through all the cryptocurrencies available in the Blockchair API.
A shell script was developed to run all the scrapers and thereby automate the
data collection process. This script was scheduled with crontab on a Linux
machine at the Swiss Federal Institute of Technology in Zurich and ran every
hour, starting on December 4th, 2022 and ending on January 10th, 2023. The
hourly interval was chosen because testing showed that the data changed from
hour to hour.
The collected data was stored in directories named after the source and in
CSV format. The files were named according to the website that was scraped,
with the date and hour appended to the end of the file name in the format
<sourcename yyyy-mm-dd hhmmss.csv>. In addition to the data fields provided
by each source, a field for "Creation Date" was added to the dataset. This field
refers to the date the CSV file was created and was added as an additional
measure to ensure that the data collection process and the data integrity can be
accurately tracked.
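The naming scheme and the added field can be sketched as follows. The function names are hypothetical, and the separator between the parts of the file name is an assumption (an underscore by default), since the exact character is not recoverable from the format shown above.

```python
from datetime import datetime


def snapshot_filename(source: str, when: datetime, sep: str = "_") -> str:
    """Build '<sourcename><sep>yyyy-mm-dd<sep>hhmmss.csv' for one scrape."""
    return f"{source}{sep}{when:%Y-%m-%d}{sep}{when:%H%M%S}.csv"


def add_creation_date(rows, when: datetime):
    """Return copies of the rows with a 'Creation Date' field appended."""
    stamp = when.strftime("%Y-%m-%d %H:%M:%S")
    return [{**row, "Creation Date": stamp} for row in rows]
```

Because the timestamp is stored both in the file name and in every row, a row can still be traced back to its hourly snapshot after several files have been merged.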
Field               First row                                                            Second row
IP                  qp6ro3mnogsi7manj3gdt5xhic43n45fumvg527z5uvtoy3vyp7re6yd.onion:8333  [2a01:4f8:222:16d6::2]:8333
Protocol Version    70016                                                                70015
User agent          /Satoshi:24.0.1/                                                     /Satoshi:0.18.1/
Connected since     2023-01-09                                                           2022-12-01
Services            1037                                                                 1037
Height              771476                                                               771476
Hostname            (empty)                                                              2a01:4f8:222:16d6::2
City                (empty)                                                              (empty)
Country Code        (empty)                                                              DE
Latitude            0                                                                    51.3
Longitude           0                                                                    9.49
AZN                 TOR                                                                  AS24940
Organization name   Tor network                                                          Hetzner Online GmbH
Creation Date       2023-01-11 20:00:00                                                  2023-01-11 20:00:00
Table 2.1: Example of Bitnodes CSV file format: header and the first couple of
rows (shown transposed, one line per CSV column)
Table 2.2: Example of Blockchair CSV file format: header and the first couple of
rows
Table 2.3: Example of Ethernodes CSV file format: header and the first couple
of rows
Table 2.4: Example of ETC-Nodes CSV file format: header and the first couple
of rows
The data collection process generally ran smoothly; however, there were two
instances where data was not collected. The first occurred on December 18th,
2022 from 8:00 to 16:00, when an outage of the Blockchair API resulted in no
data being collected. The second occurred from January 4th, 2023 18:00 to
January 5th, 2023 9:00, when no data was recorded for Bitnodes,
Blockchair-Bitcoin, Ethernodes, or Etcnodes. Despite efforts to determine the
cause of this second data loss, the reason is still unknown.
These instances of data loss, while unfortunate, represent a small fraction of
the overall dataset and did not have a significant impact on the study’s
conclusions. Specifically, 116 CSV files out of 9360 were lost, which
represents roughly 1.24% of the dataset.
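Gaps of this kind can be located by comparing the timestamps of the collected files against the expected hourly schedule. The sketch below is an illustration with hypothetical function names; it assumes one snapshot per source and hour and takes the timestamps as already parsed datetime objects.

```python
from datetime import datetime, timedelta


def missing_hours(collected, start: datetime, end: datetime):
    """Return every full hour in [start, end] for which no snapshot exists."""
    # Snapshots are created a few seconds after the hour, so truncate to the hour.
    seen = {ts.replace(minute=0, second=0, microsecond=0) for ts in collected}
    missing = []
    current = start
    while current <= end:
        if current not in seen:
            missing.append(current)
        current += timedelta(hours=1)
    return missing


def loss_percentage(lost: int, total: int) -> float:
    """Share of lost files in percent."""
    return 100 * lost / total
```

Calling loss_percentage(116, 9360) recomputes the share of lost files from the two counts given above.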
As mentioned in Section 2.2, there were two instances of data loss during the
data collection process.
It was decided that these instances of data loss would not significantly impact
the study’s overall findings as they represent a small proportion of the overall
dataset and were limited to a few hours rather than days. Therefore, this data
was not considered during the analysis. Additionally, it was also found that there
were missing values for certain entries, particularly for information that was not
considered crucial for the analysis of this study, such as the client version, city,
During data cleaning, some inconsistencies were identified in the country data.
A Python script was used to compare the country data of the dataset with that
of the geolocation service ipinfo.io [13]. For a small number of entries (e.g.
100 out of 15’000 for Bitnodes), the country information was not consistent.
To address this issue, a second Python script replaced every inconsistent
country value with the one obtained from ipinfo.io. Entries for which an error
occurred during this update were removed, as there were at most 10 such
occurrences per CSV file.
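The update step can be sketched as follows. The geolocation lookup is passed in as a function so that the sketch does not depend on any particular service client; the function name, the column names "IP" and "Country Code", and the returned counters are assumptions made for illustration.

```python
def reconcile_countries(rows, lookup):
    """Overwrite inconsistent country codes and drop rows whose lookup fails.

    rows:   list of dicts with the keys "IP" and "Country Code" (assumed names)
    lookup: callable mapping an IP address to a country code; raises on error
    """
    cleaned = []
    updated = dropped = 0
    for row in rows:
        try:
            reference = lookup(row["IP"])
        except Exception:
            # The lookup failed for this entry, so the entry is removed.
            dropped += 1
            continue
        if reference != row["Country Code"]:
            # Copy the row so that the input data stays unchanged.
            row = {**row, "Country Code": reference}
            updated += 1
        cleaned.append(row)
    return cleaned, updated, dropped
```

Returning the number of updated and dropped entries makes it easy to verify per file that only a handful of rows were affected.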
Chapter 3
Data Processing
Several Python scripts were written to analyze the data and to generate
visualizations that provide insight into the state and trends of the
cryptocurrency networks.
Firstly, ‘count_nodes_per_day.py’ (A.2.5) was written to count the number
of active nodes per day for each cryptocurrency using the prepared consolidated
data. The results were then plotted as a line graph, with individual plots gener-
ated for each cryptocurrency and a single plot that included all cryptocurrencies
for comparison of scale.
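The counting step can be sketched in a few lines. This is an illustration, not the code of ‘count_nodes_per_day.py’: it assumes that a node counts as active on a day if it appears in at least one snapshot of that day, and that each row carries the fields "IP" and "Creation Date" (the latter starting with yyyy-mm-dd).

```python
from collections import defaultdict


def count_nodes_per_day(rows):
    """Return {day: number of distinct nodes seen on that day}, sorted by day."""
    nodes_by_day = defaultdict(set)
    for row in rows:
        day = row["Creation Date"][:10]  # keep only the yyyy-mm-dd part
        nodes_by_day[day].add(row["IP"])
    return {day: len(ips) for day, ips in sorted(nodes_by_day.items())}
```

Using a set per day ensures that a node appearing in several hourly snapshots of the same day is counted only once.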
Additionally, for Bitnodes two additional plots were created. One compares
the number of nodes using TOR in Bitnodes with the number of nodes not using
TOR, and another compares the number of nodes not using TOR in Bitnodes
with the number of nodes in Blockchair-Bitcoin, as both provide data on nodes
not using TOR for Bitcoin.
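One simple way to separate the two groups is to look at the address itself, since TOR nodes are listed with a ‘.onion’ host name. The sketch below uses hypothetical function names and assumes addresses of the form host:port.

```python
def uses_tor(address: str) -> bool:
    """True if the host part of 'host:port' is a .onion address."""
    host = address.rsplit(":", 1)[0]  # strip the port, keep IPv6 brackets intact
    return host.endswith(".onion")


def count_by_tor(addresses):
    """Count how many addresses use TOR and how many do not."""
    counts = {"tor": 0, "non_tor": 0}
    for address in addresses:
        counts["tor" if uses_tor(address) else "non_tor"] += 1
    return counts
```

Only the non-TOR count is comparable with Blockchair-Bitcoin, which is why the split is computed before the two sources are plotted against each other.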
To supplement the line graphs, bar charts were created to show how many
nodes were active for how many days. This provided a visual representation of
the distribution of active nodes over time and helped to identify patterns in node
activity, such as how many nodes were only active for a short or extended period
of time.
Unfortunately, some nodes were located in countries that were not included
in the ‘countries.csv’ file used to obtain the coordinates. To resolve this
problem, the coordinates for Curaçao [15] and Andhra Pradesh [16] had to be
added manually.
More scripts (A.2.8) were written to count the number of nodes that were
active for a certain number of days for each cryptocurrency. The results were
plotted as a bar chart, providing insight into the level of activity of the nodes
and allowing for the identification of any patterns or trends in the data, such as
nodes that were active for only a short time.
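The data behind such a bar chart can be derived as sketched below. The function name and the row fields "IP" and "Creation Date" are assumptions; a node is treated as active on a day if it appears in at least one snapshot of that day.

```python
from collections import Counter, defaultdict


def days_active_histogram(rows):
    """Return {number of active days: number of nodes with that many days}."""
    days_by_node = defaultdict(set)
    for row in rows:
        days_by_node[row["IP"]].add(row["Creation Date"][:10])
    histogram = Counter(len(days) for days in days_by_node.values())
    return dict(sorted(histogram.items()))
```

The keys of the result correspond to the x-axis of the bar chart and the values to the bar heights.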
Finally, ‘check_for_double_ip.py’ (A.2.9) was used to analyze the frequency
of IP addresses appearing across multiple cryptocurrencies. The script processed
the consolidated data for each cryptocurrency and counted the number of oc-
currences for each IP address. The resulting data were then plotted as a bar
chart, where the x-axis represented the number of occurrences and the y-axis
represented the number of unique IP addresses with that number of occurrences.
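The two counting steps can be illustrated with a short sketch. It is not the code of ‘check_for_double_ip.py’; the function name is hypothetical, and the input is assumed to be one set of IP addresses (without ports) per data source.

```python
from collections import Counter
from typing import Dict, Set


def ip_occurrence_histogram(ips_by_source: Dict[str, Set[str]]) -> Dict[int, int]:
    """Return {number of sources an IP appears in: number of such IPs}."""
    sources_per_ip = Counter()
    for ips in ips_by_source.values():
        # Each source contributes at most one occurrence per IP.
        sources_per_ip.update(set(ips))
    histogram = Counter(sources_per_ip.values())
    return dict(sorted(histogram.items()))
```

For example, an IP that is listed by Bitnodes, Blockchair-Bitcoin, and one further source contributes to the bar at x = 3.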
Chapter 4
Results
From Figure 4.1, it can be seen that the cryptocurrency with the most nodes is
Bitcoin, followed by Zcash, Ethereum, and Dash. Each of these has thousands of
nodes, while Litecoin, Dogecoin, Bitcoin Cash, and Ethereum Classic each have
around 1’000 nodes. Groestlcoin has the fewest, with fewer than 100 nodes.
The plots in Figure 4.2 show that the overall number of active nodes for
Bitcoin is experiencing a steady upward trend and that the increase stems from
an increase in nodes utilizing TOR, as the number of active nodes not utilizing
TOR is relatively stable, or potentially even decreasing.
Figure 4.2: Left: Node activity recorded by Bitnodes. Right: Node activity
recorded by Bitnodes split into nodes that use and do not use TOR
In Figure 4.5 we see that Blockchair-Dash and Etcnodes have both seen a
sudden increase. However, more data would be needed to conclude whether these
increases are only temporary or indicative of a sustained rise in activity.
In the analysis of node distribution by country, only the top 10 countries with the
highest number of nodes are represented in the visualizations, with all remaining
countries grouped together and labeled as "Other". It is crucial to acknowl-
edge that the total number of nodes in each country in the "Other" category is
significantly lower than that of the individually named countries.
Overall, the USA (in yellow) has the highest total number of nodes over all
cryptocurrencies, closely followed by Germany (in light blue). These two
countries hold the top two places in almost all cryptocurrencies as seen in
Figure 4.2 and Figure 4.2.
Figure 4.6: Total nodes per country over all cryptocurrencies combined.
We can also make this observation by looking at the heatmap in Figure 4.9.
The two bright red spots over the USA and Germany indicate the highest amount
of activity.
These plots show the number of daily active nodes in the top 10 countries for
each cryptocurrency.
When comparing the daily country data with the total country data from the
previous subsection, we can see that the USA and Germany have swapped places
for Bitnodes and Blockchair-Bitcoin.
To differentiate between the relative magnitudes of the various data, the total
country data was represented in a bar chart format.
Figure 4.10: Comparison between daily (active) and total data for Bitnodes.
Figure 4.11: Comparison between daily (active) and total data for Blockchair-
Bitcoin
The following plots show how many nodes were active for how many days.
A notable observation is that for Blockchair-Bitcoin-Cash and Blockchair-
Dash, a significant proportion of the nodes exhibit a high degree of activity
throughout the duration of the study as seen in Figure 4.12 from the peaks on
the right.
In Figure 4.14 it can be observed that about 17’000 IPs appear in more than one
data source, but the majority of those are IPs that appear in both Bitnodes and
Blockchair-Bitcoin, which is expected as both cover the Bitcoin network.
Figure 4.14: Left: This plot shows how many IPs appear in how many different
node explorers.
Right: This plot shows which combinations appear how often. All labels except
’bitnodes,bitcoin’ were removed for clarity. ’bitcoin’ refers to Blockchair-Bitcoin,
which was simplified to ’bitcoin’ for readability of the plot
From Section 4.1 we can clearly see that, looking at the overall volume of
cryptocurrencies, Bitcoin is leading by a large margin. This comes as no
surprise, as Bitcoin represents the most well-known cryptocurrency today.
Section 4.2 shows that the majority of nodes are located in either the USA
(27.8%) or Germany (26.0%). It is interesting to note, however, that the dis-
tribution of the total number of nodes in Section 4.2 and that of the number
of active nodes in Section 4.3 show clear discrepancies. While the country with
the most Blockchair-Bitcoin nodes is Germany with 26.0%, the number of active
Blockchair-Bitcoin nodes is visibly higher in the USA (Figure 4.11).
One possible explanation can be found by comparing the activity of nodes.
In Section 4.4 we can see that more than half of the nodes recorded for Bitcoin
were active for only a couple of days. If many of these short-lived nodes are
located in Germany, it would explain why Germany has a lower number of
active nodes than the USA, despite having a higher number of total nodes.
It is also important to note that the lines in Figure 4.3 do not match perfectly.
As both Bitnodes and Blockchair-Bitcoin focus on the same cryptocurrency, they
should theoretically gather the same data. One reason for the discrepancy could
lie in the methodology used by the node explorer to gather the data. Another ex-
planation could be that the data is received differently depending on the location
where the data is gathered.
This would be consistent with the following finding from the paper "Under the
hood of the Ethereum gossip protocol" [5]:
“We also find that a node’s location has a significant impact on when it
hears about blocks, and that the precise behaviour of this has changed
over time (e.g., nodes in the US have become less likely to hear about
new blocks first).”
5. Discussion and Conclusion 22
The code used for data collection, analysis, and plotting within this Bachelor’s
thesis is all self-written. As I do not specialize in web scraping or in data
collection through APIs, the code is likely to contain inefficiencies and
security weaknesses, and thus should not be used for operational purposes.
Working with large files (approx. 5GB of CSV files) also made it more
difficult to properly process the data and maintain data integrity. This translated
into a need for more complex data transformations to visualize and plot the data.
While I believe that the data collection and transformation process was properly
implemented, the code and the overall data collection and preparation process
should be iteratively reworked and optimized.
The data for the active nodes per country was taken from ipinfo.io. This re-
search paper assumes that the underlying information of the provider is accurate.
However, this might not be the case. Particularly for IP addresses that are asso-
ciated with virtual private networks (VPNs) or proxies, ipinfo.io’s service might
not always provide accurate results. As such, the results should not be taken for
granted and should be subject to constant scrutiny. To ensure the accuracy of
the data, more research into ipinfo.io’s methodology would be required.
The same is true for the overall data collection process. For this research
paper, only publicly available node explorers were used. This approach, due to its
simplicity, facilitated the data collection. However, it also added unpredictable
variables, as the use of node explorers prevented direct control over the data
collection methodology. One could circumvent this problem by obtaining the
data directly from the network. This would also ideally eliminate or resolve the
identified discrepancies between the data from Bitnodes and Blockchair-Bitcoin.
By gathering data directly from the networks, it would also be possible to
gather data on TOR usage for other cryptocurrencies, not just for Bitnodes,
which would allow for a more conclusive and complete analysis.
That said, I believe that this thesis has achieved its goal of providing an
overview and understanding of the current landscape of cryptocurrencies, while
also identifying discrepancies that form the basis for further research.
Bibliography
[1] “Peer-to-peer blockchain networks: The rise of p2p crypto exchanges,”
https://learn.bybit.com/bybit-p2p-guide/peer-to-peer-blockchain-network/, 2022.
[12] Z. Bettenbuk, “Build a javascript table web scraper with python in 5 steps,”
https://www.scraperapi.com/blog/scrape-javascript-tables-python/, 2022.
Source files
All of the code and the data are available in the following GitLab repository:
https://gitlab.ethz.ch/disco-students/hs22/changh-measuring-cryptocurrency-networks
A.2 Scripts
Here are short descriptions of the scripts mentioned in the research. The scripts
are listed in order of execution. Please refer to the GitLab repository to see the
complete code.
A.2.1 Scrapers
manage to request all the data at once and the maximum that worked for
me was 100) from
’ The data gets compiled into a CSV file named ’ethernodes yyyy-mm-dd
hhmmss.csv’, which gets saved into a folder ’ethernodes’ that the script
creates.
A.2.3 Mergers
• plot_dubplicated_ips.py: Plots
’/check_for_double_ip/duplicated_ips.csv’ as presented in Section 4.5
the country data was compared to the data given by ipinfo.io. Initially, those
discrepancies were just deleted, but as in the final data processing and analysis
the country data were updated with the data from ipinfo.io these scripts became
redundant.
Appendix B
All Plots
B.7 Heatmaps
These are the heatmaps created with the help of folium. As these are interactive
HTML files, please refer to the GitLab repository to view them in detail.