Blockchain Goes Green? An Analysis of Blockchain on Low-Power Nodes
Dumitrel Loghin∗, Gang Chen‡, Tien Tuan Anh Dinh∗, Beng Chin Ooi∗, Yong Meng Teo∗
∗National University of Singapore, ‡Zhejiang University
∗[dumitrel,dinhtta,ooibc,teoym]@comp.nus.edu.sg, ‡[email protected]
arXiv:1905.06520v2 [cs.DC] 17 Jun 2019
not only the analysis, but our experience of running blockchain on various system architectures.

• We show that low-power ARM-based systems struggle to run full-fledged blockchain workloads, mainly due to insufficient memory size and bandwidth. For example, the low-end Raspberry Pi 3 wimpy node is unable to run Ethereum, and it requires non-trivial code modifications and special configuration to run Hyperledger.

• We show that systems with the lowest power profile do not necessarily achieve the best energy efficiency. For example, Jetson TX2 is more energy-efficient than Raspberry Pi 3, even if the latter has a lower power profile.

• We show that wimpy nodes can achieve reasonable performance while saving significant amounts of energy. For example, eight Jetson TX2 nodes trade 17% and 72% of Parity and Hyperledger throughput, respectively, for 18× and 23× lower energy consumption compared to eight Xeon nodes.

• Our analysis of Ethereum performance leads to an insight into the design trade-off in newer Ethereum releases compared to the older ones used in [11]. In particular, the new design has lower throughput due to the cost of many transaction execution restarts.

The remainder of this paper is organized as follows. In Section 2 we present background and related work on blockchain systems. In Section 3 we describe the hardware systems and blockchain workloads used in this study. We also provide a detailed characterization of the hardware systems in this section. In the next two sections, we analyze the time and energy performance at single-node and cluster level. We conclude in Section 6.

2. BACKGROUND AND RELATED WORK

In this section, we provide a background on blockchain systems and survey the related work on time and energy performance analysis of blockchains.

2.1 Blockchain Systems

A blockchain is a distributed ledger running on a network of mutually distrusting nodes (or peers). The ledger is stored as a linked list (or chain) of blocks of transactions. The links in the chain are built using cryptographic pointers to ensure that no one can tamper with the chain or with the data inside a block.

Blockchains are most famous for being the underlying technology of cryptocurrencies, but many blockchains are able to support general-purpose applications. This ability is determined by the execution engine and data model. For example, Bitcoin [17] supports only operations related to cryptocurrency (or token) manipulation. On the other hand, Ethereum [7] can run arbitrary computations on its Turing-complete Ethereum Virtual Machine (EVM). At data model level, there are at least three alternatives used in practice. The Unspent Transaction Output (UTXO) model, used by Bitcoin, represents the ledger states as transaction ids and associated unspent amounts which are the input of future transactions. The account/balance model resembles a classic banking ledger. A more generic model, used by Hyperledger, consists of key-value states. On top of the data model, one can write general applications that operate on the blockchain's states. Such applications are called smart contracts. In this paper, we use the BLOCKBENCH benchmarks, which provide a set of smart contracts for Hyperledger, Ethereum and Parity.

Depending on how nodes can join the network, a blockchain is public or private (or permissioned). In public networks, anybody can join or leave and, thus, the security risks are high. Most of the cryptocurrency blockchains are public, such as Bitcoin [17] and Ethereum [7]. On the other hand, private blockchains allow only authenticated peers to join the network. Typically, private blockchains, such as Hyperledger [6] and Parity [18], are deployed inside or across big organizations.

Blockchains operate in a network of mutually distrusting peers, where some peers may not be just faulty but malicious. Hence, they assume a Byzantine environment, in contrast to the crash-failure model used by the majority of distributed systems. To ensure consistency among honest peers, most private blockchains use Byzantine fault-tolerant consensus protocols such as PBFT [8], whereas most public blockchains use proof-of-work (PoW) consensus protocols. In PoW, participating nodes, called miners, need to solve a difficult cryptographic puzzle. The node that solves the puzzle first has the right to append transactions to the ledger. On the other hand, PBFT consists of exchanging O(n²) messages among the nodes to reach agreement on the transactions to be appended to the blockchain. These consensus protocols are considered the Achilles' heel of blockchain due to poor time-energy performance. While PoW is scalable since it can run in parallel on all nodes, it is compute-intensive and, thus, it is both slow and power-hungry on traditional brawny servers. PBFT exhibits quadratic time growth with the number of nodes in the network, leading to energy wastage.

Our analysis in the next sections confirms that a PoW-based blockchain, such as Ethereum, uses more power compared to a PBFT- or PoA-based blockchain. A PBFT-based blockchain, such as Hyperledger, uses almost the same power as a PoA-based blockchain, such as Parity, on small networks of up to eight nodes.

2.2 The Time-Energy Analysis of Blockchains

There are a number of related works that analyze the performance of blockchains [11, 19]. However, only a few include energy analysis [21, 23], and the analysis is of limited depth.

BLOCKBENCH [11] is a benchmarking suite comprising both simple (micro) benchmarks and complex (macro) benchmarks. The micro benchmarks, namely CPUHeavy, IOHeavy and Analytics, stress different subsystems such as the CPU, memory and IO. On the other hand, the YCSB macro benchmark implements a key-value storage, while Smallbank represents OLTP and simulates banking operations. These benchmarks are implemented as smart contracts in Ethereum, Parity and Hyperledger. Their performance in terms of throughput and latency is evaluated on traditional high-performance servers with Intel Xeon CPUs. In this paper, we extend BLOCKBENCH to include time-energy analysis of a wider range of systems, with a focus on low-power nodes.

Sankaran et al. [21] analyze the time and energy performance of an in-house Ethereum network consisting of high-performance mining servers and low-power Raspberry Pi clients. These low-power systems cannot run Ethereum mining due to their limited memory size, hence, they only take the role of clients. In this paper, we run Ethereum full nodes on low-power devices with higher performance, such as Intel NUC and Jetson TX2. To the best of our knowledge, we are the first to run and analyze the time-energy performance of full-fledged blockchains on low-power systems.

MobiChain [23] is an approach that allows mining on mobile devices running Android OS, in the context of mobile commerce. While containing analysis of both time and energy performance, MobiChain has no comparison to other blockchains. In terms of energy analysis, the authors show that it is more energy-efficient to group multiple transactions in a single block since there is less mining work and therefore less time and power wasted in this process. However, larger blocks increase latency and result in poor user experience.

Jupiter [16] is a blockchain designed for mobile devices. It aims to address the problem of storing a large ledger on mobile devices with limited storage capacity. However, there is no time or energy performance evaluation.

To the best of our knowledge, we provide the first extensive time-energy performance analysis of blockchain systems on low-power, wimpy nodes in comparison with high-performance server systems.

3. EXPERIMENTAL SETUP

In this section, we describe our experimental setup, starting with the systems and ending with the workloads. We present a detailed characterization of the selected systems at CPU, memory, storage and networking levels. The results of this detailed characterization are summarized in Table 1.

Table 1: Systems characterization

3.1 Systems

We compare the time and energy performance of low-power systems against a high-performance traditional server system. This server system is based on an x86/64 Intel Xeon E5-1650 v3 CPU clocked at 3.5GHz, and has 32GB DDR3 memory, a 2TB hard disk (HDD) and a 1Gbps network interface card (NIC). It runs Ubuntu 14.04 with Linux kernel 3.13.0-95.

The low-power systems used for the analysis are (i) Intel NUC [2], (ii) NVIDIA Jetson TX2 [14] and (iii) Raspberry Pi 3 model B (RP3) [1]. The NUC system is based on an x86/64 Intel Core i3 CPU with two physical cores that support Hyperthreading and run at 2.4GHz. This system has 32GB DDR4, a 256GB solid-state drive (SSD) and a 1Gbps NIC. It runs Ubuntu 16.04 with Linux kernel 4.15.0-34.

The TX2 system is based on a heterogeneous 6-core 64-bit CPU with two NVIDIA Denver cores and four ARM Cortex-A57 cores clocked at more than 2GHz. The system has 8GB LPDDR4, a 32GB SD card and a 1Gbps NIC. TX2 runs Ubuntu 16.04 with Linux kernel 4.4.38-tegra of aarch64 (64-bit ARM) architecture.

The RP3 has a 4-core ARM Cortex-A53 CPU of 64-bit ARM architecture and 1GB of LPDDR2 memory. This system has a 64GB SD card that acts as storage and a 100Mbps NIC. It runs Debian 9 (stretch) with Linux kernel 4.9.80-v7+ (32-bit ARM).

We measure the power and energy consumption of these systems with a Yokogawa power meter connected to the AC lines. We report only AC power and energy values in this paper. We believe that these values are more useful compared to DC measurements since they reflect the final billable energy.
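The AC-side methodology above reduces, in essence, to integrating sampled power over the run and dividing by the elapsed time for the average. A minimal sketch of that bookkeeping; the trace, sampling interval and values below are illustrative, not measurements from the paper:

```python
# Sketch: turning periodic AC power samples (as a meter such as the
# Yokogawa reports them) into energy and average power.
# The 1 Hz trace below is made up for illustration.

def energy_joules(samples_w, dt_s):
    """Trapezoidal integration of power samples taken every dt_s seconds."""
    return sum((a + b) / 2.0 * dt_s for a, b in zip(samples_w, samples_w[1:]))

trace = [50.0, 52.0, 54.0, 52.0, 50.0]         # watts, one sample per second
energy = energy_joules(trace, dt_s=1.0)        # joules over the 4 s window
avg_power = energy / (1.0 * (len(trace) - 1))  # watts
print(energy, avg_power)                       # 208.0 52.0
```

In the tables that follow, energy is equivalently taken as execution time multiplied by average power, which is the denominator of the PPR metric.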
Figure 1: Performance and average power of the four systems: (a) CoreMark on one core; (b) CoreMark on all cores (Xeon: 12 cores, NUC: 4 cores, TX2: 6 cores, RP3: 4 cores); (c) Keccak256 on one core; (d) Keccak512 on one core. Performance is reported in IPS for CoreMark and MBPS for Keccak, power in W.
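The Keccak panels of Figure 1 report throughput as bytes hashed per second. A sketch of such a harness; hashlib.sha256 stands in for Keccak here (go-ethereum ships its own Keccak-256/512 implementation), and the 1 MB buffer is scaled down from the one billion bytes used in the paper:

```python
# Sketch: measuring hash throughput in MB/s, in the spirit of the
# Keccak256/512 experiment. sha256 is a stand-in for Keccak.
import hashlib
import os
import time

def hash_throughput_mbps(data: bytes, algo=hashlib.sha256) -> float:
    start = time.perf_counter()
    algo(data).digest()          # hash the whole buffer once
    elapsed = time.perf_counter() - start
    return len(data) / 1e6 / elapsed

buf = os.urandom(1_000_000)      # random input, as in the paper's setup
mbps = hash_throughput_mbps(buf)
```

On real hardware, repeating the measurement and averaging, as the paper does with at least three runs, smooths out timer and scheduler noise.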
3.2 Systems Characterization

Before analyzing the time and energy of blockchains on the selected systems, we evaluate the hardware at CPU, memory, storage and networking level to understand their relative performance. The measured values and system characteristics are summarized in Table 1.

We first measure idle system power when the hardware is running only the OS. We obtain 50W, 9W, 2.4W and 1.9W for Xeon, NUC, TX2 and RP3, respectively. These values clearly show the power efficiency gap between brawny nodes used in the majority of datacenters and supercomputers, and wimpy nodes used at the edge.

To assess CPU performance, we use the CoreMark benchmark, which is increasingly used by the industry, including vendors that equip their systems with ARM CPUs [4]. CoreMark measures CPU performance in terms of iterations per second (IPS). We present the performance and average power usage in Figure 1a and Figure 1b for CoreMark running on a single core and all cores, respectively. For multi-core analysis, we enable all available cores, including virtual cores in systems that support Hyperthreading. For example, we use twelve and four virtual cores on Xeon and NUC, respectively.

At single-core level, the performance of Xeon is 1.4, 2.1 and 7 times higher compared to NUC, TX2 and RP3, respectively. But this performance comes at the cost of 5.1×, 8× and 23.5× higher power consumption. However, we note that this is the power used by the entire system, which includes other components beside the CPU. We then estimate the power of the CPU by subtracting the idle system power. One Xeon core uses almost 20W, while one ARM core from RP3 uses only 1.1W. Hence, the performance-to-power ratio (PPR) of the RP3 is superior to that of the Xeon.

At multi-core level, TX2 exhibits better performance than NUC, mainly because of its six real cores compared to only two real cores on NUC. Moreover, TX2 uses less power than NUC to deliver higher performance. Therefore, it is expected that TX2 has a better time-energy performance for multi-threaded workloads. We also observe that the performance does not scale perfectly with the number of cores. For example, Xeon exhibits only a 7.4 times performance boost when 12 cores are used. TX2 performs better, with a 5.6 times performance increase when 6 cores are used. This sub-linearity is due to resource contention, both in-core and off-core [24].

Blockchain systems rely heavily on cryptographic operations that are CPU-intensive. We evaluate the CPU on this type of workload by measuring the performance and average power of the Keccak secure hash algorithm from go-ethereum v1.8.15, compiled with go 1.11. We run both Keccak256 and Keccak512 on a random input of one billion bytes. The throughput measured in MB per second (MBPS) represents the performance of these cryptographic algorithms on the selected systems. As shown in Figure 1c and Figure 1d, the performance trends are similar to CoreMark. RP3 exhibits much lower performance: almost 320× and 190× lower throughput compared to Xeon on Keccak256 and Keccak512, respectively. The lower system power of RP3 running these cryptographic operations compared to CoreMark suggests that the core is not fully utilized. In fact, it is often stuck in memory operations that use less power compared to arithmetic operations. As we shall see in the next paragraph, RP3's memory has significantly lower bandwidth than the other three systems.

We analyze the performance of the memory subsystem in terms of bandwidth. We use lmbench [22] to get the read-write bandwidth and plot the results in Figure 2. At level one cache (L1), Xeon has the highest bandwidth, which is almost 60GB/s, while NUC, TX2 and RP3 exhibit bandwidths of 37GB/s, 19GB/s and 6GB/s, respectively. This is expected since server-class processors, such as Xeon, have optimized caches. However, at the main memory level, NUC leads with a bandwidth of 12.5GB/s, followed closely by Xeon with 10GB/s. This lower performance of Xeon is attributed to the older DDR3 memory generation. TX2 and RP3 exhibit main memory bandwidths of less than 4GB/s and 1GB/s, respectively. This low bandwidth, together with the small memory size, hinders the execution of modern workloads on wimpy systems.

Figure 2: Memory bandwidth comparison (read-write bandwidth vs. memory access size, from 1kB to 1GB)

At storage level, there is a mixed performance profile since the systems are equipped with different types of storage mediums. To assess throughput and latency, we use the dd and ioping Linux commands, respectively. As expected, the SSD of NUC exhibits the highest throughput and the lowest latency. On the other hand, the SD cards used by TX2 and RP3 exhibit low throughput, high latency and significant read/write asymmetry. Since modern operating systems cache files or chunks in memory, we also measure the buffered read throughput. We observe that this throughput follows the memory bandwidth trend, except for NUC where the buffered throughput of 6.6GB/s is half of the memory bandwidth.

At networking level, we measure the bandwidth and latency using the iperf and ping Linux commands, respectively. As expected, RP3 exhibits lower TCP and UDP bandwidths since it is equipped with a 100Mbps NIC, compared to the Gigabit Ethernet NICs of the other systems. The slightly higher latency of TX2 and RP3 can be attributed to the lower clock frequency of the wimpy systems. To validate this hypothesis, we measured the networking latency while setting the clock frequency to a fixed step. TX2 supports twelve frequency steps in the range 346MHz-2.04GHz. We obtained a Pearson correlation coefficient of -0.93 between the twelve frequency steps and the corresponding latencies, suggesting strong inverse proportionality. For example, the networking latency at 346MHz is 0.33ms, while at 2.04GHz it is 0.25ms. On RP3, there are only two available frequency steps, but we obtained similar results. When setting the frequency to 600MHz and 1.2GHz, we obtained networking latencies of 0.35ms and 0.29ms, respectively.

Observation 1. In summary, the hardware systems have the following characteristics.

Observation 1.1. x86/64 wimpy devices, such as Intel NUC, are comparable to server systems at memory and storage level while using 5× less power. However, CPU performance is lower when running multi-threaded workloads due to the small number of cores.

Observation 1.2. High-end ARM-based wimpy devices, such as Jetson TX2, have the potential to achieve high PPR at the cost of lower time performance compared to x86/64 systems.

Observation 1.3. Low-end ARM-based devices, such as Raspberry Pi 3, suffer from low core clock frequency, small and low-bandwidth memory. These systems may not be able to run modern server-class workloads, including blockchains.

3.3 Workloads

We use BLOCKBENCH [11] with minor changes¹ to assess blockchain performance. We were not able to compile go-ethereum v1.4.18, evaluated in the original BLOCKBENCH paper [11], on TX2 due to issues with older versions of the go toolchain on the aarch64 architecture. We also encountered issues with the compilation of parity-ethereum v1.6.0 on all systems due to broken Rust packages. Hence, we use go-ethereum v1.8.15 compiled with go 1.11 and parity-ethereum v2.1.6 compiled with cargo 1.30.0 on all systems. For the Hyperledger experiments we use version v0.6, which supports PBFT consensus.

The micro-benchmarks in BLOCKBENCH assess the performance of different subsystems. CPUHeavy uses quicksort to sort an array of integers, while IOHeavy implements Write and Scan operations that touch key-value pairs to stress the memory and IO subsystems. The Analytics benchmark simulates typical OLAP workloads as found in traditional databases. It implements three queries. The first query (Q1) computes the total value of transactions between two blocks. The second (Q2) and third (Q3) compute the maximum transaction and the maximum account balance, respectively, between two blocks for a given account. This benchmark requires an initialization step that creates 120,000 accounts and generates over 100,000 blocks with an average of three transactions per block.

The macro-benchmarks in BLOCKBENCH are complex database applications stressing all key subsystems. For example, YCSB evaluates the performance of a key-value store with configurable read-write ratios, while Smallbank represents OLTP workloads by simulating banking transactions. The Donothing benchmark estimates the overhead of consensus protocols since it performs no computations and no IO operations inside the smart contract. In this paper, the macro-benchmarks are run on clusters of nodes.

All workloads are run at least three times. We report the average values and standard deviations.

¹The updated source code of BLOCKBENCH is available at https://github.com/dloghin/blockbench

3.4 Raspberry Pi 3 (RP3) Setup

RP3 is unable to run go-ethereum since it has only 1GB of RAM while Ethereum requires more than 4GB. Modifying go-ethereum to run on low-end wimpy devices is left to
future work. In this paper, we only report the performance
Table 2: Time, Power and PPR of Hyperledger (each cell lists the values for Xeon, NUC, TX2 and RP3, in this order)

Workload (Size) | Time Avg [s] | Time Std [s] | Power Avg [W] | Power Std [W] | PPR [ops/J]
CPUHeavy (1000000) | 1.0 1.0 1.1 2.4 | 0.0 0.0 0.0 1.7 | 50.6 9.0 2.4 2.1 | 1.7 0.2 0.0 0.1 | 19,459.2 106,885.5 383,004.3 308,875.7
CPUHeavy (10000000) | 1.2 1.2 1.7 2.5 | 0.0 0.0 0.0 0.0 | 52.0 10.9 3.8 2.7 | 1.0 0.7 0.6 0.2 | 165,735.7 739,181.8 1,597,910.5 1,494,017.5
CPUHeavy (100000000) | 2.8 3.9 8.3 17.7 | 0.0 0.2 0.2 1.1 | 72.9 14.8 4.6 2.9 | 1.0 1.2 0.1 0.1 | 485,530.3 1,720,378.6 2,607,959.9 1,978,696.9
IOHeavy Write (3200000) | 1,055.2 1,721.1 7,365.3 11,911.7 | 52.3 88.9 78.3 142.1 | 83.1 17.9 3.7 3.3 | 0.1 0.0 0.0 0.0 | 36.6 104.4 117.2 82.3
IOHeavy Write (6400000) | 2,125.1 3,473.2 14,675.3 27,246.0 | 85.3 139.9 114.0 309.6 | 83.0 17.9 4.5 3.1 | 0.1 0.1 1.1 0.0 | 36.3 103.1 102.9 74.9
IOHeavy Write (12800000) | 4,299.0 7,025.0 28,957.0 63,891.3 | 102.5 162.7 751.9 917.2 | 83.0 17.9 3.7 3.0 | 0.1 0.0 0.1 0.0 | 35.9 101.6 120.1 66.2
IOHeavy Scan (3200000) | 744.1 1,442.0 6,191.7 8,915.3 | 5.9 7.2 79.3 42.6 | 83.7 15.1 3.1 3.2 | 0.1 0.0 0.0 0.0 | 51.4 147.1 169.3 112.4
IOHeavy Scan (6400000) | 1,487.1 2,871.4 10,960.3 17,296.3 | 11.5 26.5 2,195.8 118.8 | 83.6 15.1 3.6 3.2 | 0.0 0.0 0.8 0.0 | 51.5 147.4 169.4 114.6
IOHeavy Scan (12800000) | 2,966.1 5,768.3 25,049.0 34,274.7 | 20.2 84.4 257.7 702.6 | 83.6 15.1 3.0 3.2 | 0.1 0.1 0.0 0.0 | 51.6 147.3 167.7 115.3
Analytics Q1 (10000) | 9.7 18.8 157.6 103.8 | 0.0 0.0 0.3 5.0 | 90.3 18.1 2.9 3.2 | 0.5 0.1 0.0 0.0 | 11.4 29.5 21.8 30.6
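The PPR column in Table 2 is operations per joule, i.e., operations divided by the time-power product. A quick check against the rounded averages in the table; note the published PPR values were computed from unrounded measurements, so recomputing from the rounded time and power lands near, but not exactly on, them:

```python
# Sketch: performance-to-power ratio (PPR) in ops/J, as used in Table 2.
def ppr(ops: int, time_s: float, power_w: float) -> float:
    """Operations per joule; energy = execution time * average power."""
    return ops / (time_s * power_w)

# CPUHeavy with 1,000,000 elements on Xeon: 1.0 s at 50.6 W (rounded
# table values); the table itself lists 19,459.2 ops/J from raw data.
xeon = ppr(1_000_000, 1.0, 50.6)   # roughly 19,762.8 ops/J
```

The same helper applies to every row: a system can win on PPR either by finishing faster or by drawing less power, which is exactly the trade-off the observations in this section discuss.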
Figure 4: Execution time [s] of geth v1.8.15 with an increasing number of miner threads, one panel per system (Xeon: up to 12 threads; NUC: up to 4; TX2: up to 6)
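One way to see why execution time can grow super-linearly with miner threads, given the restart-on-new-block behavior analyzed in Section 4.2, is a toy model of our own: if each of m miner threads seals empty blocks independently at some rate, block arrivals form a Poisson process whose rate scales with m, a transaction commits only after an uninterrupted window of its own length, and the expected number of attempts grows exponentially in m. The rate and transaction length below are invented for illustration, not measured:

```python
# Toy model: expected restarts under "abort on every new block".
# Block arrivals ~ Poisson(miners * block_rate_hz); an attempt succeeds
# only if no block arrives for tx_s seconds, so the success probability
# is exp(-miners * block_rate_hz * tx_s) and the expected number of
# attempts is its reciprocal.
import math

def expected_attempts(miners: int, block_rate_hz: float, tx_s: float) -> float:
    return math.exp(miners * block_rate_hz * tx_s)

one_thread = expected_attempts(1, 0.1, 5.0)    # e^0.5, about 1.6 attempts
four_threads = expected_attempts(4, 0.1, 5.0)  # e^2.0, about 7.4 attempts
```

The model also suggests why run-to-run variation is high: the number of block arrivals hitting any particular execution is random, so the realized restart count fluctuates around this expectation.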
input size increases because (i) the CPU utilization is low but roughly constant, and (ii) the memory and storage have lower dynamic power fluctuations [5]. For example, the average CPU utilization is 9.5% (3.5% standard deviation) and 8.1% (3.6% standard deviation) during IOHeavy Write and Scan, respectively, for 3.2 million key-value pairs.

The energy is the product of execution time and average power usage. Xeon and RP3 exhibit the highest energy cost due to high power usage for the former and long execution time for the latter. On the other hand, TX2 and NUC almost always use the lowest energy. TX2 is almost always the most efficient because of its lower power profile compared to NUC, and higher performance compared to RP3. We note that even if RP3 has a very low power profile, its memory and CPU limitations translate to a larger energy cost than systems with higher power profiles.

In summary, we make the following observation concerning Hyperledger execution.

Observation 2. The highest energy efficiency is achieved by low-power systems with a balanced performance-to-power profile, rather than systems with a low power profile but also low performance.

4.2 Ethereum

Figure 4 shows a super-linear increase in the execution time of go-ethereum (geth) v1.8.15 with an increasing number of miner threads on the three systems under evaluation. Recall that RP3 is unable to run Ethereum. To investigate the cause of the high execution time with more miner threads, we break down the execution time into three components as described in BLOCKBENCH [11]. We profiled geth with Go pprof, analyzing both the call graph and the cumulative execution time per routine. From our analysis, consensus starts with the call of go-ethereum/consensus/ethash.(*Ethash).Seal.func1, while execution starts with the invocation of go-ethereum/core/vm.(*EVMInterpreter).Run. The remaining time is spent at the application and data layers of the blockchain stack. We observed that execution takes longer as the number of miner threads increases. This would suggest that EVM is inefficient when more miner threads are used.

We observed high variations among executions of the same benchmark with the same number of miner threads. Table 3 shows five executions of CPUHeavy on NUC with four miner threads. The high variations are visible in go-ethereum releases starting from v1.8.14. Previous releases, including v1.4.18 used by BLOCKBENCH [11], exhibit relatively stable execution. We found that starting with go-ethereum v1.8.14 a transaction is started, or applied, multiple times, and that this number is inconsistent among different runs. For example, a CPUHeavy transaction is applied as few as 16 and as many as 481 times in go-ethereum v1.8.14. This is explained by the fact that go-ethereum v1.8.14 underwent a significant design change. Specifically, whenever a miner thread receives a new block, it discards any transactions currently being executed and applies the transactions in the newly received block. In our case, there is a single transaction, and during its execution the miners keep generating empty blocks. As a result, the probability of receiving a block during the transaction execution increases with the number of miners. Therefore, the transaction is interrupted and restarted many times. As we shall see in the next section, the same results hold for the cluster setting and for non-CPU-heavy workloads.

We note that this design works well when a newly received block contains updates to states currently used by the transaction being executed. In this case, it saves time to stop the current transaction and restart it after the new block is applied. However, interrupting transactions even when receiving an empty block results in unnecessary overhead. A more elegant approach is to restart only transactions whose states are affected by the new block.

In summary, we make the following observation concerning Ethereum execution.

Observation 3. In the latest versions of Ethereum, (i) execution time increases with the number of miner threads and (ii) there is high execution time variation among different runs of the same workload, especially when the workload is computation-heavy or when more miner threads are used. These are due to the new transaction restarting mechanism, which restarts execution when receiving a new block, even if that block is empty.

Table 3: Comparison of five CPUHeavy(1M) executions with four miner threads on NUC with different versions of Ethereum

Time-energy performance. The time, power and PPR of Ethereum running with one miner thread are shown in Table 4. Across systems, we observe the same pattern as that of Hyperledger. In other words, Observation 2 holds. Even if TX2 exhibits the highest execution time in general, its energy usage is the lowest and, thus, its PPR is the highest. For example, the IOHeavy Scan benchmark with 10,000 key-value pairs is 3.2× slower on TX2, but uses 4.9× less energy than on Xeon.

As expected, Ethereum uses more power than Hyperledger. In particular, for sorting 1M values, Xeon, NUC and TX2 use 50.6W, 9W and 2.4W, respectively, in Hyperledger, as opposed to 81W, 17W and 5W, respectively, in Ethereum. There are two reasons for this behavior. First, Ethereum uses more cryptographic operations, which incur high CPU utilization. Second, Ethereum uses EVM, an interpreted execution environment which is less efficient than Hyperledger's Docker execution. Consequently, the CPU performs more work in Ethereum.

Our evaluation demonstrates high variability, especially for IOHeavy operations, as indicated by the high standard deviations in Table 4. Execution profiling of IOHeavy Write shows that much of the time is spent in the EVM interpreter. For example, writing 10,000 key-value pairs on Xeon spends 71% of the time inside the EVM interpreter, while sorting one million numbers spends only 10% in the same routine. The root cause is the same as for running multiple miner threads, namely, the transaction is restarted multiple times until it manages to finish. Transactions that perform more work and take longer to finish have higher chances of being restarted and, thus, take even longer to finish under geth v1.8.15. For example, an execution of sorting one million numbers on Xeon finishes in 9s and restarts the transaction 2 times. In contrast, an execution of IOHeavy Write of 10,000 key-value pairs finishes in 458s and is restarted 63 times.

We note that we were unable to run CPUHeavy with input sizes of 10M and 100M. While the BLOCKBENCH paper [11] reports execution times for Ethereum CPUHeavy on 10M input, in our experiments the clients never finish the execution.

4.3 Parity

The time-power results for Parity are presented in Table 5. Unlike Ethereum, Parity is able to run on the wimpy RP3 system. On the other hand, none of the systems is able to run the CPUHeavy workload with 100M input.

Recall that RP3 is 2-3× slower than TX2 for Hyperledger. This gap is much bigger for Parity. In particular, RP3 is 8× slower than TX2 when running CPUHeavy with 10M input. Our profiling using Linux perf shows that RP3 spends significant time in libarmmem.so, which is a library for memory operations on ARM-based systems. This, together with a low CPU utilization of 10%, suggests that memory is the main bottleneck of Parity execution on RP3. In contrast, the other systems spend most of the time in the execution layer, i.e., inside the EVM interpreter.

The variability in execution time among different runs is less visible in Parity compared to Ethereum. Table 5 shows high standard deviations only for IOHeavy workloads with small input size. We attribute this to the memory hierarchy, especially to CPU caches and memory buffers that need time to warm up and may exhibit unpredictable behavior on shorter executions. Indeed, CPUHeavy and Analytics do not exhibit execution time variability. The former is not memory or I/O intensive. The latter includes an initialization step that warms up the caches and memory buffers.

As expected, the power consumption of Parity is lower compared to Ethereum, but higher when compared to Hyperledger. Taking the CPUHeavy workload as an example, Xeon, NUC and TX2 use 57.6W, 12.7W and 3.6W, respectively, to sort one million values in Parity. For the same amount of work, Ethereum consumes 81W, 17W and 5W on Xeon, NUC and TX2, respectively, while Hyperledger consumes
Table 5: Time, Power and PPR of Parity
Execution Time [s] Power [W]
Performance-to-Power Ratio [ops/J]
Workload Size Average Std. dev. Average Std. dev.
Xeon NUC TX2 RP3 Xeon NUC TX2 RP3 Xeon NUC TX2 RP3 Xeon NUC TX2 RP3 Xeon NUC TX2 RP3
1000000 64.9 71.1 147.1 1,205.5 26.0 0.0 3.7 6.5 57.6 12.7 3.6 2.6 1.2 0.3 0.1 0.0 324.2 1,106.6 1,910.1 316.0
CPUHeavy
10000000 469.7 705.1 1,371.0 12,205.4 4.3 0.2 83.0 367.2 71.5 14.7 4.5 2.7 1.2 0.4 0.0 0.0 298.1 967.4 1,634.5 302.0
100 84.8 42.1 30.8 62.2 15.3 15.2 0.2 45.3 50.7 8.8 2.5 2.1 0.8 0.0 0.0 0.0 0.0 0.3 1.3 1.2
IOHeavy Write 1000 170.4 96.0 106.7 287.0 84.1 26.0 40.0 5.6 52.8 10.5 3.0 2.5 0.1 0.2 0.1 0.1 0.1 1.1 3.6 1.4
10000 124.7 186.9 380.7 2,996.5 0.4 5.8 1.1 25.8 71.0 14.2 4.2 2.7 1.8 0.1 0.1 0.0 1.1 3.7 6.3 1.2
100 63.1 52.6 30.7 82.5 0.0 15.2 0.0 40.0 50.0 8.8 2.5 1.9 0.8 0.1 0.0 0.0 0.0 0.3 1.3 0.9
IOHeavy Scan 1000 42.2 149.0 51.9 72.2 15.1 30.0 30.0 40.0 50.1 8.7 2.5 2.0 0.9 0.0 0.1 0.0 0.5 0.8 10.1 9.7
10000 52.6 191.7 30.3 112.7 14.9 78.3 1.2 2.8 51.0 9.5 2.8 2.5 0.9 0.1 0.1 0.0 4.1 6.7 119.6 35.5
Analytics Q1 1000 1.2 2.0 10.5 14.2 0.0 0.0 0.3 2.3 51.1 14.0 2.8 2.5 1.3 1.4 0.0 0.2 16.3 36.9 34.8 28.1
Analytics Q2 1000 1.2 1.9 10.2 14.6 0.0 0.0 0.2 1.4 49.5 14.4 2.7 2.5 1.7 1.1 0.0 0.1 17.1 36.0 35.7 27.8
Analytics Q3 1000 0.5 0.7 1.8 4.0 0.0 0.0 0.1 0.0 50.3 13.5 4.6 2.9 1.7 2.3 0.1 0.0 42.1 102.8 108.5 85.0
50.6W, 9W and 2.4W, respectively. This behavior can be explained by the lower power overhead of Parity's PoA consensus protocol compared to Ethereum's PoW. Parity also has an interpreted EVM which is not as efficient as Hyperledger's Docker execution engine, and which draws additional power.

Observation 2 also holds for Parity. In particular, Xeon and NUC are the fastest systems, while TX2 uses the smallest amount of energy due to its shorter execution time than RP3 and lower power usage than Xeon and NUC.

4.4 Impact of Storage Subsystem
In this section, we analyze the impact of different types of storage subsystems on blockchain performance using the IOHeavy benchmarks. We select the TX2 system, which has interfaces for SD card, SATA storage and USB3 devices. While the baseline is the system with a 64GB SD card (TX2+SDC), we separately connect a 1TB SSD through SATA (TX2+SSD) and an external 2TB HDD through USB3 (TX2+HDD). The SD card stores the OS in both TX2+SSD and TX2+HDD.

We first measure the I/O performance in terms of raw read/write throughput and latency. Then, we measure the performance of the IOHeavy benchmarks in Hyperledger and Parity. Ethereum is not included in this analysis because of its unpredictable behaviour, as discussed in Section 4.2. In addition, we evaluate the impact of storage on the total power by measuring the idle power, when the hardware is running only the OS, and the active power, during blockchain execution. The results are summarized in Table 6.

Table 6: Impact of storage subsystem

Metric                     TX2+SDC   TX2+SSD   TX2+HDD
Idle System Power [W]      2.4       2.9       5.9
Write Throughput [MB/s]    16.3      206.0     87.6
Read Throughput [MB/s]     88.9      277.0     93.4
Write Latency [ms]         17.1      2.8       13.7
Read Latency [ms]          2.8       1.8       1.2
IOHeavy Write (10000)
  Hyperledger Time [s]     18.2      22.4      24.0
  Parity Time [s]          380.7     382.8     386.2
  Hyperledger Power [W]    7.2       7.7       10.0
  Parity Power [W]         4.2       4.8       7.7
IOHeavy Scan (10000)
  Hyperledger Time [s]     28.8      31.7      29.8
  Parity Time [s]          30.3      32.5      43.1
  Hyperledger Power [W]    3.1       3.6       6.6
  Parity Power [W]         2.8       3.3       6.4

In terms of raw performance, the SSD is the clear winner. It has almost 13× higher write throughput and 3× higher read throughput than the SD card, while adding only 0.5W to the idle power. In contrast, the HDD adds 3.5W to the idle power, thus increasing it by 2.5×. The HDD has more than 5× higher write throughput but similar read throughput compared to TX2+SDC.

Interestingly, Jetson with the SD card exhibits slightly better execution time when running IOHeavy. We attribute this to the fact that the SD card stores the OS, libraries and Docker containers in all three configurations, including TX2+SSD and TX2+HDD. Hence, the ledger storage subsystem is not a bottleneck; otherwise, TX2+SDC would exhibit higher execution time due to its lower raw throughput and higher latency. In fact, our profiling of IOHeavy Write shows that write operations are sparse, with an average of 1MB/s and a peak of 21.5MB/s across all subsystems. These values are within the capabilities of all storage subsystems, but switching between the execution contexts of the SD card and the ledger storage may induce overhead.

In terms of power, we observe that IOHeavy Scan adds only 0.7W and 0.4W to the idle power of all system configurations for Hyperledger and Parity, respectively. IOHeavy Write uses more power, adding between 4.1W and 4.8W for Hyperledger and around 1.8W for Parity. These results are stable, in general. The only notable exception is IOHeavy Scan in Parity on TX2+HDD which, in general, finishes in 32.5s, but in some cases finishes in 64s or 97s.

In summary, we make the following observation.

Observation 4. Wimpy nodes can accommodate conventional storage subsystems of large capacity, therefore they can store large ledgers. The storage subsystem type does not significantly affect the I/O performance of Hyperledger and Parity.

4.5 Bootstrapping Performance
In this section, we analyze the performance of bootstrapping, which is the process of joining a blockchain network and synchronizing the distributed ledger. We consider one node that joins an existing network of seven other nodes of the same type. Prior to the bootstrapping process, we generate over 100 blocks by running the YCSB workload on the 8-node blockchain network. We then stop one node, delete its ledger and caches, and restart it so that it synchronizes the ledger with the other nodes.

Hyperledger v0.6 adopts a lazy bootstrapping approach where synchronization is started only when new transactions are submitted. Hence, the execution time and power of synchronization and of transaction execution cannot be clearly separated. Here, we report the time taken by Hyperledger to update its block tip to a certain value. To synchronize around 2750 blocks, Hyperledger on Xeon takes 40s while using 51.25W. Interestingly, TX2 is faster than Xeon: it takes less than 20s
while using up to 3W. We attribute this to the networking setup. In particular, the Xeon cluster runs on NFS, which adds some overhead. We note that both systems use a relatively low power compared to their peak power. This is because the blocks are downloaded from the other peers without executing all transactions.

Ethereum supports three bootstrapping modes: fast, full and light [13]. In light mode, which is intended for wimpy systems, only the current state is downloaded from other peers. In fast and full mode, all blocks are downloaded. However, only in full mode are all the transactions applied, which means it is slower than fast mode. In our experiments, we do not consider light mode because it is very fast on both wimpy and brawny nodes. Ethereum takes 14.8s and 28.8s to synchronize around 2000 blocks in fast and full mode on Xeon, respectively, while using 120W. On TX2, it takes 57.2s and 6.7W to synchronize in full mode, and only 4s and 6W to synchronize in fast mode.

By default, Parity uses fast (or warp) synchronization which skips "almost all of the block processing" [12]. However, we observed that synchronizing the ledger in Parity takes much longer than in Ethereum, even when warp syncing is on. In particular, synchronizing 100 blocks in Parity takes over 4 hours on Xeon and over 3 hours on TX2, whereas in Ethereum it takes 2.6s and 2.2s on Xeon and TX2, respectively. This is a well-known issue in Parity^4, with some users blaming the I/O subsystem. But our profiling shows that the peak I/O write rate is around 1MB/s, which is much lower than the available throughput of the storage system. Moreover, the power during Parity's synchronization is close to the idle power: 51W and 2.4W on Xeon and TX2, respectively. This shows that Parity is not doing much work during the synchronization process. We therefore conclude that the synchronization inefficiency lies in Parity's implementation rather than in the hardware.

^4 For example, users report on StackExchange that synchronizing with the main network in 2018 took a few days (https://bit.ly/2UvIR1g)

5. CLUSTER ANALYSIS
In this section, we analyze the time-energy performance of blockchains on a cluster. We consider both homogeneous clusters, consisting of nodes of the same type, and heterogeneous clusters, consisting of multiple types of nodes.

5.1 Homogeneous Cluster
We consider Xeon-only and TX2-only clusters. The former is the fastest, the latter the most energy-efficient. We vary the cluster size from 2 to 8 nodes. The clients that issue requests run on separate nodes and, unlike the analysis in [21], they are not included in our performance evaluation. Our main focus is on the blockchain nodes.

5.1.1 Impact of request rate
We first examine the throughput, latency and power usage with increasing request rate. We fix the cluster size to 8 nodes, and use 8 clients to send transactions. We increase the transaction rate from 8 to 4096 transactions per second (tps). The results, depicted in Figure 5 for the YCSB benchmark, show that Hyperledger is able to sustain a throughput of up to 2220 tps and 630 tps on Xeon and TX2, respectively. Ethereum achieves a throughput of up to 39.7 tps and only 3.3 tps on Xeon and TX2, respectively. Parity achieves a maximum throughput of 30 tps and 25 tps on Xeon and TX2, respectively, when the client request rate is 512 tps. Similar patterns are observed when running the Smallbank and Donothing benchmarks.

[Figure 5 (plots): throughput, latency and power usage of the YCSB benchmark with increasing transaction rate, for Hyperledger, Ethereum and Parity on Xeon and TX2]

To achieve peak throughput, Hyperledger uses 618W on Xeon and only 26.4W on TX2. Parity uses even less power, ranging between 400W and 480W on Xeon, and between 20W and 26W on TX2. In contrast, Ethereum uses the most power, between 860W and 900W on Xeon and around 49W on TX2.

These results can be summarized in the following observation.

Observation 5. Higher-end wimpy nodes, such as Jetson TX2, achieve around one-third of Hyperledger throughput and almost the same performance for Parity compared to brawny Xeon nodes, while using 18× to 23× less power. These nodes have the potential of achieving significant power and cost savings.

Standard deviation is relatively low in most of the cases, with the highest outliers being the latency under high request rates. In particular, Hyperledger's latency exhibits a standard deviation of 111.5% and 42.5% on Xeon with 4096 tps and TX2 with 1024 tps request rate, respectively. For throughput, the maximum standard deviation is 101.1% for Ethereum on TX2 and 30.8% for Parity on Xeon. Power
consumptions have low standard deviation: below 1% on Xeon and 4.5% on TX2.

[Figure 6 (plots): (a) Throughput, (b) Latency, (c) Power of the YCSB benchmark with increasing number of nodes, for Hyperledger, Ethereum and Parity on Xeon and TX2]

Ethereum execution on TX2 is irregular, as shown in Figure 5, and has higher standard deviation compared to the other two blockchains. Moreover, Ethereum throughput is much lower and its latency is higher when using version v1.8.15 compared to v1.4.18 evaluated in BLOCKBENCH [11]. As shown in Figure 7 for YCSB, v1.4.18 achieves a maximum of 284.4 tps for a transaction request rate of 1024 tps, while v1.8.15 achieves only 39.7 tps. The increase in latency is relatively smaller, with maximum latencies of 137 and 154 seconds for v1.4.18 and v1.8.15, respectively.

[Figure 7 (plots): throughput and latency vs. transaction rate for geth v1.4.18 and v1.8.15]
Figure 7: Throughput and latency comparison between different versions of Ethereum

[Figure 8 (plot): CDF of apply transaction count for geth v1.4.18 and v1.8.15]
Figure 8: Comparison of apply transaction count distribution in two versions of Ethereum

We note that the higher throughput reported for v1.4.18 is attributed to (i) different parameter settings and, more fundamentally, to (ii) a design change in Ethereum. First, there are changes in gas values in the newer versions. This requires increasing the gas value to 0x10000 in order to run the YCSB benchmark on v1.8.15. Second, a transaction may be restarted multiple times in v1.8.15, as discussed in Section 4.2.

To understand the second factor contributing to the low throughput, we profile the code to record the number of times a unique transaction, as represented by its hash, is restarted, or applied. Even though the average number of times a transaction is applied is similar, namely 20 times, we observed that a higher number of unique transactions are executed by geth v1.4.18 than by geth v1.8.15. These unique transactions are reflected in the throughput. Furthermore, transactions are restarted more often in geth v1.8.15, as shown in Figure 8. The maximum number of restarts in v1.8.15 is much higher than in v1.4.18, namely 183 times versus 105 times.

5.1.2 Impact of network size
Next, we examine the scalability with increasing number of blockchain nodes and clients. We use the same number of clients as the number of nodes. We choose a request rate that saturates the systems, as identified in the previous section. In particular, for Xeon we set the rate per client node to 512, 8 and 64 tps for Hyperledger, Ethereum and Parity, respectively. On TX2, we set the rate per client to 128, 4 and 64 tps.

Figure 6 shows the throughput for YCSB with increasing number of nodes. We attribute the fluctuations of Ethereum on TX2 to the non-deterministic transaction restarting mechanism. The lower throughput, when compared to Xeon, is due to the compute-intensive PoW consensus protocol. In fact, the power usage of Ethereum is 2× higher than Hyperledger and Parity on TX2. Specifically, 6 TX2 nodes use, on average, 37.9W, 19.8W and 17.4W when running Ethereum, Hyperledger and Parity, respectively.

The latency of Ethereum increases significantly on TX2, from 46.7s on 2 nodes to 195.6s on 8 nodes. This is 4.5× higher than the latency on 8 Xeon nodes. On the other
hand, Parity's latency decreases with the number of nodes: from 87.4s on 2 nodes to 46.7s on 8 nodes. In summary, Ethereum is virtually unusable on wimpy systems due to (i) low throughput and high latency caused by PoW consensus, and (ii) unstable performance due to transaction restarting.

[Figure 9 (bar charts): peak throughput and power of YCSB on homogeneous and heterogeneous clusters; (a) Hyperledger, (b) Parity]

5.2 Heterogeneous Cluster
In this section, we examine the effects of heterogeneous nodes on the overall blockchain performance. The baselines of homogeneous clusters are represented by (i) 4 Xeon nodes and (ii) 4 TX2 nodes. From the homogeneous Xeon cluster, we replace two nodes with TX2 (Xeon+TX2); from the homogeneous TX2 cluster, we replace two nodes with RP3 (TX2+RP3). We run the distributed benchmarks for Hyperledger and Parity. Ethereum is left out because it cannot be run on RP3.

As shown in Figure 9 for the peak throughput of YCSB, the performance degrades when lower-performance nodes are introduced. But the power consumption improves because the heterogeneous cluster uses less power. In particular, Xeon+TX2 has a performance drop of 35% but uses 53% less power than the homogeneous Xeon cluster when running Hyperledger. The results are better for Parity, where a 43% power saving causes only a 10% loss of throughput. However, adding RP3 nodes to a TX2 cluster does not yield satisfactory results. For Hyperledger, the throughput drops by 62%, while the power decreases only slightly, from 13.4W to 11.8W (only 12% power savings). For Parity, the power consumption of the heterogeneous cluster is even higher than that of the homogeneous cluster, 12.8W versus 12W, while the throughput drops from 15.3 tps to 11.5 tps.

Similar to the analysis of homogeneous clusters, the results here demonstrate that higher-end wimpy nodes have the potential of reducing power usage while achieving reasonable performance. However, heterogeneous clusters with wimpy nodes may not always achieve the best PPR. More specifically, if the performance gap between different types of nodes is too large, the low-power profile of the wimpy nodes does not lead to better energy efficiency due to lower throughput and increasing latency.

6. CONCLUSIONS
In this paper, we performed an extensive time-energy analysis of representative blockchain workloads on low-power, wimpy nodes in comparison with traditional brawny nodes. The wimpy nodes used in our analysis cover the low-end and high-end performance spectrum, and both x86/64 and ARM architectures.

We found that higher-end wimpy nodes achieve reasonable performance with significantly lower energy than brawny nodes. In particular, a Jetson TX2 cluster with eight nodes achieves more than 80% and almost 30% of Parity and Hyperledger throughput, respectively, while using 18× and 23× less power, respectively, than an 8-node Xeon cluster.

We also found that wimpy nodes with well-balanced PPR achieve higher energy efficiency compared to extremely low-power nodes. For example, a TX2 is more energy-efficient than a Raspberry Pi 3, even though the former has an idle power of 2.4W and a peak power of more than 10W, while the latter has 2W and 5W, respectively. The better energy efficiency of TX2 compared to RP3 is due to its higher performance while keeping a low power profile at subsystem level, including the CPU, memory and storage.

Finally, we found that recent versions of Ethereum suffer from low and unstable performance. This is due to the transaction restarting mechanism that stops and discards transaction execution whenever new blocks are received, even if those blocks are empty. This fact, together with the high cost of the PoW consensus protocol, makes Ethereum unusable on wimpy nodes.

7. REFERENCES
[1] Raspberry Pi 3 Model B. http://bit.ly/1WTq1N4, 2016.
[2] Intel NUC Kit NUC7i3BNH. http://www.webcitation.org/74GPLkSah, 2017.
[3] Ubuntu Docs - What is swappiness and how do I change it? http://www.webcitation.org/76QeVALC9, 2018.
[4] ARM. ARM Announces Support For EEMBC CoreMark Benchmark. http://www.webcitation.org/6RPwNECop, 2009.
[5] L. A. Barroso, J. Clidaras, and U. Hoelzle. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Second Edition. Morgan and Claypool Publishers, 2nd edition, 2013.
[6] T. Blummer. An Introduction to Hyperledger. 2018.
[7] V. Buterin. A Next-Generation Smart Contract and Decentralized Application Platform. 2013.
[8] M. Castro and B. Liskov. Practical Byzantine fault tolerance. In Proceedings of the Third Symposium on Operating Systems Design and Implementation, OSDI '99, pages 173–186, Berkeley, CA, USA, 1999. USENIX Association.
[9] Digiconomist. Bitcoin Energy Consumption Index. http://www.webcitation.org/74GL5jBxg, 2018.
[10] Digiconomist. Ethereum Energy Consumption Index (beta). http://www.webcitation.org/74GLngHMZ, 2018.
[11] T. T. A. Dinh, J. Wang, G. Chen, R. Liu, B. C. Ooi, and K.-L. Tan. BLOCKBENCH: A Framework for Analyzing Private Blockchains. In Proc. of 2017 ACM International Conference on Management of Data, pages 1085–1100, 2017.
[12] Parity Tech Documentation. Getting Synced. https://wiki.parity.io/Getting-Synced, 2019.
[13] EtherWorld. Understanding Ethereum Light Node. http://www.webcitation.org/77lSRvuey, 2018.
[14] D. Franklin. NVIDIA Jetson TX2 Delivers Twice the Intelligence to the Edge. http://www.webcitation.org/73M0i1pIf, 2017.
[15] V. Gupta and K. Schwan. Brawny vs. Wimpy: Evaluation and Analysis of Modern Workloads on Heterogeneous Processors. In Proc. of 27th International Symposium on Parallel and Distributed Processing Workshops and PhD Forum, pages 74–83, 2013.
[16] S. Han, Z. Xu, and L. Chen. Jupiter: A Blockchain Platform for Mobile Devices. In Proc. of 34th IEEE International Conference on Data Engineering (ICDE), pages 1649–1652, 2018.
[17] S. Nakamoto. Bitcoin: A Peer-to-peer Electronic Cash System. 2008.
[18] Parity.io. Blockchain Infrastructure for the Decentralised Web. 2018.
[19] S. Pongnumkul, C. Siripanpornchana, and S. Thajchayapong. Performance Analysis of Private Blockchain Platforms in Varying Workloads. In Proc. of 26th International Conference on Computer Communication and Networks, pages 1–6, 2017.
[20] N. Rajovic, L. Vilanova, C. Villavieja, N. Puzovic, and A. Ramirez. The Low Power Architecture Approach Towards Exascale Computing. Journal of Computational Science, 4(6):439–443, 2013.
[21] S. Sankaran, S. Sanju, and K. Achuthan. Towards Realistic Energy Profiling of Blockchains for Securing Internet of Things. In Proc. of 38th IEEE International Conference on Distributed Computing Systems, pages 1454–1459, 2018.
[22] C. Staelin and L. McVoy. lmbench - system benchmarks. http://www.webcitation.org/74EthsKEa, 2018.
[23] K. Suankaewmanee, D. T. Hoang, D. Niyato, S. Sawadsitang, P. Wang, and Z. Han. Performance Analysis and Application of Mobile Blockchain. In Proc. of International Conference on Computing, Networking and Communications, pages 642–646, 2018.
[24] B. M. Tudor, Y. M. Teo, and S. See. Understanding Off-Chip Memory Contention of Parallel Programs in Multicore Systems. In Proc. of International Conference on Parallel Processing, pages 602–611, 2011.
[25] L. van Doorn. Enabling Cloud Workloads Through Innovations in Silicon. http://www.webcitation.org/6t33R0NZg, 2017.
APPENDIX
A. ADDITIONAL RESULTS

Hyperledger on RP3. Figure 10 compares three different runs of Hyperledger with and without explicitly calling Go's garbage collector on RP3. Figure 10a represents the same execution plotted in detail in Figure 3. In almost all cases, Hyperledger with explicit GC invocation uses less memory and is as fast, if not faster, than Hyperledger without explicit GC invocation. On the other hand, the GC incurs more mmap/munmap system calls. On average across our experiments, Hyperledger with explicit GC incurs 70 mmap and 4 munmap calls, respectively, while Hyperledger without GC invocation incurs 50 mmap and 2 munmap calls, respectively.

Ethereum. Figure 11 compares the execution time and power usage of different runs of the same CPUHeavy workload on Ethereum with four miner threads, when running on the NUC node. We observe significant execution time differences, while the power is roughly constant at around 23W. Compared to the idle power of 9W and the CoreMark power of 18.6W, Ethereum's power usage is higher, suggesting that the system is doing heavy work not only at CPU level, but also at memory and I/O.

Table 7 compares the number of times transactions are restarted (applied) in two versions of Ethereum on a cluster setup with varying request rate. We present the minimum, maximum and average times, with standard deviation, across all unique transactions. We also show how many unique transactions are executed and the total number of times the ApplyTransaction() method is called.

Single-node Time-Power-Energy. Execution time, power usage and total energy of CPUHeavy and IOHeavy workloads are plotted in Figures 12, 13 and 14 for Hyperledger, Ethereum and Parity, respectively.

Cluster Performance. The throughput, latency and power usage of Smallbank and Donothing workloads at cluster level are plotted in Figures 15, 16, 17, 18, 19 and 20. Figures 15 and 16 reflect the performance with varying transaction request rate for Smallbank and Donothing, respectively. Figures 17 and 18 show the performance on increasing number of nodes. Figures 19 and 20 show the performance of Smallbank and Donothing, respectively, on heterogeneous clusters, as discussed in Section 5.2.

[Figure 10 (plots): memory used [MB] over the timeline [s] of Hyperledger on RP3, with and without explicit GC invocation (swappiness 60 and 10), across the CPUHeavy deploy, IOHeavy deploy and CPUHeavy phases; (a) Run 1, (b) Run 2, (c) Run 3]
[Figure 11 (bar chart): execution time [s] and power [W] across five runs of CPUHeavy]
Figure 11: Difference in Ethereum CPUHeavy execution

Table 7: Number of times transactions are restarted (applied) in two versions of Ethereum, on a cluster setup with varying request rate (columns correspond to increasing request rates)

v1.4.18:
  Min        8          1          1          1          1          1
  Max        71         59         92         106        105        143
  Average    29.7       27.8       22.2       19.1       14.9       13.0
  Std.dev.   9.8        7.6        14.4       14.6       14.3       11.8
  Unique     19,531     38,832     74,535     122,500    125,093    126,455
  Total      580,507    1,080,921  1,653,870  2,342,661  1,863,023  1,641,687

v1.8.15:
  Min        1          1          1          1          1          1
  Max        528        385        497        246        183        114
  Average    21.0       18.4       20.8       18.8       19.4       21.1
  Std.dev.   17.4       14.9       16.3       14.5       14.5       18.4
  Unique     10,935     12,289     10,713     10,957     10,826     10,725
  Total      229,754    226,488    222,503    206,357    209,560    226,455
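The per-transaction statistics behind Table 7 can be summarized as below. This is our own sketch of the tallying, with a hypothetical applyCounts map keyed by transaction hash; it is not geth's actual instrumentation.

```go
package main

import "fmt"

// applyCounts maps a transaction hash to the number of times
// ApplyTransaction was invoked for it.
type applyCounts map[string]int

// summarize returns the min, max and average apply count, the number
// of unique transactions, and the total number of ApplyTransaction
// calls, matching the rows of Table 7 (except the standard deviation).
func summarize(c applyCounts) (min, max int, avg float64, unique, total int) {
	first := true
	for _, n := range c {
		if first {
			min, max = n, n
			first = false
		}
		if n < min {
			min = n
		}
		if n > max {
			max = n
		}
		total += n
	}
	unique = len(c)
	if unique > 0 {
		avg = float64(total) / float64(unique)
	}
	return
}

func main() {
	// hypothetical counts for three transactions
	counts := applyCounts{"0xaa": 1, "0xbb": 3, "0xcc": 8}
	min, max, avg, unique, total := summarize(counts)
	fmt.Println(min, max, avg, unique, total) // prints: 1 8 4 3 12
}
```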
[Figure 12 (plots): execution time, power usage and total energy of the CPUHeavy and IOHeavy workloads on Hyperledger (Xeon, TX2, RP3)]

[Figure 13 (plots): execution time, power usage and total energy of the CPUHeavy and IOHeavy workloads on Ethereum]

[Figure 14 (plots): execution time, power usage and total energy of the CPUHeavy and IOHeavy workloads on Parity (Xeon, NUC, TX2, RP3)]
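For reference, the energy and PPR metrics reported in these figures and in Table 5 derive from the measured time and power, with energy approximated as average power multiplied by execution time. The sketch below uses illustrative numbers only, not measurements from the paper.

```go
package main

import "fmt"

// energy approximates the consumed energy in joules from the
// average power (watts) and the execution time (seconds).
func energy(watts, seconds float64) float64 {
	return watts * seconds
}

// ppr is the performance-to-power ratio in operations per joule.
func ppr(ops, watts, seconds float64) float64 {
	return ops / energy(watts, seconds)
}

func main() {
	// illustrative: 1M operations at an average of 50W for 60s
	fmt.Printf("energy: %.0f J\n", energy(50, 60))    // energy: 3000 J
	fmt.Printf("PPR: %.1f ops/J\n", ppr(1e6, 50, 60)) // PPR: 333.3 ops/J
}
```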
[Plots: throughput, latency and power usage vs. transaction rate, for Hyperledger, Ethereum and Parity on Xeon and TX2]
Figure 15: The performance of Smallbank benchmark with increasing transaction rate

[Plots: throughput, latency and power usage vs. transaction rate, for Hyperledger, Ethereum and Parity on Xeon and TX2]
Figure 16: The performance of Donothing benchmark with increasing transaction rate

[Plots: (a) Throughput, (b) Latency, (c) Power vs. number of nodes, for Hyperledger, Ethereum and Parity on Xeon and TX2]
Figure 17: The performance of Smallbank benchmark with increasing number of nodes
[Plots: (a) Throughput, (b) Latency, (c) Power vs. number of nodes, for Hyperledger, Ethereum and Parity on Xeon and TX2]
Figure 18: The performance of Donothing benchmark with increasing number of nodes

[Figure 19 (bar charts): throughput and power of Smallbank on heterogeneous clusters; (a) Hyperledger, (b) Parity]

[Figure 20 (bar charts): throughput and power of Donothing on heterogeneous clusters; (a) Hyperledger, (b) Parity]