CSD 20.5 4G Monitoring and Troubleshooting Guide
Release 20.5
This document is intended for use by Nokia's customers (“You”) only, and it may not be used except for
the purposes defined in the agreement between You and Nokia (“Agreement”) under which this document
is distributed. No part of this document may be used, copied, reproduced, modified or transmitted in any
form or means without the prior written permission of Nokia. If you have not entered into an Agreement
applicable to the Product, or if that Agreement has expired or has been terminated, You may not use
this document in any manner and You are obliged to return it to Nokia and destroy or delete any copies
thereof.
The document has been prepared to be used by professional and properly trained personnel, and
You assume full responsibility when using it. Nokia welcomes Your comments as part of the process of
continuous development and improvement of the documentation.
This document and its contents are provided as a convenience to You. Any information or statements
concerning the suitability, capacity, fitness for purpose or performance of the Product are given solely
on an “as is” and “as available” basis in this document, and Nokia reserves the right to change any such
information and statements without notice. Nokia has made all reasonable efforts to ensure that the
content of this document is adequate and free of material errors and omissions, and Nokia will correct
errors that You identify in this document. But, Nokia's total liability for any errors in the document is strictly
limited to the correction of such error(s). Nokia does not warrant that the use of the software in the Product
will be uninterrupted or error-free.
NO WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING BUT
NOT LIMITED TO ANY WARRANTY OF AVAILABILITY, ACCURACY, RELIABILITY,
TITLE, NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR
PURPOSE, IS MADE IN RELATION TO THE CONTENT OF THIS DOCUMENT. IN NO
EVENT WILL NOKIA BE LIABLE FOR ANY DAMAGES, INCLUDING BUT NOT LIMITED
TO SPECIAL, DIRECT, INDIRECT, INCIDENTAL OR CONSEQUENTIAL OR ANY
LOSSES, SUCH AS BUT NOT LIMITED TO LOSS OF PROFIT, REVENUE, BUSINESS
INTERRUPTION, BUSINESS OPPORTUNITY OR DATA THAT MAY ARISE FROM THE
USE OF THIS DOCUMENT OR THE INFORMATION IN IT, EVEN IN THE CASE OF
ERRORS IN OR OMISSIONS FROM THIS DOCUMENT OR ITS CONTENT.
This document is Nokia's proprietary and confidential information, which may not be distributed or disclosed
to any third parties without the prior written consent of Nokia.
Nokia is a registered trademark of Nokia Corporation. Other product names mentioned in this document
may be trademarks of their respective owners, and they are mentioned for identification purposes only.
This product may present safety risks due to laser, electricity, heat, and other sources of danger.
Only trained and qualified personnel may install, operate, maintain or otherwise handle this product and
only after having carefully read the safety information applicable to this product.
The safety information is provided in the Safety Information section in the “Legal, Safety and
Environmental Information” part of this document or documentation set.
Nokia is continually striving to reduce the adverse environmental effects of its products and services. We
would like to encourage you as our customers and users to join us in working towards a cleaner, safer
environment. Please recycle product packaging and follow the recommendations for power use and proper
disposal of our products and their components.
If you should have questions regarding our Environmental Policy or any of the environmental services we
offer, please contact us at Nokia for any additional information.
Monitoring and Troubleshooting Guide
Contents
1 About this document
1.1 Reason for new issue
1.2 Intended Audience
1.3 Conventions used
1.4 Related documentation
1.5 Document support
1.6 Technical support
1.7 How to order
1.8 How to comment
2 Troubleshooting
2.1 Troubleshooting CBAM
2.2 Troubleshooting during Installation
2.3 Troubleshooting during Upgrade
2.4 Troubleshooting Post Upgrade
2.4.1 Troubleshooting during LCM operations
2.5 Troubleshoot Service Manager
2.5.1 SM GUI failure scenarios
2.5.2 ME import failure scenarios
2.6 Troubleshooting CSD
2.6.1 Application issues and errors
2.6.2 To troubleshoot healthMachines or vnfcMapList
2.6.3 To troubleshoot issues related to CSD application processes
2.6.4 To troubleshoot an interface connection
2.6.5 To troubleshoot Diameter peer and route issues
2.6.6 To manually disconnect a peer
2.6.7 Routing and Peering
2.6.8 Rules Engine
2.6.9 Errors seen during CBAM operations (scale-in, scale-out, rebuild)
2.6.10 Compute-Host Restart
2.6.10.1 To stop or start VMs during maintenance window
2.6.11 Error: VNF package upload failed, Repository for create exists already: repository: ddebvnf
2.6.12 Graceful reboot on redundant nodes
2.6.13 Redundant nodes become MASTER at the same time
2.6.14 To troubleshoot ETCD Cluster Service Degradation alarm
2.6.15 Clearing prometheus data
2.6.16 Troubleshooting call failures
2.6.17 Checking for memory leaks in ASR process
2.7 Troubleshooting Grafana
2.8 Troubleshooting CSD Analytics
2.9 CSD error messages
2.10 Tracing and Debugging
2.11 CSD remote logging with rsyslog
4 Alarms
4.1 CSD alarms
4.2 SM Application alarms
4.3 Platform alarm details for CSD and SM
6 Network Troubleshooting
7.2.17 Improper disk utilization of all the VMs during OAM node reboot
7.2.18 External SLF timeout value
7.2.19 Changes in SCTP Association Profile
7.2.20 Changes in Peer connection status dashboard
7.2.21 Changes in CSD and SM internal communication matrices
7.2.22 Database internal IP address
7.2.23 VMware Changes
7.2.24 Changes in Diameter Peer form
7.2.25 Changes in Peer Management Dashboard
7.2.26 Changes in SLF Identity to server pool form
7.2.27 Changes in SLF Lookup Table Form
7.2.28 SCTP link overload handling mechanism is introduced
7.2.29 Changes in Grafana
7.2.30 Changes in output of listDiameterPeers.sh
7.2.31 Change in alarm log storage path
7.2.32 Change in alarm structure
7.2.33 Changes in Peer disconnect detected alarm
7.2.34 Changes in Brevity control for Peer disconnect detected alarm
7.2.35 MariaDB and Zabbix alarms are deprecated
7.3 Troubleshooting and debugging commands
• Minor updates on Troubleshooting and debugging commands.
• Changes in message filter criteria: lists all the unsupported criteria removed from the message filters in the SM GUI.
- Provisioned
- Partial Provision
- Failed
- New
- Delete Attempted
• Service Provisioners - people who provision the CSD data using the Service Manager.
• Service Administrators - people who monitor and troubleshoot the CSD.
• Field Support Personnel - people who will install the CSD.
graphical user interface text: Text that is displayed in a graphical user interface
• CSD - Release 20.5 Installation and Upgrade Operations Guide for Cloud Deployments
• CSD - Release 20.5 User Guide
2 Troubleshooting
This chapter describes how to manage logging functions and troubleshoot common problems that
arise in the CSD.
CBAM troubleshooting
For details on CBAM troubleshooting, refer to the CloudBand Application Manager CBAM 19.5 SP1
(v19.5.1), CloudBand Application Manager Troubleshooting Guide, DN09247262, in Discovery
Center.
Ensure that the NTP servers are configured and reachable in respective CSAR packages.
Problem: Service unavailable alarm is notified after instantiation.
Solution: Verify that the service is already up on the relevant VM. If the status is yes, then clear the alarm manually.
Problem: Upon stack-creation failure, the graceful termination fails with the following error:
• Status: 500
• Detail: 'NoneType' object is not iterable
Solution: This occurs only when stack creation fails completely. Use only forceful termination after a stack-creation failure.
Problem: Scaling of the VNF fails with the following error:
WorkflowHeatOperationError: Stack operation error! Stack id: 33909b8e-4ddd-4070-b67b-2976fa6cf8ca, expected status: COMPLETE, actual
Solution: Add the required resources to the cloud.
id9YZ-09148-MT11-PCZZA © 2020 Nokia 10
2.0
Problem: CBAM throws the following error during instantiation of the VNF:
ERROR: AuthorizationFailure: Authorization Failed: SSL exception connecting to https://10.193.134.100:13000/v2.0/tokens: ("bad handshake: Error([('SSL routines', 'ssl3_get_server_certificate', 'certificate verify failed')],)",)
Solution: Follow the steps listed here to troubleshoot the issue:
1. Log in to CBAM as root.
2. Copy the certificate of the vlab to /etc/pki/ca-trust/source/anchors/.
3. Run the command update-ca-trust.
Problem: Installation or upgrade fails with the following error:
*UNREACHABLE! => {\"changed\": false, \"msg\": \"Failed to connect to the host via ssh.\", \"unreachable\": true}\nfatal: [41dff769-3f6c-4dcd-aac4-d5e4f61ce587]: *
Solution: Set retries=3 in the CBAM ansible.cfg configuration file and perform the operation again. This enables CBAM to try a maximum of three times before declaring the ansible operation as failed. For example, the following is a sample configuration:
[defaults]
#remote_tmp = /tmp/ansible_tmp
# gathering = explicit
retries=3
The Imagesu operation in CBAM fails due to the following underlying infrastructure and connectivity errors with CBAM:
Problem: JavaScriptAction failed error:
JavaScriptAction failed: Invalid node not part of VNF!!!!
Solution: Ensure that the set of OpenStack and VMware instances are separated by a single space.
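The retries=3 setting described above can also be applied programmatically. The following is a minimal sketch, assuming the file is a standard INI file with a [defaults] section as in the sample; the function name set_retries is hypothetical and is not part of CBAM. Python's configparser drops comment lines when it rewrites a file, so treat the result as a draft and compare it with the original before replacing anything.

```python
import configparser
import io


def set_retries(cfg_text: str, retries: int = 3) -> str:
    """Return INI text with 'retries' set in the [defaults] section.

    Hypothetical helper: the section and key names follow the sample in
    this guide. Comment lines in the input are not preserved.
    """
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    if not parser.has_section("defaults"):
        parser.add_section("defaults")
    parser.set("defaults", "retries", str(retries))
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


# Sample input modelled on the ansible.cfg fragment shown in this guide.
sample = "[defaults]\n#remote_tmp = /tmp/ansible_tmp\n# gathering = explicit\n"
print(set_retries(sample))
```

Working on the text rather than on the file itself keeps the sketch side-effect free; writing the result back to the CBAM configuration is left as a manual step.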
Order of command execution during failure of LCM operation due to cluster unavailability
source /etc/etcd/etcd.conf
To remove a member:
To add a member:
source /etc/etcd/etcd.client.conf
etcdctl --endpoints ${ETCDCTL_ENDPOINT} cluster-health
3. Remove the unreachable member, using the member ID identified previously, by running the following
command.
If no member is the leader, wait until one member becomes the leader.
5. Scale-out operation.
If a cluster is healthy and one member is in a leader state, then execute the scale-out operation.
1. Log in to the unreachable member node and add the member to the cluster.
source /etc/etcd/etcd.client.conf
source /etc/etcd/etcd.conf
etcdctl --endpoints ${ETCDCTL_ENDPOINT} member add ${ETCD_NAME} ${ETCD_LISTEN_PEER_URLS}
2. Remove the etcd data.
rm -rf ${ETCD_DATA_DIR}/*
3. In the /etc/etcd/etcd.conf file, modify the cluster state from new to existing.
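Identifying the unreachable member ID from the cluster-health output can be scripted. The following is a minimal sketch, assuming the output uses lines of the form "member <id> is healthy|unhealthy|unreachable", as the etcd v2 etcdctl prints them; the sample text, member IDs and addresses are illustrative and were not captured from a CSD system. The sketch only prints candidate commands and does not run them.

```python
import re
from typing import List

# Matches per-member status lines of `etcdctl cluster-health` (assumed format).
_MEMBER = re.compile(r"^member (\w+) is (healthy|unhealthy|unreachable)\b")


def unreachable_members(cluster_health_output: str) -> List[str]:
    """Return the IDs of members that are not reported as healthy."""
    ids = []
    for line in cluster_health_output.splitlines():
        match = _MEMBER.match(line.strip())
        if match and match.group(2) != "healthy":
            ids.append(match.group(1))
    return ids


# Illustrative output; IDs and addresses are made up for this example.
sample = """\
member 8211f1d0f64f3269 is healthy: got healthy result from http://192.168.3.10:2379
member 91bc3c398fb3c146 is unreachable: [http://192.168.3.11:2379] are all unreachable
member fd422379fda50e48 is healthy: got healthy result from http://192.168.3.12:2379
cluster is degraded
"""

for member_id in unreachable_members(sample):
    # Print the removal command for review instead of executing it.
    print(f"etcdctl --endpoints ${{ETCDCTL_ENDPOINT}} member remove {member_id}")
```

Printing the command keeps a person in the loop for the removal step, which matters because removing the wrong member can cost the cluster its quorum.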
• When the Error! Couldn't fetch from SM error is thrown in the SM GUI.
Solution: Identify the reason for outage of database and execute the following command on OAM
node as a root user.
AsControl start db
Note: If fall back operation is performed after upgrading only database nodes, you need to
stop or start the standby OAM node in the VNF.
1. The import operation fails if any primary field information is missing or if incorrect field
information is given in the ME import request.
2. The import operation fails if a duplicate provisioning ME record exists and the SM user does not
select the Overwrite option during the ME import.
3. The import operation fails if the associated ME profile does not exist in SM.
Note:
/opt/tpa/logs/ImportExport.log
• If distribution on the fly fails, collect and examine the logs at the following file path.
/opt/tpa/logs/Provisioning.log
/opt/tpa/logs/DESMApplication.log
Note:
The features in the SM GUI Configuration > Diameter Peering > Request Timeout Profile
and KPI Statistics are applicable from release 18.8 onwards and are not intended for MEs of
earlier versions.
Refrain from provisioning them to earlier-version MEs:
• If a profile has been provisioned to earlier-version MEs, recover it by cloning it and then
deleting the erroneous profile.
• If the profile deletion fails, unprovision the profile from the earlier-version MEs and then delete it.
1. Ping the floating IP address or the IP address of the IO node which listens for diameter from
the peer. If the ping is not successful, then check the neutron port status in the VIM and rectify
the same. Additionally, check the external routes on the respective node.
2. Ping the IP address of the IO node from the peer again. When the ping succeeds, establish the
Diameter connection over TCP/SCTP.
Solution: Ping both the primary and secondary IP addresses. If the ping succeeds for the
secondary IP address and fails for the primary IP address, then update the secondary IP
address in the Remote Primary IP / Hostname * field of the Diameter Peer Form and then
redistribute the Diameter Peer Form.
Note: If an outbound SCTP multi-homed peer is connected, the remote primary and
secondary interfaces go down, and only the secondary interface comes up, then the
SCTP connection will not be established.
a) Check the error message sent in the response. If it is unable to figure out how to
route…, then the issue is with the peers. Take the DDESMApplication.log from the APP node and a pcap,
and then verify the connectivity and the AVPs in the message.
b) Log in to the Service Manager GUI and check the Dashboard tab for Peer connectivity. Verify
whether all the peers are in connected state. If not, check the connectivity between the peer(s)
and ensure that all the peers required to process the request are connected.
c) If the peer to which the request is to be sent is not connected and the request has the Destination-
Host, then the error is received.
d) If the Destination-Host does not exist, then check the configurations on SM.
1. Check the message filter configuration: whether the criteria are correct, whether the action is
triggered in the routing profile, and so on.
2. If routing is done with SLF and the lookup failed, check for SLF lookup failures in the
countable events of SM and check for flow counters file in CSD active OAM in path
/opt/tpa/logs/.
a. If CSD sends Sh-UDR to the external server and Sh-UDA is not received from the external
HSS server because of a public-identity value mismatch, the countable event
SLF_LOOKUP_FAILURE is triggered for SLF external and the lookup fails.
b. If there is a mismatch between the Identity value configured in the Identity To Serverpool
form and the Identity value sent by the client, the countable event SLF_LOOKUP_FAILURE is
triggered for SLF internal and the lookup fails.
SLF_UNREACHABLE: If CSD sends a UDR to the external server and a UDA is received from the
external HSS server with the Result-Code DIAMETER_UNABLE_TO_DELIVER (3002), the
countable event SLF_UNREACHABLE is triggered.
SLF_INDIRECT_MAPPING_FAILURE: If there is a mismatch between the mapping key for indirect
mapping extracted from the returned UDA and the key configured in Simple Map and Set,
SLF_INDIRECT_MAPPING_FAILURE is triggered.
1. Check whether all three etcd instances (oam-0, oam-1, db-0) are running by using
the command systemctl status etcd.
2. Start all the instances using the command systemctl start etcd.
healthMachines provides information about the following processes. Other information related to the
processes, such as PID and start time, can be fetched using the following commands:
For DB node:
Note: The diameteragent.cfg file is not automatically replicated. Update the file on both
the active and standby OAM.
Note: If there is more than one network interface on the CSD, specify the required
source interface for the ping command using the -i parameter.
3. Verify that the realm and host name values on the other system match the CSD's realm and
host name values in the following lines:
diameteragent.originRealm = realm
diameteragent.originHost =
By default, CSD uses the primary IP address of the primary network interface to communicate
with other systems. The peer IP address can return a single IP address or a list of IP
addresses that are configured for Diameter to listen on and to return in CEA messages. The client
source IP address can likewise return a single IP address or a list of IP addresses that are
configured for Diameter to listen on and to return in CEA messages. The list of IP addresses that can
be returned is limited to the addresses configured for the diameteragent.client.src
properties in the diameteragent.cfg file.
5. Verify that the local network and firewall configurations are correct.
7. If the other system is upstream, verify that the other system is correctly
configured to communicate with the CSD.
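The realm and host name comparison in step 3 can be sketched as follows. This is a minimal example, assuming diameteragent.cfg uses plain key = value lines as in the excerpt; the realm and host values and the helper names are hypothetical.

```python
def read_properties(text: str) -> dict:
    """Parse simple 'key = value' lines, skipping blanks and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def identity_mismatches(cfg_text: str, expected_realm: str, expected_host: str) -> dict:
    """Return {property: (configured, expected)} for values that differ."""
    props = read_properties(cfg_text)
    expected = {
        "diameteragent.originRealm": expected_realm,
        "diameteragent.originHost": expected_host,
    }
    return {
        key: (props.get(key), want)
        for key, want in expected.items()
        if props.get(key) != want
    }


# Hypothetical values; replace with the realm and host the peer expects.
sample = (
    "diameteragent.originRealm = example.realm\n"
    "diameteragent.originHost = csd01.example.realm\n"
)
print(identity_mismatches(sample, "example.realm", "csd02.example.realm"))
```

An empty result means both properties match what the other system expects. Because the file is not replicated automatically, the same check would need to be run against the copy on each OAM node.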
To troubleshoot further, take a tcpdump on the IO node on the specific ports. Also, ensure that the
port is included in the pcap.
$ list-diameter-peers
The system displays information of the local system, followed by any connected peers. For
example:
E2E4 (16777231)
Vendor-specific applications: Gxx (16777266)
S9 (16777267)
Sy (16777302)
SWx (16777265)
Sd (16777303)
Rx (16777236)
Ro / Gy / CreditControl (4) Gx (16777238)
ICCSy (111)
Sh (16777217)
S6a/S6d (16777251)
Cx/Dx (16777216)
Rq (16777222)
S13/S13prime (16777252)
E2E4 (16777231)
Origin-Host: clientHost163.clientRealm163
Origin-Realm: clientRealm163 Connected: true
Connected Address: 135.121.114.188 Quarantined: false
Origin-State-Id: 2
Hostname: 135.121.114.188
Port: 56788 Protocol: TCP IP addresses:
IPV4: 135.121.114.188
IPV6: 0:0:0:0:0:0:0:1
IPV6: fe80:0:0:0:5054:ff:fe41:8d60
Firmware rev: 1
Product name: Alcatel-Lucent 5780 DSC Test Client Vendor ID: ALU
Supported Vendor IDs: ALU,3GPP2,CISCOCSG2,VerizonWireless, 3GPP,ETSI
Inband Security IDs: Accounting applications: Authorization
applications:
Gxx (16777266)
NASREQ (1)
S9 (16777267)
Sy (16777302)
BaseAccounting (3)
SWx (16777265)
Sd (16777303)
Rx (16777236)
Ro / Gy / CreditControl (4) Gx (16777238)
ICCSy (111)
Sh (16777217)
S6a/S6d (16777251)
Cx/Dx (16777216)
Rq (16777222)
S13/S13prime (16777252)
E2E4 (16777231)
Vendor-specific applications: Gxx (16777266)
S9 (16777267)
Sy (16777302)
SWx (16777265)
Sd (16777303)
Rx (16777236)
Ro / Gy / CreditControl (4) Gx (16777238)
ICCSy (111)
Sh (16777217)
Routes for server 'APP-Node-002': Diameter routes.
Peer: pcrf01-l-nk-lb04c-gx-vzr.vzimspcrf.com
Destination 'vzimspcrf.com', applications {Gx}, priority 2
Host routes.
No routes are applicable.
Routes for server 'APP-Node-003': Diameter routes.
Peer: pcrf01-l-nk-lb04c-gx-vzr.vzimspcrf.com
Destination 'vzimspcrf.com', applications {Gx}, priority 2
Host routes.
No routes are applicable.
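The peer listing above can be scanned for peers that are not connected. The following is a minimal sketch, assuming each peer block contains an Origin-Host field followed by Connected and Quarantined fields, as in the sample output; the second peer in the sample text is invented to show a disconnected case.

```python
import re

# Picks out only the fields of interest; "Connected Address:" does not match
# because the pattern requires the colon directly after "Connected".
_FIELD = re.compile(r"(Origin-Host|Connected|Quarantined):\s*(\S+)")


def peer_states(output: str) -> list:
    """Return one dict per Origin-Host with its connected/quarantined flags."""
    peers = []
    current = None
    for name, value in _FIELD.findall(output):
        if name == "Origin-Host":
            current = {"origin_host": value, "connected": None, "quarantined": None}
            peers.append(current)
        elif current is not None:
            current[name.lower()] = value == "true"
    return peers


# First block follows the sample in this guide; the second is hypothetical.
sample = """\
Origin-Host: clientHost163.clientRealm163
Origin-Realm: clientRealm163 Connected: true
Connected Address: 135.121.114.188 Quarantined: false
Origin-Host: clientHost164.clientRealm163
Origin-Realm: clientRealm163 Connected: false
"""

disconnected = [p["origin_host"] for p in peer_states(sample) if p["connected"] is False]
print(disconnected)
```

Blocks without a Connected field, such as the local system section, are reported with connected set to None rather than False, so they are not mistaken for disconnected peers.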
5. Select the disconnect cause in the Select Disconnect Cause pop-up window.
6. Click Disconnect.
If routing does not behave as expected, check the realm-based and host-based routing configurations on
CSD. Also take a pcap, verify the contents sent in the request from the inbound peer, and
verify whether they match the routing configuration.
• If the criteria to match is correct and whether the action is triggered (Server pool, SLF).
• If the configuration for routing decision is proper and all the peers are connected.
For RSV rules, validate the rules and ensure that no errors are displayed before saving them.
To provision an RSV, first change it from DRAFT to RELEASE state and then provision it on CSD.
Provisioning an RSV in DRAFT state fails. After the RSV is distributed to
CSD, the user must activate it by clicking Provision on SM GUI and selecting the ME for activation.
Unless the RSV is in ACTIVE state, its rules are not in effect on the CSD and the calls are not
routed as expected.
Rules Engine overwrites the previous routing-decision without checking whether a routing-decision
is already taken, whereas Routing-profile checks whether a routing-decision is already taken and
does not overwrite the previous routing-decision.
if
DiameterAnswer.Origin-Host = "peer2.nsn.com" and
DiameterAnswer.Result-Code = 3004 and
PeerTableContext.Choose-Peer-By-Origin-Host ( "retry-peer.nsn.com" ) = 1
then
DiameterRequest.Route-To-Chosen-Peer
DiameterRequest trigger is configured as,
LoadBalancerContext.Select-Destination-With-Pool-Name = "Server-Pool1"
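The difference between the two behaviours described above (Rules Engine always overwrites the routing decision, Routing-profile keeps an existing one) can be illustrated with a toy model. This sketch is not CSD code; the class and function names are hypothetical and the decision values are borrowed from the rule examples only for illustration.

```python
from typing import Optional


class RoutingContext:
    """Toy stand-in for the per-request routing state."""

    def __init__(self) -> None:
        self.decision: Optional[str] = None


def apply_rules_engine(ctx: RoutingContext, decision: str) -> None:
    # Rules Engine behaviour: overwrite without checking for an earlier decision.
    ctx.decision = decision


def apply_routing_profile(ctx: RoutingContext, decision: str) -> None:
    # Routing-profile behaviour: keep an earlier decision if one exists.
    if ctx.decision is None:
        ctx.decision = decision


ctx = RoutingContext()
apply_rules_engine(ctx, "Server-Pool1")
apply_routing_profile(ctx, "Server-Pool2")   # ignored, a decision already exists
print(ctx.decision)
apply_rules_engine(ctx, "retry-peer.nsn.com")  # replaces the earlier decision
print(ctx.decision)
```

The practical consequence is ordering sensitivity: a rule evaluated late in the Rules Engine can silently replace a decision made earlier, whereas a routing profile can never do so.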
Note:
Error message:
Troubleshooting steps:
"4e9a67e4-408d-4271-8f5d-e57dd8fab691"
| 4e9a67e4-408d-4271-8f5d-e57dd8fab691 | CBAM-501130a1e37a43d281e4264c5e23bc13-db-1 | SHUTOFF | - | Shutdown | CBAM-501130a1e37a43d281e4264c5e23bc13-internal_network=192.168.3.41
2. Boot the VM which is in a shutdown state using the action type hardreboot for HealOne operation,
as shown in the following figure:
Case 2: VNF scale-out operation - scale-out failed (UNREACHABLE to SHUTDOWN VM)
Error message:
Troubleshooting steps:
"4e9a67e4-408d-4271-8f5d-e57dd8fab691"|
| 4e9a67e4-408d-4271-8f5d-e57dd8fab691 | CBAM-501130a1e37a43d281e4264c5e23bc13-db-1 | SHUTOFF | - | Shutdown | CBAM-501130a1e37a43d281e4264c5e23bc13-internal_network=192.168.3.41
2. Boot the VM which is in a shutdown state using the action type hardreboot for HealOne operation,
as shown in the following figure:
Case 3: Rebuilding a VM (A) fails when another VM (B) is in SHUTDOWN state
State failed
Operation ID CBAM-3578d0a7b1014951b5eaaddaded92931
Table 4:
Troubleshooting steps:
"4e9a67e4-408d-4271-8f5d-e57dd8fab691"
| 4e9a67e4-408d-4271-8f5d-e57dd8fab691 | CBAM-501130a1e37a43d281e4264c5e23bc13-db-1 | SHUTOFF | - | Shutdown | CBAM-501130a1e37a43d281e4264c5e23bc13-internal_network=192.168.3.41
3. Boot the VM which is in a shutdown state using the action type hardreboot for HealOne operation,
as shown in the following figure.
Error message:
Troubleshooting method:
In case of VM rebuild failure, check the logs in workflows.log of CloudBand Application Manager
(CBAM). If you observe the connectivity or time-out error, then recover the VM using the following
method.
1. Heal the failed VM again with the action_type as hardreboot and wait for the node to be
accessible.
2. Heal again with action_type as rebuild.
Cause: Scale-out of a VM fails due to insufficient system resources including memory, CPU, and so
on.
Solution: Perform the scale-in operation with the ForceScaleIn field set to yes. This deletes the VM
on which the scale-out operation was performed.
Reason:
If the compute hosts on which the CSD VMs are hosted restart or reboot and come up
successfully, the VMs reach the SHUTOFF state.
Solution or Workaround:
Restart the VMs using openstack-dashboard (or using the nova start command).
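Finding the VMs that need a restart can be scripted from the server listing. The following is a minimal sketch, assuming a pipe-delimited table with the columns ID, Name and Status in that order, as in the rows shown elsewhere in this chapter; the function name is hypothetical. The sketch prints nova start commands for review and does not execute them.

```python
def shutoff_vms(server_list_output: str) -> list:
    """Return (id, name) pairs for rows whose third column is SHUTOFF."""
    vms = []
    for line in server_list_output.splitlines():
        # Drop the outer pipes, then split the row into trimmed cells.
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if len(cells) >= 3 and cells[2] == "SHUTOFF":
            vms.append((cells[0], cells[1]))
    return vms


# Row copied from the example in this guide, with a border and header added.
sample = """\
+----+------+--------+
| ID | Name | Status | Task State | Power State | Networks |
| 4e9a67e4-408d-4271-8f5d-e57dd8fab691 | CBAM-501130a1e37a43d281e4264c5e23bc13-db-1 | SHUTOFF | - | Shutdown | CBAM-501130a1e37a43d281e4264c5e23bc13-internal_network=192.168.3.41 |
"""

for vm_id, name in shutoff_vms(sample):
    print(f"nova start {vm_id}  # {name}")
```

Border and header lines are skipped naturally because their third cell is never the literal SHUTOFF.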
• All IO VMs
• All DB VMs
• All OAM VMs
• All DB VMs
• All OAM VMs
• All App VMs
• All IO VMs
2.6.11 Error: VNF package upload failed, Repository for create exists already: repository: ddebvnf
Reason:
When you use the same CBAM to install two CSD or SM stacks, you encounter the error VNF package
upload failed, Repository for create exists already: repository: ddebvnf.
Solution or Workaround:
Change the descriptor_id in VNFC so that it is different for each of the two stacks.
If the switch over is successful, then proceed to reboot the node with reboot -f command.
3. If the node is in the SLAVE state, then proceed to reboot the node with reboot -f command.
DDE_control become-slave
If the redundant nodes cannot communicate with each other because of network
issues or an abrupt failure of the keepalived service, both nodes may become HA Master. This can
be verified by executing the following command as a root user:
ha role
Recovery action:
Log in to both redundant nodes and check the status of keepalived service using the following
command.
If the keepalived service is inactive or stopped on a node, then reboot that node using the following command.
reboot -f
The following are the scenarios where you encounter an ETCD Cluster Service Degradation alarm.
Scenario: If one out of three etcd instances is down, the …
Action: Check the cluster health status with the etcdctl command.
Result: One out of three instances is down; the output of the etcdctl command prompts the information to the user.
Scenario: If two out of three etcd instances are down, the …
Action: Check the cluster health status with the etcdctl command.
Result: Two out of three instances are down; the output of the etcdctl command prompts the information to the user.
Scenario: If one etcd instance which was down earlier comes up, and now two etcd instances are up and one instance is down.
Action: Check the cluster health status with the etcdctl command.
etcdctl --endpoints http://<ip_address>:<port no> cluster-health
Result: One out of three instances is down; the output of the etcdctl command prompts the information to the user.
Scenario: If one etcd instance comes up, and now all the three etcd instances are up, a CLEAR alarm is raised to clear the MAJOR alarm.
Action: Check the cluster health status with the etcdctl command.
etcdctl --endpoints http://<ip_address>:<port no> cluster-health
Result: All three instances are up; the output of the etcdctl command prompts the information to the user. This indicates that all the three instances are running properly.
NOTICE:
If all the etcd instances are up and running and the cluster is healthy, then no alarm is seen.
But when the ETCD Cluster Service Degradation alarm is raised with CRITICAL severity and
when it returns to normal state, then a MAJOR alarm is seen at NetAct which the operator
The following are the points to be noted while configuring the etcd:
• For any odd-sized cluster, adding one node always increases the number of nodes necessary for
quorum.
• Fast disks are the most critical factor for the performance and stability of an etcd deployment.
A slow disk increases the etcd request latency and can degrade cluster stability, because
a majority of etcd cluster members must write every request to disk.
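The quorum point above can be checked with a little arithmetic. The following is a minimal sketch, assuming the usual majority rule (quorum is more than half of the members); the function names are illustrative and are not part of the product.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting nodes."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def failure_tolerance(members: int) -> int:
    """How many members can fail while a majority is still available."""
    return members - quorum(members)


if __name__ == "__main__":
    # Print cluster size, quorum size and tolerated failures side by side.
    for size in range(1, 8):
        print(size, quorum(size), failure_tolerance(size))
```

Going from 3 to 4 members raises the quorum from 2 to 3 while the number of tolerated failures stays the same, which is why odd cluster sizes are normally preferred.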
This is required when there is not enough cinder volume space on the system for Prometheus.
In the following path, delete the folders that have older time stamps.
/prometheus/data
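Before deleting anything, it can help to list the folders from oldest to newest. The following is a minimal sketch that only lists and never deletes; the helper name oldest_first is hypothetical, and it assumes that the folder modification time is a reasonable proxy for the "older time stamp" mentioned above.

```python
from pathlib import Path


def oldest_first(data_dir):
    """Return the sub-directories of data_dir sorted from oldest to newest mtime.

    Returns an empty list when data_dir does not exist, so the sketch is safe
    to run on a machine without the /prometheus mount.
    """
    root = Path(data_dir)
    if not root.is_dir():
        return []
    dirs = [p for p in root.iterdir() if p.is_dir()]
    return sorted(dirs, key=lambda p: p.stat().st_mtime)


if __name__ == "__main__":
    # Listing only: review the candidates by hand before removing any folder.
    for folder in oldest_first("/prometheus/data"):
        print(folder)
```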
The scrape interval and retention time after installation can be changed in the prometheus.yml in
the following path.
/opt/tpa/statistics/prometheus/
/149_shabana_CSD_Destro/config/kpiconfigs/aerospike_scrape_interval
{"value": "5"}
/149_shabana_CSD_Destro/config/kpiconfigs/flow_scrape_interval
{"value": "10"}
/149_shabana_CSD_Destro/config/kpiconfigs/io_scrape_interval
{"value": "5"}
/149_shabana_CSD_Destro/config/kpiconfigs/jmx_scrape_interval
{"value": "5"}
/149_shabana_CSD_Destro/config/kpiconfigs/latency_scrape_interval
{"value": "5"}
/149_shabana_CSD_Destro/config/kpiconfigs/legacy_scrape_interval
{"value": "5"}
/149_shabana_CSD_Destro/config/kpiconfigs/system_scrape_interval
{"value": "5"}
2. Modify the scrape_interval duration of the intended metrics from the current value to the intended
value using the following command:
For example, if the current scrape_interval value is 5s and the intended scrape_interval
is 10s:
/149_shabana_CSD_Destro/config/kpiconfigs/aerospike_scrape_interval
{"value": "5"}
/149_shabana_CSD_Destro/config/kpiconfigs/flow_scrape_interval
{"value": "10"}
/149_shabana_CSD_Destro/config/kpiconfigs/io_scrape_interval
{"value": "5"}
/149_shabana_CSD_Destro/config/kpiconfigs/jmx_scrape_interval
{"value": "5"}
/149_shabana_CSD_Destro/config/kpiconfigs/latency_scrape_interval
{"value": "5"}
/149_shabana_CSD_Destro/config/kpiconfigs/legacy_scrape_interval
{"value": "5"}
/149_shabana_CSD_Destro/config/kpiconfigs/system_scrape_interval
{"value": "5"}
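The listing above alternates a key path line with a JSON value line. The following is a minimal sketch for turning such a listing into a dictionary, so that the values before and after a change can be compared; the function name is hypothetical, and the sample input is an excerpt of the listing above. The values are taken to be seconds, as in the 5s and 10s example.

```python
import json


def parse_kpi_listing(text):
    """Map the last path component of each key to its integer "value" field."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    intervals = {}
    # Lines come in pairs: a key path followed by its JSON value.
    for key, raw in zip(lines[0::2], lines[1::2]):
        name = key.rsplit("/", 1)[-1]
        intervals[name] = int(json.loads(raw)["value"])
    return intervals


SAMPLE = """\
/149_shabana_CSD_Destro/config/kpiconfigs/aerospike_scrape_interval
{"value": "5"}
/149_shabana_CSD_Destro/config/kpiconfigs/flow_scrape_interval
{"value": "10"}
"""

if __name__ == "__main__":
    print(parse_kpi_listing(SAMPLE))
```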
• prometheus.service
• grafana.service
• alertmanager.service
Note: If alarms related to TDR failure and Prometheus scrape interval are observed during a
heal operation, they can be ignored. The node being healed is not yet up with all the Prometheus
related processes, so scraping can fail. This is expected behavior.
If any call failures are observed with the error code 3002 (UNABLE_TO_DELIVER) originating from
CSD, this indicates that Diameter connection information may be out of sync between the IO and the
Routing nodes. If any Routing node is out of sync, then the traffic processed by that node
is answered with the error code 3002 (UNABLE_TO_DELIVER). The following script is used to identify the
faulty node and take the node out of service for troubleshooting.
The following script is used to check for probable memory leaks in any of the ASR
processes running on the following nodes:
positional arguments:
  {jstat,gc,all}
optional arguments:
  -h, --help            show this help message and exit
  --threshold THRESHOLD
                        How Many Consequtive times OU growth is OK
  --ouFull OUFULL       At what percent OU increase is considered Threatening!
  --gcAllowedDuration GCALLOWEDDURATION
                        GC Duration more than this value will be flagged as RED
  --gcSince GCSINCE     GC is Monitored for last gcSince minutes
  -v, --verbosity
  -s, --scheduled       Scheduled, script is scheduled for running every interval
  --interval INTERVAL   Interval at which a scheduler will run the thread
  -l LOGLEVEL, --loglevel LOGLEVEL
                        Log level
  -a, --alarm           Alarms enabled if flag set
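The --threshold and --ouFull options suggest a check based on consecutive growth of the old generation usage (OU). The following is a minimal sketch of that idea only; it is not the script's actual implementation, and the function names, the default values, and the interpretation of ouFull as a percentage of OC are assumptions.

```python
def consecutive_growth(samples):
    """Length of the longest run of strictly increasing consecutive samples."""
    longest = run = 0
    for prev, cur in zip(samples, samples[1:]):
        run = run + 1 if cur > prev else 0
        longest = max(longest, run)
    return longest


def looks_like_leak(ou_samples, oc, threshold=5, ou_full=90.0):
    """Flag a probable leak from OU samples and the old generation capacity (OC).

    threshold: how many consecutive OU increases are still considered OK (assumed default).
    ou_full:   OU as a percentage of OC that is considered threatening (assumed default).
    oc must be greater than zero.
    """
    grew_too_long = consecutive_growth(ou_samples) > threshold
    nearly_full = bool(ou_samples) and 100.0 * ou_samples[-1] / oc >= ou_full
    return grew_too_long or nearly_full


if __name__ == "__main__":
    # OU values collected at regular intervals, in the same unit as OC.
    print(looks_like_leak([10, 20, 30, 40, 50, 60, 70], oc=1000))
```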
1. Ensure that after installation, the following are up and running on both OAM nodes.
/prometheus is the new mount created during installation. Its size is the size of the cinder volume
specified during installation to store the Prometheus data.
When the advanced metrics profile is attached for a long duration and the generated flow metrics
exceed the configured value of the Metrics Cardinality limit, then during this time, depending on the
generated metrics, the following panels may not display any data.
The RC is displayed as zero when the Diameter Result Code is empty in the collected scrapes as the
sample limit is exceeded.
When the jaeger sampling rate at the APP node is more than 100.
The <trace-without-root-span> is displayed, since the root of the trace is usually the last span
to arrive at the collection tier.
The query service fetches the trace and does not return the root span, since the root trace is yet to
arrive.
By default, Jaeger tracing is auto-enabled for Diameter Unable to deliver (3002) and
Diameter too busy (3004) at DEBUG level. Jaeger tracing gets enabled when there are failures
at APP only. It is recommended to enable Jaeger tracing at DEBUG level for troubleshooting purposes.
Note: If you intend to remove the auto-enabling of Jaeger, then remove the alert from the
alerts.default.csd.20.0.yml.j2 and alerts.default.csd.20.0.yml files
present in /appdata/prometheus/alerts and then restart the prometheus service.
Routing Messages
ROUTING_00001: Invalid routing profile name.
    The routing profile {0} does not exist.
ROUTING_00002: Cannot delete routing profile.
    The routing profile {0} cannot be deleted as it is being used in routing plugin configuration.
SLF Errors
SLF_00600: SLF Lookup Table Name must be unique
    SLF Lookup Table with name {0} already exists.
Diameter Messages
DDM_00001: Duplicate server pool name
    The name {0} is not unique.
DDM_00006: Duplicate pool host FQDN
    Two pool hosts in pool {0} have the same Diameter FQDN (host: {1}, realm: {2}).
DDM_00500: Missing filter group name
    Filter group is missing name for application plugin {0}.
DDM_00502: Filter criterion with no values
    Null or empty value has been set for attribute {0} of context {1} for filter {2} within filter group {3} for application plugin {4}.
DDM_00506: Missing message filter name
    One or more filters for filter group {0} are missing a name for application plugin {1}.
DDM_00517: Failed to convert filter criterion attribute parameter value
    Failed to convert value {0} for parameter {1} of attribute {2} of context {3} for filter {4} within filter group {5} for application plugin {6}.
DDM_00520: Filter criterion with no values
    No values have been set for attribute {0} of context {1} for filter {2} within filter group {3} for application plugin {4}.
DDM_00521: Filter criterion unary operator with values
    Values not permitted for unary operator {0} for attribute {1} of context {2} for filter {3} within filter group {4} for application plugin {5}.
DDM_00525: Non-existent filter action name
    Filter action {0} does not exist within context {1} for filter {2} within filter group {3} for application plugin {4}.
DDM_00531: Invalid filter attribute parameter value
    The value {0} is less than lower bound {1} for parameter {2} of attribute {3} of context {4} for filter {5} within filter group {6} for application plugin{7}.
DDM_00532: Invalid filter attribute parameter value
    The value {0} is greater than upper bound {1} for parameter {2} of attribute {3} of context {4} for filter {5} within filter group {6} for application plugin{7}.
DDM_00533: Invalid filter action parameter value
    The value {0} is less than lower bound {1} for parameter {2} of action {3} of context {4} for filter {5} within filter group {6} for application plugin {7}.
DDM_00534: Invalid filter action parameter value
    The value {0} is greater than upper bound {1} for parameter {2} of action {3} of context {4} for filter {5} within filter group {6} for application plugin {7}.
DDM_00214: Unsupported system type
    Plugin {0} does not support the system type {1}.
DDM_00215: System type not set
    The system type is not set.
DDM_00304: Generic Binding Key Data must be a numeric value when key type is Long.
    Generic Binding Key Data must be a numeric value when key type is Long; {0} is not correct.
DDM_00311: Key already exists, must be unique.
    Key label and key data pair must be unique; key with label [{0}] and data [{1}] already exists.
DDM_00313: Too many results returned during search.
    Too many entries found under the key criteria provided. Returned {0} and maximum allowed during a search is {1}. Please narrow down your search criteria.
COMMON_00201: Interval threshold is less than 1.
    The interval threshold value {0} cannot be less than 1.
COMMON_00202: Max per interval is less than 1.
    The maximum per interval value {0} cannot be less than 1.
Profile Errors
COMMON_00301: Missing name for profile.
    The profile must have a name.
COMMON_00302: Profile name too long.
    The profile name is greater than the maximum length of {0}.
COMMON_00400: Invalid lower threshold value.
    The CPU usage lower threshold {0} is less than {1}.
COMMON_00401: Invalid upper threshold value.
    The CPU usage upper threshold {0} is less than {1}.
COMMON_00403: Invalid lower threshold value
    The CPU usage lower threshold {0} is greater than {1}.
COMMON_00404: Invalid upper threshold value
    The CPU usage upper threshold {0} is greater than {1}.
COMMON_00406: Lower threshold exceeds upper threshold.
    The CPU usage lower threshold {0} is greater than or equal to the upper threshold {1}.
COMMON_00407: Resource threshold is less than or equal to upper threshold.
    The CPU usage upper threshold {0} is greater than or equal to the resource threshold {1}.
DIAMETER_00250: Invalid RTO initial value
    The RTO initial value {0} is not in the valid range of {1} to {2}.
DIAMETER_00251: Invalid RTO min value
    The RTO min value {0} is not in the valid range of {1} to {2}.
DIAMETER_00252: Invalid RTO max value
    The RTO max value {0} is not in the valid range of {1} to {2}.
DIAMETER_00253: Invalid max burst value
    The max burst value {0} is not in the valid range of {1} to {2}.
DIAMETER_00254: Invalid cookie life value
    The valid cookie life {0} is not in the valid range of {1} to {2}.
DIAMETER_00257: Invalid Max Init Retransmission value
    The Max Init Retransmission {0} is not in valid range of {1} to {2}.
DIAMETER_00258: Invalid Heartbeat Interval value
    The Heartbeat Interval {0} is not in valid range of {1} to {2}.
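The {0}, {1}, {2} markers in the descriptions above are positional placeholders that are replaced with actual values when a message is raised. The following is a minimal sketch of how to read such a template; Python's str.format is used here only as a stand-in for whatever mechanism the product uses, and the argument values are invented for illustration.

```python
def render(template, *args):
    """Substitute positional placeholders {0}, {1}, ... with the given values."""
    return template.format(*args)


# Description text of DDM_00006 as listed above.
DDM_00006 = ("Two pool hosts in pool {0} have the same Diameter FQDN "
             "(host: {1}, realm: {2}).")

if __name__ == "__main__":
    # Hypothetical pool, host and realm names.
    print(render(DDM_00006, "pool-a", "hss1", "example.com"))
```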
The following section describes the error messages for sub-component Throttling profile:
The following section describes error messages for sub-component DownstreamOverload Profile:
The following section describes the error messages for sub-component Peerconnection Profile:
The following section describes the error messages for sub-component SCTP Association Profile
The following section describes the error messages for sub-component RequestTimeout Profile:
The following section describes the error messages for sub-component KPI Statistics Profile:
The following section describes the error messages for sub-component Routing Profile:
The following section describes the error messages for sub-component Realm Based Routes:
Please refer :
/opt/tpa/logs/Provisioning.log
Please refer :
/opt/tpa/logs/Provisioning.log
records
DDESMApplication.log for
more details.
records
DDESMApplication.log for
more details.
The following section describes the error messages for sub-component Host Based Routes:
Please refer :
/opt/tpa/logs/Provisioning.log
Please refer :
/opt/tpa/logs/Provisioning.log
selected ME {0}.
records \
DDESMApplication.log for
more details.
records
DDESMApplication.log for
more details.
The following section describes the error messages for sub-component SS7 Destination:
Not applicable: There are more than {0} records configured for SS7 Identity To Destination List. Please filter based on Subscriber Id only.
    There are more than 10000 records configured for SS7 Identity To Destination List. Please filter based on Subscriber Id only.
The following section describes the error messages for sub-component SS7 Peer Configuration:
The following section describes the error messages for subcomponent SS7 System Profile:
MSAPI_00106: Null or Empty Record not allowed.
    Null or Empty Record for SS7 System Profile : {1} is not allowed for {0}.
The following section describes the error messages for subcomponent SS7 IWF Configuration:
MSAPI_00106: Null or Empty Record not allowed.
    Null or Empty Record for SS7 System Profile : {1} is not allowed for {0}.
SCCPADDRESS Of
SS7IWFConfig.
The following section describes the error messages for Topology hiding configuration:
The following section describes the error messages for sub-component Topology Hiding
Configuration:
The following section describes the error messages for sub-component Home Network Identity:
The following section describes the error messages for Diameter Peer:
DIAMETER_00020: Invalid port number.
    The value {0} is not in the valid port number range of {1} to {2}.
MSAPI_00069 Invalid INGRESS peer profile Not a valid ingress peer profile
type {0}
MSAPI_00068 Invalid EGRESS peer profile Not a valid egress peer profile
type {0}
The following section describes the error messages for Configuration Entities:
The following section describes the error messages for sub-component Throttling Configuration:
MSAPI_00059 Invalid profile type {0} is not a valid {1} profile type
The following section describes the error messages for sub-component Downstream Overload
Configuration:
The following section describes the error messages for sub-component Import functionality:
The following section describes the error messages for sub-component Export functionality:
Rule Errors
RULE_00089: Rule table "{0}" does not belong to correct rule sets
    Error Reference Number: RULE_00089. The rule table "{0}" is referenced by rule table "{1}", so rule table "{0}" must include at least the rule sets of rule table "{1}".
RULE_00034: Action value "{0}" is too large for attribute "{1}"
    Error Reference Number: RULE_00034. The value for the action is too large for the attribute.
RULE_00190: The value type "{0}" is not valid for the Concatenate adjustment operator
    Error Reference Number: RULE_00190. For the Concatenate adjustment operator, the attribute must be a type supported by lists, for example String.
RULE_00201: Rule table has invalid action for its associated RuleSets.
    Error Reference Number: RULE_00201. Rule table "{0}": Rule "{1}" with action "{2}" is not applicable for RuleSets "{3}". A rule contained in rule table has an action that is not applicable to the associated rule sets on this rule table. All the conflicted rule sets are listed. Modify the actions in this rule table or the selected rule sets to make the rule table applicable.
Pre-emption Capability or
Pre-emption Vulnerability is
specified.
opt/tpa/logs/DDESMApplication.
log) for more information.
RULE_00390: Failed to load TAC data file "{0}"
    Error Reference Number: RULE_00390. Validation error occurred loading the TAC data file. See the system log (/opt/tpa/logs/DDESMApplication.log) for more information.
GV_00009: Lists must have at least one value.
    Error Reference Number: GV_00009. Lists must have at least one value.
RULEAPI_ERR_0001: No entry found for Rule System Version
    No entry found for Rule System Version. Please check the requested name ({0})
RULEAPI_ERR_0005: No entry found for Rule System Version
    No entry found for Rule System Version.
RULEAPI_ERR_0011: Please check the media type of the request, supported media type is JSON
    Please check the media type of the request, supported media type is JSON
RULEAPI_ERR_0022: No entry found for Rule Group
    No entry found for Rule Group ({0}) in Rule System Version ({1})
RULEAPI_ERR_0023: Rule Group with the same name already exists
    Rule Group ({0}) with the same name already exists Rule System Version ({1})
RULEAPI_ERR_0024: Mismatched enum types in list.
    A list of enum values may only contain values from a single enum type.
RULEAPI_ERR_0029: RSV cannot be deleted from {0} because {1} is/are offline
    RSV cannot be deleted from {0} because {1} is/are offline
The following section describes the error messages for Import or Export:
IMPORT_00011: Error parsing rule list for rule table {0}.
    Error Reference Number: IMPORT_00011. Cannot parse list of rules for the rule table.
IMPORT_00016: Error parsing action list for rule {0}.
    Error Reference Number: IMPORT_00016. Cannot parse list of actions for the rule.
IMPORT_00071: Error parsing rule for rule table {0}.
    Error Reference Number: IMPORT_00071. Parsing of a rule failed with the following error: {1}.
IMPORT_00073: Error parsing action for rule {0}.
    Error Reference Number: IMPORT_00073. Parsing of an action failed with the following error: {1}.
The following section describes the error messages for Other functionalities:
The following section describes the error messages for sub-component User management:
The following section describes the error messages for sub-component Role Management:
The following section describes the error messages for sub-component SM system configuration:
The following section describes the error messages for subcomponent ME System Configuration
The following section describes the error messages for sub-component Diameter Routing Quality:
The following section describes the error messages for sub-component SLF Lookup Table:
The following section describes the error messages for Identity To Server Pool:
SLF Configuration
The following section describes the error messages for SLF configuration:
The following section describes the error messages for sub-component Server pool:
The following section describes the error messages for sub-component Map and Set:
Not applicable Duplicate value in list Value {0} is duplicated in the list
The following section describes the error messages for sub-component Local diameter configuration:
The following section describes the error messages for Countable event configuration
Table 39:
Database Errors
Provisioning Errors
System Parameters
Do not enable Trace or Debug on a live node that is handling heavy traffic.
Share the following details with the services or support team offline.
/opt/tpa/bin/DDE_logLevel com.nokia.dde.ddm.common.plugins.PluginMessageServicesAndState DEBUG
/opt/tpa/bin/DDE_logLevel com.nokia.dde.diameter.common.facade.jdiameter.RouteManagerBase DEBUG
The following table lists the packages for identifying more information on the logs according to
scenarios.
Provisioning PROVLOG SM
For example, the following is the sample command format when performing an import or export
operation.
• INFO
• DEBUG
• TRACE
The following packages are enabled to test the routing failure scenarios.
/opt/tpa/bin/DDE_logLevel com.nokia.dde.ddm.common.plugins.PluginMessageServicesAndState DEBUG
/opt/tpa/bin/DDE_logLevel com.nokia.dde.diameter.common.facade.jdiameter.RouteManagerBase DEBUG
Note: During trace collection, use valid package or class names when setting the log level,
because the system does not display any error when an invalid package or class name is specified.
1. Update the /etc/rsyslog.conf file by uncommenting the TCP port line and add the following
template to store the logs.
3. Verify if the rsyslog service is started on the configured port using the following command.
Perform the following steps on both the OAM VMs of CSD or Service Manager:
The following is the sample output after adding the remote server details.
*.* @@10.75.105.123:514
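In rsyslog forwarding rules, a double @@ before the target selects TCP and a single @ selects UDP, so the sample line above forwards all messages over TCP to port 514. The following is a minimal sketch for checking such a line programmatically; the function name is hypothetical, and it handles only the simple selector, host and port form shown above.

```python
import re

# selector, one or two "@", host, optional ":port"
_FORWARD = re.compile(
    r"^(?P<selector>\S+)\s+(?P<at>@{1,2})(?P<host>[^@:\s]+)(?::(?P<port>\d+))?$"
)


def parse_forward_rule(line):
    """Split a simple rsyslog forwarding rule into its parts."""
    match = _FORWARD.match(line.strip())
    if match is None:
        raise ValueError("not a simple forwarding rule: %r" % line)
    return {
        "selector": match["selector"],
        "protocol": "tcp" if match["at"] == "@@" else "udp",
        "host": match["host"],
        # rsyslog falls back to port 514 when no port is given.
        "port": int(match["port"]) if match["port"] else 514,
    }


if __name__ == "__main__":
    print(parse_forward_rule("*.* @@10.75.105.123:514"))
```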
Monitoring
After the VNF is instantiated, monitor the stack creation status through the orchestrator (heat stack-list)
and monitor the VM status on either the dashboard or the orchestrator (nova list). To monitor the installation
status or progress, get the VNC console of the VM and open it in a browser:
“nova list; nova get-vnc-console <ID of VM> novnc”.
Paste the URL in the browser to open the VM’s console and log in as the root user to access the logs. All the
installation logs are stored in the path /opt/config/log.
1. /var/log/cbam/<logs> (all.log, workflow.log and other logs on CBAM node) - to check the
workflow and installation traces.
2. /opt/tpa/logs/ - on individual nodes to check the system status.
Also check /var/log/messages and verify whether the internal IP assigned by DHCP is
configured (eth0 interface, 192.x.x.x IP range). If the internal IP is not configured, then check
whether DHCP discover messages are being sent and offers are being received.
Installation can fail if the node does not receive the DHCP offers within the specified time.
If the DHCP offers are received, the internal IP is plumbed, and localization completes, then
monitor the OAM IP plumb on eth1. Log in to the VM using its OAM IP after the IP is plumbed.
One of the reasons for GrowNode to fail is resource shortage. Check each log file for the
corresponding errors.
Troubleshooting
To... Do...
Add the appropriate route on diameter nodes. Ensure that the node-specific IP route information
is added in the Route tab of sdfc.conf.xls.
health
For alarms, counters, and notifications Check the logs under directory /var/log/alarms in
active OAM node.
• If the Host node hosting the Active IO fails, call failures are expected till the calls are switched over
to Standby IO. Default switchover time is 8s.
• If the Host node hosting both the Active IO and Standby IO fails, call failures are expected till the
Host and VM become available.
• If the Host node hosting Application VMs fails, then the calls that are in process are expected to
fail. There is no impact on new calls or sessions handled by other available Application VMs.
Collecting Aerospike database configuration and log information for offline analysis
On the CSD and SM, when Aerospike database issues require offline analysis, it is often necessary
to collect configuration and log information. This information is spread across the DB nodes in
the system, and collecting all the right information from all the right nodes can be tedious and
error prone. Therefore, the CSD and SM provide a tool that collects all the required data and creates a
compressed .tar file on the active OAM. This single file can then be forwarded to Nokia’s technical
support for offline analysis.
Usage:
Where:
Note:
The data or configuration written to a database node is replicated on another database node
in the local site. Thus, if the two database nodes holding the master and replicated copies go out of
service before re-syncing within the cluster, data or configuration may be lost.
Aerospike credentials
This section lists all the credentials related to the aerospike database.
Note: You need to enter the database password while executing the preceding tools. To get
the password, contact the Nokia support team.
2.13.1 To collect Aerospike database configuration and log information for offline
analysis
2. The /BACKUP directory and the /BACKUP/aerospikeconflogs directory on the Active OAM are used
to store the file. Use the following df command to view the available space. For example:
3. Use the db-collect-logs tool to create a compressed tar file on the active OAM, in the
directory you specify.
To see the usage information, type the command name with no arguments:
db-collect-logs
An example command for collecting the needed logs, including the platform logs, with a verbose
option so that a list of collected information is displayed:
db-collect-logs -d /BACKUP/aerospikeconflogs -v -p
4. Find the output file and send it to Nokia technical support. When db-collect-logs tool
completes, the location of the resulting file is displayed:
• dsclocal: All data under this namespace is available only to local site.
• dsc: This namespace is replicated on geo mate, when XDR is configured.
• dscglobal: All data under this namespace is replicated to geo mate and other configured remote
geo sites (when XDR is enabled).
To.. Do..
Check the DB content Use the aql tool provided by Aerospike. Run the aql
as ddeadmin user with the following credentials.
Where:
Check the number of records in any set select * from <name space>.<set name>. This
displays all records and also at the end, displays the
total number of records.
Where:
Example Output:
To.. Do..
-sh-4.2$ /opt/tpa/aerospike/bin/asinfo -v namespace/dscglobal -l | grep -i evic
type=device
evicted-objects=1103
evict-tenths-pct=5
evict-hist-buckets=10000
cold-start-evict-ttl=4294967295
-sh-4.2$
Check synchronization status The TIMELAG alarm is raised when the geo-redundancy sites did not
synchronize any data with each other.
Example:
<AlarmCode>53005</AlarmCode>
<AdditionalText>TIMELAG=83</AdditionalText>
Indicates that the sites have not synced for the past 83 seconds.
Check the DB version Run asinfo command from OAM with -h <DB
Host IP> as option (asinfo -h <DB IP>) and read the
version
Check Logs necessary to diagnose DB behavior The tar files are generated by running the command.
Where:
To.. Do..
Check the DB running status The systemctl status NOKIAasdb tells us whether DB
is up and running or not.
Note:
If the select * from <name space>.<set name> query times out in AQL while retrieving records,
increase the query timeout with set timeout <milli seconds>. For example, to set the
timeout to 10 seconds, execute set timeout 10000 in AQL.
When the network link goes down between the two geo-redundant sites, CSD raises XdrClusterUnreachableAlarm.
The DB connectivity between local and remote sites might not be reachable. Use telnet,
ping or any other network troubleshooting tools to verify if the DB nodes between the sites are
reachable.
VNF Type    Applicable node    Application processes running
CSD         OAM                ASR
CSD         CSDIO              ASR
CSD         CSDAPP             ASR
CSD         DB                 Aerospike Daemon
CSD         SS7IO              ASR
SM          OAM                ASR
SM          DB                 Aerospike Daemon
DDE_control start
DDE_control stop
<NodeType> is all.
<NodeType> is all.
DDE_status
Attention:
With two Application VMs, two IO, and two OAM nodes, the number of TCP connections
for the asd process on each database VM is around 40. The count is expected to
increase as the number of Application VMs, IO, and OAM nodes increases.
EXAMPLES:
health -h
Run a particular module or suite:
health <module or suite>
Print general help information or detailed descriptive
information for a particular module or suite:
health <module or suite> -h
COMMON MODULE LIST:
health aide_audit [-c <config_file>] [i] [-n <node_id>]
health cpu_usage [-i interval]
health cron_check
health disk_usage
health dns
health equipage_info
health ethernet
health file_check -d <directories_list> -s <file_size> [-p <depth>] [-n <node_id>]
health jarsign_check [-n <node_id>] [-f <jar_file>]
health last_restart [-t hours]
health ping
health proc_cpu_usage [-p <process_names>] [-] <number_of_top_processes>] [-n <node_id>]
health proc_cpu_usage [-p <process_names>] [-] <number_of_top_processes>] [-n <node_id>]
health proc_netstat [-n <node_id>]
health ram_usage
health time [-t sec]
health version [-n <node_id>]
COMMON SUITE LIST:
NONE
NepalDDE-ddeio-1(ACTIVE) OK
NepalDDE-ddeapp-0(NA) OK
NepalDDE-ddeapp-1(NA) OK
NepalDDE-db-0(NA) OK
NepalDDE-db-1(NA) OK
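Each line of the status listing above has the form node name, role in parentheses, then status. The following is a minimal sketch for picking out the nodes that are not OK; the function names are hypothetical, and FAILED is an invented status value used only to exercise the code, since the listing above shows only OK.

```python
import re

# "<node>(<role>) <status>", for example "NepalDDE-db-0(NA) OK"
_LINE = re.compile(r"^(?P<node>[^\s(]+)\((?P<role>[^)]*)\)\s+(?P<status>\S+)$")


def parse_health(text):
    """Return (node, role, status) tuples for every line that matches the pattern."""
    rows = []
    for line in text.splitlines():
        match = _LINE.match(line.strip())
        if match:
            rows.append((match["node"], match["role"], match["status"]))
    return rows


def not_ok(text):
    """Names of the nodes whose status is anything other than OK."""
    return [node for node, _role, status in parse_health(text) if status != "OK"]


if __name__ == "__main__":
    sample = "NepalDDE-ddeio-1(ACTIVE) OK\nNepalDDE-db-1(NA) FAILED\n"
    print(not_ok(sample))
```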
The DB and Application nodes are not reachable from outside the Virtual Network Functions
(VNFs); they are accessible only from the OAM and IO nodes.
ICMP is enabled on OAM (physical, floating IP) nodes and on IO (physical, floating IP) nodes.
• The IO node accepts Diameter requests from all peers when Whitelist is not enabled.
When Whitelist is enabled for the peers, the CSD accepts Diameter traffic only from
the whitelisted peers and rejects the rest.
• To troubleshoot issues such as peers not being able to connect to CSD, check
whether the appropriate ports are enabled in the firewall.
For example, to receive Diameter messages from peers on port 3868, this port has to be enabled
in the firewall.
Refer to the Diameter routing chapter in the CSD User Guide for complete details of the Whitelist
feature and of the plugin sequence followed in call processing.
1. Collect the heapdump from the IO and App VMs as a ddeadmin user using the following
command:
Where xyz.bin is the file name which gets generated post successful execution of the command.
Note: The heapdump command triggers a Full GC (FGC) on the JVM and has traffic
impact. It is recommended to use only under R&D request or supervision. Also the cinder
(or path where the file gets stored) must have enough space to collect the heapdump file
(the available free space must be approximately equal to the RAM size of the VM).
2. Verify the connectivity between DB VMs and all other VMs, collect the output of the following
command from the OAM, IO and App VMs.
listDiameterPeers.sh -s
5. Collect the aladmin list from the OAM VM.
6. Collect the tcpdump from the App VM for a very small duration.
7. Collect the output of the following command from all the VMs (These can be collected post
recovery of VM also).
df -k
8. Collect the output of the following command (These can be collected post recovery of VM also).
SaveLogFiles
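The note under step 1 says that the free space at the heapdump location should be approximately equal to the RAM size of the VM. The following is a minimal sketch of that check for a Linux VM; the function names are hypothetical, and comparing free bytes directly against total physical RAM is a simplification of "approximately equal".

```python
import os
import shutil


def has_room_for_heapdump(free_bytes, ram_bytes):
    """True when the free space is at least as large as the RAM size."""
    return free_bytes >= ram_bytes


def check_path(path):
    """Compare the free space of the file system holding `path` with physical RAM."""
    free = shutil.disk_usage(path).free
    # Total physical memory on Linux: page size multiplied by number of pages.
    ram = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    return has_room_for_heapdump(free, ram)


if __name__ == "__main__":
    # "." stands in for the directory where the heapdump file will be written.
    print(check_path("."))
```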
Monitor the alarms, DDESMApplication logs, and sar and top output. If an App or IO VM is not responding or
is responding with errors due to a memory leak or overload, then the VM can be recovered by restarting
the application on the respective VM.
The following command can be used to find memory leaks on IO or App VMs. In case of a memory
leak, the FGC value grows gradually and the OU value gradually approaches the OC value.
For example,
PermitRootLogin yes and then restart the SSH service by executing the following command.
When you require technical assistance to resolve a problem on the CSD, you can expedite the
assistance process by collecting the platform and software information that is listed in the table below.
• CSD release ID
• Platform information, including model, CPU
type, disk configuration and partitioning, and
the amount of installed RAM
System and software logs Collect the required log files in a compressed
archive. View the log files in the /opt/tpa/logs
directory (see To view CSD log files on page
200).
Actions performed before and after the problem Collect the following:
occurred
/opt/tpa/bin/SaveLogFiles
Note: If space in /BACKUP directory is insufficient to collect prometheus data, free some
space in /BACKUP directory and re-run the script.
The following figure describes the SaveLogFiles tool output in /BACKUP directory of active OAM.
The configuration of all the nodes is collected in one tar file and all the logs are collected in a
separate tar file, as shown in the preceding figure.
Example:
[root@ddebvnf-oame-1 BACKUP]# ls
AppServerConfigFiles.tar.gz LogFiles.tar.gz
The following figure describes the SaveLogFiles tool output in /BACKUP directory of active node.
In Bare-metal, the logs of both the active and standby nodes are collected in a single tar file and the
configuration of each node is collected in a separate tar file, as shown in the preceding figure.
Example:
[root@dumpty BACKUP]# ls
DDE_configBackup.dumpty.CSD_18_8_I114.DDESM.20180907065003.tar
DDE_configBackup.humpty.CSD_18_8_I114.DDESM.20180907065003.tar
LogFiles.tar.gz
Reason:
Solution/Workaround
If NTP is not in sync, then restart NTP Daemon using the following command on the VM:
When a database is down, the following exception seen on the IO node can be ignored.
Reason: The DiameterPeerStatus table is being updated by both the IO nodes. For an outbound
connection on a floating AddressGroup, both the IO nodes try to connect and update the same
record.
During fetch and update of a record, if the other node has updated the same record, then the DB
throws the following error.
StaleStateException
Failed to add DiameterPeerStatus object to DB
com.nokia.dde.common.db.StaleStateException: Stale version for object
com.nokia.dde.diameter.par.DiameterPeerStatus
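The behavior described above, where two writers fetch the same record and the later update is rejected, matches the general pattern of a version-checked (optimistic) update. The following is a minimal sketch of that general pattern only; it is not the product's implementation, and the class, method and key names are invented for illustration.

```python
class StaleStateError(Exception):
    """Raised when a record changed between the read and the write."""


class VersionedStore:
    """In-memory store that keeps a version counter next to every value."""

    def __init__(self):
        self._rows = {}

    def read(self, key):
        # Returns (version, value); a missing record has version 0.
        return self._rows.get(key, (0, None))

    def write(self, key, expected_version, value):
        current_version, _ = self.read(key)
        if current_version != expected_version:
            raise StaleStateError("stale version for %s" % key)
        self._rows[key] = (current_version + 1, value)
        return current_version + 1


if __name__ == "__main__":
    store = VersionedStore()
    seen_by_a, _ = store.read("peer1")   # first IO node fetches the record
    seen_by_b, _ = store.read("peer1")   # second IO node fetches the same record
    store.write("peer1", seen_by_a, "updated by first node")
    try:
        store.write("peer1", seen_by_b, "updated by second node")
    except StaleStateError as err:
        print(err)
```

In this pattern the rejected writer can simply re-read and retry, which is consistent with the guidance that the exception can be ignored.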
Reason:
1. Scenario 1: When you click any object, this error is thrown because the browser cache causes the
TCP ports to keep trying to use the old connection.
2. Scenario 2: The preceding error can also be displayed on the SM GUI during bulk provisioning when
there is a network delay between SM and CSD OAM.
Solution/Workaround
1. Scenario 1: Refresh the browser or clear the browser cache and then log in to SM GUI.
2. Scenario 2: If the error is due to network delay, then re-provision the entries that are in
unprovisioned state.
Reason:
Workaround:
Enter “;” or press Enter after the last statement to resolve the issue. If the error persists, the
rule must be corrected.
Reason:
The ME status is not updated because the heartbeat has not yet been exchanged between SM and CSD.
Workaround:
Ensure that you wait at least 2 minutes before performing any SM GUI operations at Site B.
Reason:
Workaround:
You can ignore the error as this user is added during the installation of CSD.
2.18.7 Error: Result=0001 Package NOKIA config has reported an error. Please
choose one of the following: * 1) Retry 2) Skip Package and Continue Choice:
[1-2,?,n,p,q] retry returning 9
During installation of CSD, you see an error Error: Result=0001 Package NOKIA config
has reported an error. Please choose one of the following: * 1) Retry 2)
Skip Package and Continue Choice: [1-2,?,n,p,q] retry returning 9 and the
installation is shown as successful in CBAM.
Reason:
Workaround:
You can ignore the error as this user is added during the installation of CSD.
Provisioning Failed
Reason:
Workaround:
Reason: After a restart, the prometheus service fails to start because its data is corrupted.
Workaround: From the following path, select and remove the latest corrupted data folder or file.
Reason: The prometheus service becomes inactive because the service is disabled, and the
preceding error is logged in the following file.
/var/log/messages
The following section describes the typical format of empty JSON files.
Workaround: Remove the prometheus data folders containing empty JSON files using the following
command:
rm -rf
systemctl daemon-reload
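Before removing anything, it can help to list which data folders actually contain empty JSON files. The following is a minimal sketch, assuming that "empty" means a zero-byte file with a .json extension; the data directory is passed in as a parameter because its location is not stated here, and the function name is hypothetical. The sketch only reports folders and deletes nothing.

```python
from pathlib import Path


def folders_with_empty_json(data_root):
    """Return the sorted folders under data_root that hold a zero-byte .json file."""
    root = Path(data_root)
    hits = set()
    for json_file in root.rglob("*.json"):
        # Assumption: an "empty JSON file" is a regular file of size 0.
        if json_file.is_file() and json_file.stat().st_size == 0:
            hits.add(json_file.parent)
    return sorted(hits)
```

Review the reported folders manually before deleting them.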
at com.rabbitmq.client.impl.AMQChannel.handleCompleteInboundCommand(AMQChannel.java:182)
at com.rabbitmq.client.impl.AMQChannel.handleFrame(AMQChannel.java:114)
at com.rabbitmq.client.impl.AMQConnection.readFrame(AMQConnection.java:652)
at com.rabbitmq.client.impl.AMQConnection.access$300(AMQConnection.java:48)
at com.rabbitmq.client.impl.AMQConnection$MainLoop.run(AMQConnection.java:599)
at java.base/java.lang.Thread.run(Thread.java:834)
<2019.11.15 21:34:09 912 +0530><D><ddebvnf-ddeapp-1><AMQP Connection 192.168.3.10:5672><com.rabbitmq.client.impl.recovery.AutorecoveringConnection:570>
Connection amqp://[email protected]:5672//csd has recovered
<2019.11.15 21:34:09 917 +0530><D><ddebvnf-ddeapp-1><AMQP Connection 192.168.3.10:5672><com.rabbitmq.client.impl.recovery.AutorecoveringConnection:627>
Channel AMQChannel(amqp://[email protected]:5672//csd,1) has recovered
Note: These exceptions can be ignored if they are not repeating after bringup.
CAUTION! Do not store anything in the /tmp folder, as it can cause a switchover of redundant
VMs. If /tmp on both redundant nodes becomes full, both nodes switch over
continuously.
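A simple way to watch for this condition is to check how full the file system is before it reaches 100%. The following is a minimal sketch using the Python standard library; the function names and the 90% default threshold are assumptions chosen for illustration and are not CSD settings.

```python
import shutil


def usage_percent(path):
    """Return the used share of the file system that holds path, in percent."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def is_nearly_full(path, threshold_percent=90.0):
    """True when usage is at or above the (hypothetical) warning threshold."""
    return usage_percent(path) >= threshold_percent
```

For example, calling is_nearly_full("/tmp") on each redundant VM gives an early warning before a full /tmp can trigger a switchover.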
If /tmp on the Active IO or Active OAM becomes full, a switchover is triggered (thereby
restarting keepalived.service) and recorded in ha.log at the following path.
vi /var/log/ha/ha.log
The execution of the following command displays the following error message as shown in the sample
format:
/var/log/messages
Reason:
The objectName is defined as Active-IO for the Peer connection or disconnection alarm. If the
Peer connection alarm, the Peer disconnection alarm, or the SCTP Address Unavailable alarm were
raised with the active IO VM name as the objectName (for example, ddebvnf-ddeio-[0-1]), then after
a switchover of the IO VM the automatic clearing of the alarm would fail because of the objectName
mismatch, and the alarm would have to be cleared manually. To avoid manual clearing of a large
number of alarms, the objectName of the IO VM is hard-coded as Active-IO. This allows the alarm to
be cleared automatically after a switchover of the IO VMs.
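The effect of the fixed objectName can be illustrated with a small model in which a clear event removes an alarm only when both the alarm name and the objectName match. This is a minimal sketch under that assumption; the AlarmTable class and switchover_demo function are hypothetical and do not describe the CSD implementation.

```python
class AlarmTable:
    """Tracks raised alarms keyed by (alarm name, objectName)."""

    def __init__(self):
        self.active = set()

    def raise_alarm(self, name, object_name):
        self.active.add((name, object_name))

    def clear_alarm(self, name, object_name):
        """Clear only on an exact key match; return True if an alarm was cleared."""
        key = (name, object_name)
        if key in self.active:
            self.active.remove(key)
            return True
        return False


def switchover_demo(use_fixed_name):
    """Raise on one IO VM, then try to clear from the other after a switchover."""
    table = AlarmTable()
    raiser = "Active-IO" if use_fixed_name else "ddebvnf-ddeio-0"
    clearer = "Active-IO" if use_fixed_name else "ddebvnf-ddeio-1"
    table.raise_alarm("Peer Disconnect detected", raiser)
    # After the switchover, the newly active IO VM reports the clear event.
    return table.clear_alarm("Peer Disconnect detected", clearer)
```

With per-VM names the clear event does not match the raised alarm, whereas with the fixed Active-IO name it does.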
If the HA is not stable after heal rebuild of the OAM (in fallback node).
/var/log/ha/ha.log
If HA is toggling due to zabbix-server, remove zabbix-server from the ha list to
recover the setup, using the following command.
ha rm zabbix-server
A database failure may result in call failures while data migration redistributes the
records to the available databases. Calls fail until the data migration is complete. This is
expected behavior. The migration time depends on the amount of data and the availability of databases.
Solution: Verify whether the data migration is complete using /var/log/alarms on the CSD OAM node.
Description
To Do
Logging functions
Change the log level Refer to Changing CSD logging levels section in
Logging functions on page 193.
To Do
vi /var/log/ha/ha.log
Backup the DB
Usage:
Where:
Restore the DB
Execute the command [root@sps181 bin]# ./DDE_dbRestore to restore the database from the
given backup files.
Usage:
DDE_dbRestore -d <directory>
Where:
DDE_control become-slave .
To change the snmpd target IP address or any other snmpd configuration, edit the
/etc/snmp/snmpd.conf configuration file on the OAM VMs in the following sequence.
The scrape interval and retention time after installation can be changed in the prometheus.yml in
the following path.
/opt/tpa/statistics/prometheus/
Description
The Netmon utility is integrated to check the health of the bonding interfaces. Netmon performs PING
or ARP on the NETMON_MONITOR_IPS configured in the network interface files and waits for a
response. If Netmon does not receive a response within the NETMON_TIMEOUT time (2 seconds), it
triggers a failover.
Netmon is configured on all the bonding interfaces listed in the IPM tab of sdc_conf. Netmon
configures the ifcfg files of the bonding interfaces with the following options:
NETMON_MONITOR_IPS: External IP addresses used for connectivity detection. Multiple IPs
can be provided as a comma-separated list. If multiple IPs are provided, only one needs to pass
connectivity testing for the network to be considered healthy.
For more resilience, users can also monitor the static IP of the mate node on the respective bonding
interface by adding it to the comma-separated list. In that case, change NETMON_FAILURE_COUNT so
that it is double the number of NETMON_MONITOR_IPS.
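The rules stated above (a comma-separated IP list, one successful probe being enough, and a failure count of twice the number of monitored IPs) can be sketched as follows. This is an illustration only: the function names are hypothetical, the probe results are supplied by the caller instead of being measured, and the addresses in the usage note are documentation examples.

```python
def parse_monitor_ips(value):
    """Split a comma-separated NETMON_MONITOR_IPS value into a list of addresses."""
    return [ip.strip() for ip in value.split(",") if ip.strip()]


def network_is_healthy(probe_results):
    """Healthy if at least one monitored IP answered (value True in the mapping)."""
    return any(probe_results.values())


def suggested_failure_count(monitor_ips):
    """Twice the number of monitored IPs, as recommended when the mate IP is added."""
    return 2 * len(monitor_ips)
```

For example, parse_monitor_ips("192.0.2.1, 192.0.2.2") yields two addresses, so suggested_failure_count returns 4 for that list.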
Following is an example of netmon configuration in ifcfg files after the preceding changes.
Problem:
Solution:
vi /usr/lib/systemd/system/
2. Edit the following line:
Type=oneshot to Type=simple.
If the Network Element object is already added in NetAct and CSD is instantiated later, the
alarms are not visible in the NetAct monitor.
Solution
To overcome this issue, execute the following steps on the NetAct monitor:
Note: CALM ETCD is Unavailable alarm does not get cleared automatically. It must be
cleared manually from the etcd and NetAct.
4 Alarms
Overview
Alarms are required for fault management of the system. Alarms indicate to the operator that
something has gone wrong in the system, for example, overload due to the message queue, incoming
requests exceeding the inbound throttling threshold, or the XDR link going down between the
geo-redundant sites. These are application-specific alarms. The platform generates alarms when a
node goes down, becomes unreachable, or encounters a network error.
Example
The following table lists the details of the CSD application alarms that the CSD application raises
to notify the user when a specific event occurs.
x733 specific problem for trap: SystemOverload
  x733 objectname: $HOSTNAME
  x733 event type: processingAlarm
  x733 probable cause: performanceDegraded
  x733 3gg code: NA
  Severity: Minor, Major, Critical
  Auto clear: No
  Action to be taken: This notification informs the customer that the system is overloaded. No action is needed if the overload is expected; otherwise, check the message counts and the components that are overloaded.
  Notes: This alarm is raised when the host that triggers this alarm becomes overloaded. The alarm payload describes the system components that are currently overloaded but does not explicitly state which of those components are critically overloaded.
  Additional text: $OVERLOADCOMPONENETS; $USAGESTATE; $ALARMCODE (53001); $ALARMGROUP; $SPECIFICPROBLEMDETAIL
SystemOverload $HOSTNAME processingAlarm performance NA Critical No It is a notification This alarm is raised $OVERLOAD
used to notify when the host that
COMPONENETS;
x733 specific problem for trap: SystemOutofService
  x733 objectname: $HOSTNAME
  x733 event type: processingAlarm
  x733 probable cause: outOfService
  x733 3gg code: NA
  Severity: Major
  Auto clear: No
  Action to be taken: This notification informs the customer that the system is not reachable and is out of service. As an action, the user needs to restart the node.
  Notes: This alarm is raised when the system that triggers it is brought out of service.
  Additional text: $ALARMCODE (53003); $ALARMGROUP; $SPECIFICPROBLEMDETAIL
x733 specific problem for trap: XdrClusterUnreachableAlarm
  x733 objectname: $HOSTNAME
  x733 event type: processingAlarm
  x733 probable cause: applicationSubsystemFailure
  x733 3gg code: NA
  Severity: Major
  Auto clear: No
  Action to be taken: This notification informs the customer that the other site is down or not reachable. As an action, the customer needs to log in to the respective node and restart the node.
  Notes: This alarm is raised when the host that triggers this alarm detects that an XDR cluster is unreachable.
  Additional text: $SPECIFICCLUSTER; $ALARMCODE (53004); $ALARMGROUP; $SPECIFICPROBLEMDETAIL
x733 specific problem for trap: XdrTimelagAlarm
  x733 objectname: $HOSTNAME
  x733 event type: processingAlarm
  x733 probable cause: applicationSubsystemFailure
  x733 3gg code: NA
  Severity: Major
  Auto clear: No
  Action to be taken: This notification informs the user that the host has detected too much time lag on the XDR cluster. As an action, the user needs to check the logs for node failure.
  Notes: This alarm is raised when the host that triggers this alarm detects that an XDR cluster has too much time lag.
  Additional text: $ALARMCODE (53005); $ALARMGROUP; $SPECIFICPROBLEMDETAIL
x733 specific problem for trap: XdrOutstandingRecordsAlarm
  x733 objectname: $HOSTNAME
  x733 event type: processingAlarm
  x733 probable cause: applicationSubsystemFailure
  x733 3gg code: NA
  Severity: Minor
  Auto clear: No
  Action to be taken: This notification informs the user that the host has detected too many outstanding records on the XDR cluster.
  Notes: This alarm is raised when the host that triggers this alarm detects that an XDR cluster has too many outstanding records.
  Additional text: $ALARMCODE (53006); $ALARMGROUP; $SPECIFICPROBLEMDETAIL
x733 specific problem for trap: DataMigrationInProgressAlarm
  x733 objectname: $HOSTNAME
  x733 event type: processingAlarm
  x733 probable cause: unspecifiedReason
  x733 3gg code: NA
  Severity: Minor
  Auto clear: No
  Action to be taken: This notification informs the customer that data migration has started. No action is needed if the migration is expected.
  Notes: This alarm is raised when the host that triggers it detects that data migration is in progress.
  Additional text: $ALARMCODE (53007); $ALARMGROUP; $SPECIFICPROBLEMDETAIL
x733 specific problem for trap: XdrDigestlogThresholdAlarm
  x733 objectname: $HOSTNAME
  x733 event type: processingAlarm
  x733 probable cause: applicationSubsystemFailure
  x733 3gg code: NA
  Severity: Major
  Auto clear: No
  Action to be taken: This notification informs the user that the XDR digestlog has reached the threshold percentage. As an action, the user needs to clear the XDR digest log.
  Notes: This alarm is raised when the digestlog has reached the threshold percentage.
  Additional text: $ALARMCODE (53008); $ALARMGROUP; $SPECIFICPROBLEMDETAIL
x733 specific problem for trap: PeerWhiteListThresholdAlarm
  x733 objectname: $HOSTNAME
  x733 event type: communicationAlarm
  x733 probable cause: communicationsProtocolError
  x733 3gg code: NA
  Severity: Minor
  Auto clear: Yes
  Action to be taken: This notification informs the user that there are too many connection attempts from unauthorized peers. As an action, check the contents of the countable event "Rejected Peer Connections" in the database.
  Notes: This alarm is raised when there are too many connection attempts from unauthorized peers.
  Additional text: $ALARMCODE (51001); $ALARMGROUP; $SPECIFICPROBLEMDETAIL
x733 specific problem for trap: Peer Disconnect detected
  x733 objectname: $HOSTNAME
  x733 event type: communicationAlarm
  x733 probable cause: communicationsProtocolError
  x733 3gg code: NA
  Severity: Major
  Auto clear: Yes
  Action to be taken: This notification informs the operator that the peer is disconnected. See the debug logs for more information.
  Notes: This alarm is raised when a diameter peer disconnect event is detected and is cleared when a diameter peer connection event is detected for the same diameter peer. Note: If the diameter peer is disconnected and does not reconnect, the alarm must be cleared manually.
  Additional text: $ORIGINHOST; $PROTOCOL; $ALARMCODE (51003); $ALARMGROUP; $SPECIFICPROBLEMDETAIL
x733 specific problem for trap: SCTPAddressAvailableAlarm
  x733 objectname: $HOSTNAME
  x733 event type: communicationAlarm
  x733 probable cause: communicationsProtocolError
  x733 3gg code: NA
  Severity: Minor
  Auto clear: Yes
  Action to be taken: This notification is sent whenever an SCTP address available event is detected. See the debug logs for more information.
  Notes: This alarm is raised when the SCTP address available event is detected.
  Additional text: $ALARMCODE (51004); $ALARMGROUP; $SPECIFICPROBLEMDETAIL
x733 specific problem for trap: SCTPAddressUnavailableAlarm
  x733 objectname: $HOSTNAME
  x733 event type: communicationAlarm
  x733 probable cause: communicationsProtocolError
  x733 3gg code: NA
  Severity: Minor
  Auto clear: Yes
  Action to be taken: This notification is sent whenever an SCTP address unavailable event is detected. See the debug logs for more information.
  Notes: This alarm is raised when the SCTP address unavailable event is detected.
  Additional text: $ALARMCODE (51005); $ALARMGROUP; $SPECIFICPROBLEMDETAIL
x733 specific problem for trap: RouteFailureThresholdAlarm
  x733 objectname: $HOSTNAME
  x733 event type: communicationAlarm
  x733 probable cause: communicationsProtocolError
  x733 3gg code: NA
  Severity: Minor
  Auto clear: Yes
  Action to be taken: This notification informs the user that there are too many routing failures for a route config. As an action, the user needs to check the contents of the countable event "Failed Routing Attempts" in the database.
  Notes: This alarm is raised when there are too many routing failures for this route config.
  Additional text: $ALARMCODE (51006); $ALARMGROUP; $SPECIFICPROBLEMDETAIL
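x733 specific problem for trap: SS7FailureAlarm
  x733 objectname: $HOSTNAME
  x733 event type: communicationAlarm
  x733 probable cause: communicationsProtocolError
  x733 3gg code: NA
  Severity: Major
  Auto clear: No
  Action to be taken: This notification informs the user that an SS7 SCTP connection failure has occurred.
  Notes: This alarm is raised when an SS7 SCTP connection failure occurs.
  Additional text: $ALARMCODE (55001); $ALARMGROUP; $SPECIFICPROBLEMDETAIL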
MsgValidation $HOSTNAME processingAlarm authentication NA Minor Yes It is a notification This alarm is raised $ALARM CODE
used to notify the when a diameter
Alarm Failure (52001);
user that a diameter message validation
message validation error occurs. $ALARMGROUP;
x733 specific problem for trap: SLFFailureAlarm
  x733 objectname: $HOSTNAME
  x733 event type: processingAlarm
  x733 probable cause: authenticationFailure
  x733 3gg code: NA
  Severity: Minor
  Auto clear: Yes
  Action to be taken: This notification informs the user that a diameter SLF lookup failure has occurred. As an action, check the contents of the countable event log in the database.
  Notes: This alarm is raised when a diameter SLF lookup failure occurs.
  Additional text: $ALARMCODE (52002); $ALARMGROUP; $SPECIFICPROBLEMDETAIL
x733 specific problem for trap: ThrottlingAlarm
  x733 objectname: $HOSTNAME
  x733 event type: processingAlarm
  x733 probable cause: congestion
  x733 3gg code: NA
  Severity: Minor
  Auto clear: No
  Action to be taken: This notification informs the customer that rate-limiting has been applied to Diameter messages. As an action, evaluate the configured rate limits and the number of deployed processing server instances.
  Notes: This alarm is raised when rate-limiting has been applied to diameter messages.
  Additional text: $ALARMCODE (52003); $ALARMGROUP; $SPECIFICPROBLEMDETAIL
x733 specific problem for trap: Both the OAM VMs are HA Active Or Standby
  x733 objectname: $HOSTNAME
  x733 event type: QualityOfService
  x733 probable cause: applicationSubsystemFailure
  x733 3gg code: NA
  Severity: Critical
  Auto clear: Yes
  Action to be taken: Verify the HA status on both the redundant VMs using the command ha status. If the ha status is ACTIVE on both the VMs, verify whether the VRRP connectivity is broken between the VMs. This could be a communication failure between the VMs. If the ha status is OOS or STANDBY on both VMs, check the logs in /var/log/ha on the VMs for possible indications. Execute the command ha enable verbose on a VM to view the detailed logging in /var/log/ha. Execute the command ha disable verbose on the VM to disable the ha traces.
  Notes: This alarm is raised when both the OAM VMs are in the same HA state (Active, Standby, OOS).
  Additional text: SPECIFIC PROBLEM DETAIL= Both the OAM VMs are HA Active or Stand-by: 51999 DDE-APPLICATION#; Both the OAM VMs are HA Active or Stand-by.
x733 specific problem for trap: Both the IO VMs are HA Active Or Standby
  x733 objectname: $HOSTNAME
  x733 event type: QualityOfService
  x733 probable cause: applicationSubsystemFailure
  x733 3gg code: NA
  Severity: Critical
  Auto clear: Yes
  Action to be taken: Verify the HA status on both the redundant VMs using the command ha status. If the ha status is ACTIVE on both the VMs, verify whether the VRRP connectivity is broken between the VMs. This could be a communication failure between the VMs. If the ha status is OOS or STANDBY on both VMs, check the logs in /var/log/ha on the VMs for possible indications. Execute the command ha enable verbose on a VM to view the detailed logging in /var/log/ha. Execute the command ha disable verbose on the VM to disable the ha traces.
  Notes: This alarm is raised when both the IO VMs are in the same HA state (Active, Standby, OOS).
  Additional text: SPECIFIC PROBLEM DETAIL= Both the Diameter IO VMs are HA Active or Stand-by: 51999 DDE-APPLICATION#; Both the Diameter IO VMs are HA Active or Stand-by.
x733 specific problem for trap: Both the SS7 IO VMs are HA Active Or Standby
  x733 objectname: $HOSTNAME
  x733 event type: QualityOfService
  x733 probable cause: applicationSubsystemFailure
  x733 3gg code: NA
  Severity: Critical
  Auto clear: Yes
  Action to be taken: Verify the HA status on both the redundant VMs using the command ha status. If the ha status is ACTIVE on both the VMs, verify whether the VRRP connectivity is broken between the VMs. This could be a communication failure between the VMs. If the ha status is OOS or STANDBY on both VMs, check the logs in /var/log/ha on the VMs for possible indications. Execute the command ha enable verbose on a VM to view the detailed logging in /var/log/ha. Execute the command ha disable verbose on the VM to disable the ha traces.
  Notes: This alarm is raised when both the SS7 IO VMs are in the same HA state (Active, Standby, OOS).
  Additional text: SPECIFIC PROBLEM DETAIL= Both the SS7 IO VMs are HA Active or Stand-by: 51999 DDE-APPLICATION#; Both the SS7 IO VMs are HA Active or Stand-by.
RabbitMQ $HOSTNAME communication communications Not Major No Check in the This alarm is raised 53011: DDE_
DDESMApplication. when the APP SYSTEM#
Consume Alarm Protocol Appli
log if the or IO nodes are
RabitMQ Event
Failed Error cable rabbitmq unable to consume
Consumption
notifications about the RabbitMQ
add, update, delete notifications. Failed
The
DDESMApplication.
log indicates
whether the
notification is
consumed or not.
Possible failures
should be evident in
the log.
For example,
no database
connectivity.
Example: restore
the database
connectivity.
reception of the
change in App or
IO nodes, clear the
alarm manually.
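When triaging traps, it can be convenient to look up an alarm code and see at a glance whether the alarm clears automatically. The following is a minimal sketch that encodes the alarm codes, names, and Auto clear values listed in the preceding CSD application alarm table as a lookup; the ALARMS dictionary and the function names are hypothetical helpers and are not part of CSD.

```python
# Alarm code -> (specific problem, auto clear), taken from the CSD application alarm table.
ALARMS = {
    53001: ("SystemOverload", False),
    53003: ("SystemOutofService", False),
    53004: ("XdrClusterUnreachableAlarm", False),
    53005: ("XdrTimelagAlarm", False),
    53006: ("XdrOutstandingRecordsAlarm", False),
    53007: ("DataMigrationInProgressAlarm", False),
    53008: ("XdrDigestlogThresholdAlarm", False),
    53011: ("RabbitMQ Consume Failed", False),
    51001: ("PeerWhiteListThresholdAlarm", True),
    51003: ("Peer Disconnect detected", True),
    51004: ("SCTPAddressAvailableAlarm", True),
    51005: ("SCTPAddressUnavailableAlarm", True),
    51006: ("RouteFailureThresholdAlarm", True),
    55001: ("SS7FailureAlarm", False),
    52001: ("MsgValidationAlarm", True),
    52002: ("SLFFailureAlarm", True),
    52003: ("ThrottlingAlarm", False),
}


def alarm_name(code):
    """Return the specific problem for a code, or None if the code is not listed."""
    entry = ALARMS.get(code)
    return entry[0] if entry else None


def needs_manual_clear(code):
    """True when the table lists Auto clear as No for this code."""
    name, auto_clear = ALARMS[code]
    return not auto_clear
```

The lookup covers only the alarms whose codes appear in the table; the notes in this section describe further cases in which an auto-clear alarm must still be cleared manually.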
Note: The Peer Disconnect detected alarm is not raised during any of the abrupt switchover
scenarios (such as reboot, power off, not responding, and so on) on the diameter IO
node. This alarm can be observed only during the graceful switchover of CSD IO such as
DDE_control become slave.
Note: If the Mux connection is not stable due to internal network issues or due to repeated
Block or Unblock of Mux communication ports, then the Peer Connections Mismatch with IO
alarm may be raised. This alarm is automatically cleared within 30 seconds.
Note: After an upgrade to CSD 20.2 Release, raised alarms that were introduced after
CSD 19.2 Release (for VMware) or after CSD 19.5 Release (for OpenStack) are not cleared
after falling back to the base release. These alarms must be cleared manually (for example,
DBClusterFailure, TDR Collection failed, and so on).
Attention: For successful TDR collection, passwordless authentication must be enabled for
ddeadmin user.
Following are the alarms that SM application generates if the Managed Element (ME) is unavailable
and cleared when the ME becomes available again:
The following table lists the SM application alarm details that are raised by the SM application.
x733 specific
x733 x733 event x733 probable Action to be
problem for x733 3gg code Severity Auto clear Notes Additional text
objectname type cause taken
trap
x733 specific problem for trap: SystemOutofService
  x733 objectname: $HOSTNAME
  x733 event type: processingAlarm
  x733 probable cause: outOfService
  x733 3gg code: Not applicable.
  Severity: Major
  Auto clear: No
  Action to be taken: This notification informs the customer that the system is not reachable and is out of service. As an action, the user needs to restart the node.
  Notes: This alarm is raised when the system that triggers it is brought out of service.
  Additional text: $ALARMCODE (53003); $ALARMGROUP; $SPECIFICPROBLEMDETAIL
x733 specific problem for trap: XdrClusterUnreachableAlarm
  x733 objectname: $HOSTNAME
  x733 event type: processingAlarm
  x733 probable cause: applicationSubsystemFailure
  x733 3gg code: Not applicable.
  Severity: Major
  Auto clear: No
  Action to be taken: This notification informs the customer that the other site is down or not reachable. As an action, the customer needs to log in to the respective node and restart the node.
  Notes: This alarm is raised when the host that triggers this alarm detects that an XDR cluster is unreachable.
  Additional text: $SPECIFICCLUSTER; $ALARMCODE (53004); $ALARMGROUP; $SPECIFICPROBLEMDETAIL
x733 specific problem for trap: XdrTimelagAlarm
  x733 objectname: $HOSTNAME
  x733 event type: processingAlarm
  x733 probable cause: applicationSubsystemFailure
  x733 3gg code: Not applicable.
  Severity: Major
  Auto clear: No
  Action to be taken: This notification informs the user that the host has detected too much time lag on the XDR cluster. As an action, the user needs to check the logs for node failure.
  Notes: This alarm is raised when the host that triggers this alarm detects that an XDR cluster has too much time lag.
  Additional text: $ALARMCODE (53005); $ALARMGROUP; $SPECIFICPROBLEMDETAIL
x733 specific problem for trap: XdrOutstandingRecordsAlarm
  x733 objectname: $HOSTNAME
  x733 event type: processingAlarm
  x733 probable cause: applicationSubsystemFailure
  x733 3gg code: Not applicable.
  Severity: Minor
  Auto clear: No
  Action to be taken: This notification informs the user that the host has detected too many outstanding records on the XDR cluster.
  Notes: This alarm is raised when the host that triggers this alarm detects that an XDR cluster has too many outstanding records.
  Additional text: $ALARMCODE (53006); $ALARMGROUP; $SPECIFICPROBLEMDETAIL
x733 specific problem for trap: DataMigrationInProgressAlarm
  x733 objectname: $HOSTNAME
  x733 event type: processingAlarm
  x733 probable cause: unspecifiedReason
  x733 3gg code: Not applicable.
  Severity: Minor
  Auto clear: No
  Action to be taken: This notification informs the customer that data migration has started. No action is needed if the migration is expected.
  Notes: This alarm is raised when the host that triggers it detects that data migration is in progress.
  Additional text: $ALARMCODE (53007); $ALARMGROUP; $SPECIFICPROBLEMDETAIL
Xdr $HOSTNAME processingAlarm application Not applicable. Major No It is a This alarm is $ALARM
notification raised when CODE
Digestlog Subsystem
which is used Digestlog
(53008);
Threshold Failure to notify that the has reached
XDR digestlog the threshold $ALARM
Alarm
has reached percentage GROUP;
x733 specific
x733 x733 event x733 probable Action to be
problem for x733 3gg code Severity Auto clear Notes Additional text
objectname type cause taken
trap
x733 specific problem for trap: Both the OAM VMs are HA Active Or Standby
  x733 objectname: $HOSTNAME
  x733 event type: QualityOfService
  x733 probable cause: applicationSubsystemFailure
  x733 3gg code: Not applicable.
  Severity: Critical
  Auto clear: Yes
  Action to be taken: Verify the HA status on both the redundant VMs using the command ha status. If the ha status is ACTIVE on both the VMs, verify whether the VRRP connectivity is broken between the VMs. This could be a communication failure between the VMs. If the ha status is OOS or STANDBY on both VMs, check the logs in /var/log/ha on the VMs for possible indications. Execute the command ha enable verbose on a VM to view the detailed logging in /var/log/ha. Execute the command ha disable verbose on the VM to disable the ha traces. To recover the VMs from split-brain, after the network or external factors that led to the issue are corrected, perform an HA restart on both the VMs using the command ha restart. Note: This alarm is raised and cleared during instantiation. This is expected behavior.
  Notes: This alarm is raised when both the OAM VMs are in the same HA state (Active, Standby, OOS).
  Additional text: SPECIFIC PROBLEM DETAIL= Both the OAM VMs are HA Active or Stand-by: 51999 DDE-APPLICATION#; Both the OAM VMs are HA Active or Stand-by.
HA Master HA master processing application Major Not applicable. Not applicable. Indicates the Yes Node transitioning
Recovering keepalive to ACTIV.
<hostname> ErrorAlarm Subsystem
component that
#HA Failure has experienced
a failover. This
is triggered once
the new Master
becomes available.
No further action is
necessary.
Excessive <hostname> QualityOfService authentication Minor Exceeded the EXCESSIVE It is automatically Yes Success/Failure
authentication threshold for AUTHENTICATION unblocked after indication,
#SECURITY Failure
failures the number of FAILURES LOG- the time period Login ID, Event
consecutive IN ACCESS IS specified by the Description,
login failures TEMPORARILY LOCKTIMEOUT Source IP Address
(MAX_LOGIN_ DISABLED parameter.
FAILURES) FOR ACCOUNT
tester[OS] default
value for MAX_
LOGIN_FAILURES
is 6 default
value for the
LOCKTIMEOUT
parameter is 5
minutes
DORMANT UNIX <hostname> QualityOfService Theshold Minor Warning that an ACCOUNT Not applicable. No Success/Failure
ACCOUNT account is about to DORMANT FOR indication,
#SECURITY Crossed
be locked due to AT LEAST b Login ID, Event
lack of use DAYS NO ACTION Description,
TAKEN. Source IP Address
Default b = 45
DORMANT <hostname> QualityOfService Theshold Minor An account is ACCOUNT Not applicable. No Success/Failure
UNIX ACCOUNT locked due to lack DORMANT FOR indication,
#SECURITY Crossed
LOCKED of use AT LEAST b DAYS Login ID, Event
Default b = 60
SECURITY <hostname> QualityOfService Theshold Minor Not applicable. Not applicable. No Success/Failure
AUDIT LOGGING indication,
#SECURITY Crossed
STARTED Login ID, Event
Description,
Source IP Address
SECURITY <hostname> QualityOfService Theshold Minor Not applicable. Not applicable. No Success/Failure
AUDIT LOGGING indication,
#SECURITY Crossed
STOPPED Login ID, Event
Description,
Source IP Address
SECURITY LOG <hostname> Processing Error Configuration Minor Not applicable. Not applicable. No Success/Failure
MODIFICATION indication,
#SECURITY or
Login ID, Event
Customization Description,
Error Source IP Address
DATE OR TIME <hostname> Processing Error Configuration Minor Not applicable. Not applicable. No Success/Failure
MODIFICATION indication,
#SECURITY or
Login ID, Event
Customization Description,
Error Source IP Address
FILE <hostname> Processing Error Configuration Minor Not applicable. Not applicable. No Success/Failure
MODIFICATION indication,
#SECURITY or
Login ID, Event
Customization Description,
Error Source IP Address
USER ACCOUNT <hostname> Processing Error Configuration Minor Not applicable. Not applicable. No Success/Failure
MODIFICATION indication,
#SECURITY or
Login ID, Event
Customization Description,
Error Source IP Address
ETCD Cluster oame qualityOf outofService Clear No action needed. Refer to the text
Service in ETCD cluster
#ETCD Service Service
Degradation health check.
Alarm
ETCD Cluster oame qualityOf outofService Major Manually check the No Refer to the text
Service ETCD status. in ETCD cluster
#ETCD Service Service
Degradation health check.
Alarm
ETCD Cluster oame qualityOf outofService Critical Manually check the No Refer to the text
Service ETCD status. in ETCD cluster
#ETCD Service Service
Degradation health check.
Alarm
HA Master HA master processingErrorAlarm application Major Not applicable. Not applicable. Indicates the Yes Node transitioning
Recovering keepalive to ACTIV.
<hostname> Subsystem
component that
#HA Failure has experienced
a failover. This
is triggered once
the new Master
becomes available.
No further action is
necessary.
Excessive <hostname> QualityOfService authenticationFailure Minor Exceeded the EXCESSIVE It is automatically Yes Success/Failure
authentication threshold for AUTHENTICATION unblocked after indication,
#SECURITY
failures the number of FAILURES LOG- the time period Login ID, Event
consecutive IN ACCESS IS specified by the Description,
login failures TEMPORARILY LOCKTIMEOUT Source IP Address
(MAX_LOGIN_ DISABLED parameter.
FAILURES) FOR ACCOUNT
tester[OS] default
value for MAX_
LOGIN_FAILURES
is 6 default
value for the
LOCKTIMEOUT
parameter is 5
minutes
DORMANT UNIX <hostname> QualityOfService ThesholdCrossed Minor Warning that an ACCOUNT Not applicable. No Success/Failure
ACCOUNT account is about to DORMANT FOR indication,
#SECURITY
be locked due to AT LEAST b Login ID, Event
lack of use DAYS NO ACTION Description,
TAKEN. Default b Source IP Address
= 45
DORMANT <hostname> QualityOfService ThesholdCrossed Minor An account is ACCOUNT Not applicable. No Success/Failure
UNIX ACCOUNT locked due to lack DORMANT FOR indication,
#SECURITY
LOCKED of use AT LEAST b DAYS Login ID, Event
ACCOUNT HAS Description,
BEEN LOCKED. Source IP Address
Default b = 60
SECURITY <hostname> QualityOfService ThesholdCrossed Minor Not applicable. Not applicable. No Success/Failure
AUDIT LOGGING indication,
#SECURITY
STARTED Login ID, Event
Description,
Source IP Address
SECURITY <hostname> QualityOfService ThesholdCrossed Minor Not applicable. Not applicable. No Success/Failure
AUDIT LOGGING indication,
#SECURITY
STOPPED Login ID, Event
Description,
Source IP Address
SECURITY LOG <hostname> Processing Error Configuration Minor Not applicable. Not applicable. No Success/Failure
MODIFICATION indication,
#SECURITY or
Login ID, Event
Customization Description,
Error Source IP Address
DATE OR TIME <hostname> Processing Error Configuration Minor Not applicable. Not applicable. No Success/Failure
MODIFICATION indication,
#SECURITY or
Login ID, Event
Customization Description,
Error Source IP Address
FILE <hostname> Processing Error Configuration Minor Not applicable. Not applicable. No Success/Failure
MODIFICATION indication,
#SECURITY or
Login ID, Event
Customization Description,
Error Source IP Address
USER ACCOUNT <hostname> Processing Error Configuration Minor Not applicable. Not applicable. No Success/Failure
MODIFICATION indication,
#SECURITY or
Login ID, Event
Customization Description,
Error Source IP Address
ETCD Cluster oame qualityOfService outofService Clear No action needed. Refer to the text
Service in ETCD cluster
#ETCD Alarm
Degradation health check.
Service
ETCD Cluster oame qualityOfService outofService Major Manually check the Refer to the text
Service ETCD status. in ETCD cluster
#ETCD Alarm
Degradation health check.
Service
ETCD Cluster oame qualityOfService outofService Critical Manually check the Refer to the text
Service ETCD status. in ETCD cluster
#ETCD Alarm
Degradation health check.
Service
Note: During switchover of OAM nodes, the alarms (raise or clear) may not be sent from
CSD to NetAct. This depends on the duration for which Virtual IP address is unplumbed and
plumbed on the VM by OS network utilities.
In this case, alarms must be cleared manually post successful execution of the following
operations on active OAM VM:
• Rebuild
• Reboot
• Switchover
5 Logging functions
Description
• Provisioning log: This log refers to Provisioning transaction log. It contains information about all
provisioning transactions performed using either Graphical User Interface (GUI) or an upstream
driving system, such as REST Application Programming Interface (API).
• Managed Element log: This log refers to the Managed Element log. It contains information such as the
ME name for requests sent to Managed Elements (MEs), ME info, or error responses.
• User Access log: This log refers to User Access log. It contains all information related to user
login or number of successful/unsuccessful attempts.
• Error log: This log has all functional area logs with severity levels defined as WARN, ERROR,
and FATAL as per the requirements.
• DDESMApplication log: This is a default log that records all operations initiated by SM users not
covered by other logs.
TRACE
ERROR
The path of all log files is /opt/tpa/logs and alarms are logged in file /var/log/alarms.
5.1.2 SM logs
Error log
DDESMApplication.log
Node Location
*/<VM name>
Node Location
For events, alerts
• /opt/gangalia/log/gen3GPPXmlPM.log
• /var/log/calm/alma.log
Installation logs:
/opt/tpa/logs/SPI.*.log
/opt/tpa/logs/install_DDE.*.log
IO node /opt/tpa/logs/DDESMApplication.log
DB nodes /opt/tpa/logs/aerospike.log
The default logging level for many functions is informational (INFO), which minimizes the number
of log entries. The logging level for many internal functions can be configured to provide more
information for troubleshooting, for example by setting the logging level to DEBUG. To view this
You can specify one of the following case-sensitive logging levels for each functional area:
Level Description
Note:
Enabling DEBUG-level logging generates an extremely large number of log entries,
which can affect system performance. Ensure that you lower the logging level only for the
required functions and only for the time required to collect the necessary information.
Also, a changed log level does not persist after an application server restart. For a
persistent change, you must modify the log4j.xml file accordingly.
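A persistent change amounts to editing the level value of the relevant logger element in log4j.xml. The following is a minimal sketch of such an edit using the Python standard library, operating on XML text supplied by the caller. The function name is hypothetical, and the sketch assumes logger elements of the form `<logger name="..."><level value="..."/></logger>`. Note that ElementTree does not preserve comments or a DOCTYPE declaration, so treat this as an illustration and keep a backup of the real file.

```python
import xml.etree.ElementTree as ET


def set_logger_level(xml_text, logger_name, level):
    """Return xml_text with the level of the named logger set to the given value."""
    root = ET.fromstring(xml_text)
    changed = False
    for logger in root.iter("logger"):
        if logger.get("name") == logger_name:
            level_el = logger.find("level")
            if level_el is None:
                # Add a level element if the logger does not define one yet.
                level_el = ET.SubElement(logger, "level")
            level_el.set("value", level)
            changed = True
    if not changed:
        raise KeyError(logger_name)
    return ET.tostring(root, encoding="unicode")
```

For example, passing the logger name com.nokia.dde and the level DEBUG changes an existing INFO entry for that logger to DEBUG in the returned text.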
A value of NULL means that a level has not been explicitly set for the specified logger.
If a level has not been set for the specified logger, a previous level of INFO is assumed.
DDE_logLevel reset_all
Note: OSSI log collection requires the use of system resources that may affect system
performance. Ensure that you only collect the logs with the troubleshooting information
required. The system does not automatically delete OSSI logs. You must periodically delete
the logs manually.
<logger name="com.nokia.dde">
<level value="INFO"/>
</logger>
4. Enter the following line above or below the lines to which you navigated in step 3
</logger
<!-- Warning, the following entry can be use for debugging OSSI
1. Open a console window and log in to the application blade as root user.
2. Navigate to /opt/tpa/bin/DDE_loglevel.
3. For debugging purposes set the log level to verify the traffic behavior.
For more information, refer to the section Tracing and Debugging on page 144.
3. Use a text editor to open a .log file to view the file. For example, view the
DDESMApplication.log file by typing:
# vi DDESMApplication.log
Overview
On both CSD and SM nodes, logs related to CSD and CSF are rotated when they reach either their
configured maximum size or their configured maximum time. On CSD, the rotation settings for CSD logs
are defined in /opt/tpa/logs/log4j.xml (present on all nodes, including APP, IO, and OAM,
except DB). For CSF logs, the settings are defined in /etc/logrotate_syslog.conf on all
nodes.
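The rotation rule described above (rotate when either limit is reached) can be sketched as a simple check. The limits used here are hypothetical values chosen for illustration only; the actual values come from the configuration files named above.

```python
def should_rotate(size_bytes, age_seconds, max_size_bytes, max_age_seconds):
    """Return True when the size limit or the time limit has been reached."""
    return size_bytes >= max_size_bytes or age_seconds >= max_age_seconds


# Hypothetical limits: 10 MiB or 24 hours, whichever comes first.
MAX_SIZE = 10 * 1024 * 1024
MAX_AGE = 24 * 60 * 60

large_but_young = should_rotate(12 * 1024 * 1024, 3600, MAX_SIZE, MAX_AGE)
small_and_young = should_rotate(1024, 3600, MAX_SIZE, MAX_AGE)
small_but_old = should_rotate(1024, 2 * MAX_AGE, MAX_SIZE, MAX_AGE)
print(large_but_young, small_and_young, small_but_old)
```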
[root@ddebvnf-oame-0 logs]# ls -1
ddeLogrotate.conf
DDEOperations.log
DDESecurity.log
DDESMApplication.0.log.gz
DDESMApplication.1.log.gz
DDESMApplication.2.log.gz
DDESMApplication.3.log.gz
DDESMApplication.log
DDE_system.log
DiameterInterceptor.log
gc.log.0
gc.log.1
gc.log.2
gc.log.3
gc.log.4.current
HADiagnosticsDefault.log
install_DDE.2018.09.11-14-28.log
log4j.xml
PolicyDecision.log
Server.0.log.gz
Server.1.log.gz
Server.2.log.gz
Server.log
SPI.2018.09.11-14-27.log
SubscriberTrace.log
[root@ddebvnf-db-1 logs]# ls -1
aerospike.log
aerospike.log.1
aerospike.log.10.gz
aerospike.log.11.gz
aerospike.log.12.gz
aerospike.log.2.gz
aerospike.log.3.gz
aerospike.log.4.gz
aerospike.log.5.gz
aerospike.log.6.gz
aerospike.log.7.gz
aerospike.log.8.gz
aerospike.log.9.gz
aerospike_warning.log
ddeLogrotate.conf
DDE_system.log
install_DDE.2018.09.11-14-44.log
SPI.2018.09.11-14-43.log
sa
sdc_alarm.log
sdc.log
secpam-boot.log
secure
secure.1
secure.2.gz
secure.3.gz
secure.4.gz
snmptt
spooler
syncer
tallylog
tuned
watson
wtmp
yum.log
6 Network Troubleshooting
Overview
This chapter describes how to manage network functions and troubleshoot common network problems
that arise in the CSD.
The following table lists the common network problems in CSD along with their solutions.
Problem: Host unreachable when CBAM and VNF are in different subnets.
Solution: Before installation, ensure that the CSD or SM VNF gateway is reachable through CBAM.
Problem: Diameter peer not reachable due to missing routes. When the CSD and the remote peer
are on different networks, CSD fails to reach the remote peer because no route is available.
Solution: Add appropriate routes towards the remote peer on IO nodes for reachability.
Problem: Upgrade or rollback failure with the error Destination Host Unreachable.
Solution: Re-trigger the upgrade or rollback when you encounter the error Destination Host
Unreachable.
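When checking the reachability problems listed above, a quick TCP connection attempt can help to tell a routing problem from a service problem. The following is a minimal sketch only: the helper name is_tcp_reachable is hypothetical, it covers TCP only (not SCTP or ICMP), and the demonstration uses a local listener so that no real host name or port is assumed.

```python
import socket


def is_tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts or networks.
        return False


# Local demonstration: open a listener on an ephemeral port, then close it.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
demo_port = listener.getsockname()[1]

reachable_while_listening = is_tcp_reachable("127.0.0.1", demo_port)
listener.close()
reachable_after_close = is_tcp_reachable("127.0.0.1", demo_port, timeout=1.0)
print(reachable_while_listening, reachable_after_close)
```

In practice, replace the local address and port with the address and port of the remote peer or gateway under investigation.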
Note: For details on release changes prior to 18.8 SP1, refer to the Monitoring and
Troubleshooting Guide, 9YZ-08354-MT11-PCZZA.
7.1 Scripts
Overview
The following scripts are present in /opt/tpa/bin:
• configure-geo-redundancy
• db-collect-logs
• distribute-diameter-dictionaries
• list-diameter-routes
• config-backup
• db-backup
• db-restore
• list-diameter-peers
• log-level
DDE_BM_18.2_Ixxx.tar.gz NOKIAcdrom-CSD_18_8_Ixxx-1.x86_64.rpm
Previous behavior: Diameter Stack Status: LISTENING or NOT LISTENING.
Current behavior: Diameter Stack Status: STARTED or NOT STARTED.
/opt/tpa/logs
The following is the sample provisioning log without username.
Status of provisioning the entity
DiameterPeerConfig
{id=DiameterPeerConfigID
{diamPeerId=529163938},
portNumber=0, active=true,
addressGroupName='Tcp1',
secondaryAddresses={}, secure=false,
description='null',
locallyInitiated=false, requestTimeout=0,
numberOfSendAttempts=0, sourcePortNumber=0,
primaryAddress='',
protocol='TCP',
ingressPeerThrottlingProfileName='null',
egressPeerThrottlingProfileName='null',
peerConnectionProfileName='null',
downStreamOverLoadProfileName='null'
, sctpAssociationProfileName='null',
requestTimoutProfileName='null',
originHost='peer',
peerType='Fqdn', createdDate='1580121875577',
rank='0'}
to Managed Element : ME is SUCCESS
The following is the sample provisioning log indicating the username who performed the provisioning
transactions.
smadmin has provisioned the entity
DiameterPeerConfig
{id=DiameterPeerConfigID
{diamPeerId=529163938},
portNumber=0, active=true,
addressGroupName='Tcp1',
secondaryAddresses={},
secure=false, description='null',
locallyInitiated=false,
requestTimeout=0, numberOfSendAttempts=0,
sourcePortNumber=0, primaryAddress='',
protocol='TCP',
ingressPeerThrottlingProfileName='null',
egressPeerThrottlingProfileName='null',
peerConnectionProfileName='null',
downStreamOverLoadProfileName='null',
sctpAssociationProfileName='null',
requestTimoutProfileName='null',
originHost='peer', peerType='Fqdn',
createdDate='1580121875577', rank='0'} and
Status of provisioning
to Managed Element : ME is SUCCESS
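To pick the user name, entity, and result out of provisioning log entries such as the samples above, a small parser along the following lines may help. This is a sketch under assumptions: the regular expressions are derived only from the two sample entries (shortened here), the function name summarize_entry is hypothetical, and other entry formats may exist.

```python
import re

# Patterns derived from the sample entries only (assumption).
USER_RE = re.compile(r"^(?P<user>\S+) has provisioned the entity")
ENTITY_RE = re.compile(r"provision(?:ed|ing) the entity (?P<entity>\w+)")
STATUS_RE = re.compile(r"to Managed Element : (?P<me>\S+) is (?P<status>\w+)")


def summarize_entry(entry):
    """Extract user, entity, Managed Element, and status from one log entry."""
    text = " ".join(entry.split())  # collapse line wraps into single spaces
    user = USER_RE.search(text)
    entity = ENTITY_RE.search(text)
    status = STATUS_RE.search(text)
    return {
        "user": user.group("user") if user else None,
        "entity": entity.group("entity") if entity else None,
        "managed_element": status.group("me") if status else None,
        "status": status.group("status") if status else None,
    }


# Shortened versions of the two sample entries.
with_user = summarize_entry(
    "smadmin has provisioned the entity DiameterPeerConfig "
    "{id=DiameterPeerConfigID {diamPeerId=529163938}, rank='0'} and "
    "Status of provisioning to Managed Element : ME is SUCCESS"
)
without_user = summarize_entry(
    "Status of provisioning the entity DiameterPeerConfig "
    "{id=DiameterPeerConfigID {diamPeerId=529163938}, rank='0'} "
    "to Managed Element : ME is SUCCESS"
)
print(with_user)
print(without_user)
```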
The following criteria were present in the message filters of the SM GUI:
• Any Diameter-Answer (for example, GxCCA, ShPUA, S6aULA, and so on)
• Diameter-CER
• Diameter-CEA
• Diameter-DPR
• Diameter-DPA
The following unsupported criteria are removed from the message filters in the SM GUI:
• Any Diameter-Answer (for example, GxCCA, ShPUA, S6aULA, and so on)
• Diameter-CER
• Diameter-CEA
• Diameter-DPR
• Diameter-DPA
Previous behavior: In case of a routing failure or when the peer is in the down state, CSD
sends Auth-Application-Id as a single AVP or as part of Vendor-Specific-Application-Id
in the 3xxx error responses. However, in scenarios like Request Timeout and Loop Detection,
Auth-Application-Id is not included in the 3xxx error responses.
Current behavior: For any 3xxx error response, the Auth-Application-Id is not sent either as a
single AVP or as part of Vendor-Specific-Application-Id. However, if a 3xxx response that
contains Auth-Application-Id is sent by the remote peer, it is relayed back to the client
as it is, with the Auth-Application-Id.
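The newer handling of Auth-Application-Id in 3xxx error responses (not sent in error answers that CSD generates itself, relayed unchanged when a remote peer included it) can be modelled as follows. This is an illustrative sketch, not CSD code: AVPs are represented as a plain dictionary, and the function name build_error_answer is hypothetical.

```python
def build_error_answer(result_code, remote_answer=None):
    """Model the AVPs of a 3xxx error answer returned to the client.

    remote_answer is the AVP dictionary of a 3xxx answer received from a
    remote peer, or None when the error is generated locally.
    """
    if not 3000 <= result_code <= 3999:
        raise ValueError("only 3xxx protocol errors are modelled here")
    if remote_answer is not None:
        # Relayed answer: returned as it is, including any Auth-Application-Id.
        return dict(remote_answer)
    # Locally generated answer: no Auth-Application-Id and no
    # Vendor-Specific-Application-Id is added.
    return {"Result-Code": result_code}


# 3002 is DIAMETER_UNABLE_TO_DELIVER; 16777238 is the Gx application identifier.
local_answer = build_error_answer(3002)
relayed_answer = build_error_answer(
    3002, {"Result-Code": 3002, "Auth-Application-Id": 16777238}
)
print(local_answer)
print(relayed_answer)
```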
Previous behavior: DDE_control status command performs graceful switchover of IO and OAM nodes.
Current behavior: DDE_control status command now performs graceful switchover of OAM nodes.
1. DDE_control close-all-connections
2. DDE_control close-all-ha-connections
Execution of DDE_Status on both IO and OAM Execution of DDE_Status on both IO and OAM
nodes: nodes:
Switchover of IO node is performed using the following Switchover of IO node is performed using the following
command: command:
Previous behavior: Stateful closing of all connections (static and floating) cannot be performed.
Current behavior: Stateful closing of all connections (static and floating) can now be performed
using the following command:
DDE_control close-all-connections
Previous behavior: Route-Record AVP gets added to Diameter-Answer messages sent by CSD.
Current behavior: Route-Record AVP will not be added to Diameter-Answer messages sent by CSD.
Previous behavior: The XinetD service enables you to start programs which provide internet services.
Current behavior: The XinetD service is removed as this service can be a security vulnerability.
/appdata/ganglia/meas/ /appdata/ganglia/meas/cpro
CBAM-51328829b76e4fd4a8ced03173fa548c
7.2.11 healthMachines
This tool lists all the nodes with their HA and service status. The HA status of shutdown VMs is changed.
You can access this tool from the /opt/tpa/bin directory as a root user.
Previous HA Status: Unknown
Current HA Status: NA
For more details, refer to the Upgrade section in the Installation and Upgrade Operations Guide for Cloud
Deployments.
For more details, see the Metrics configuration section in the CSD User Guide.
The following table lists the changes from previous release to this release.
Previous release: metrics.n.csv
This release: <startdate.timestamp+timezone>-<enddate.timestamp+timezone>_sitename.csv
(A20181026.085500+0000-20181026.090000+0000_v700DDE.csv)
Previous release: /opt/tpa/logs
This release: /opt/tpa/logs/metrics/csvfiles
Previous release: <startdate.timestamp+timezone>-<enddate.timestamp+timezone>_sitename.csv
(A20181026.085500+0000-20181026.090000+0000_v700DDE.csv)
This release: <startdate.timestamp+localtimezone>-<enddate.timestamp+localtimezone>_sitename.csv
(A20190701.122500+0000-20190701.123000+0000_netco.csv)
Previous release: Fields with reduced number of columns.
This release: Fields with reduced number of columns.
Note: The value of Site in the metric file displays the geoSitename configured in the
vnfd.scalable.tosca.yaml file.
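The metrics file names shown above encode the start time, the end time, and the site name, so they can be parsed when sorting or selecting files for analysis. The following sketch is based only on the example file names in this section. The optional leading "A" appears in the examples but not in the name template, so treating it as optional is an assumption, and the function name parse_metrics_filename is hypothetical.

```python
import re
from datetime import datetime

# Pattern inferred from the example names, for instance
# A20190701.122500+0000-20190701.123000+0000_netco.csv
NAME_RE = re.compile(
    r"^A?(?P<start>\d{8}\.\d{6}[+-]\d{4})"
    r"-(?P<end>\d{8}\.\d{6}[+-]\d{4})"
    r"_(?P<site>.+)\.csv$"
)
STAMP_FORMAT = "%Y%m%d.%H%M%S%z"


def parse_metrics_filename(name):
    """Return (start, end, site) for a metrics CSV file name."""
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError("unexpected metrics file name: " + name)
    start = datetime.strptime(match.group("start"), STAMP_FORMAT)
    end = datetime.strptime(match.group("end"), STAMP_FORMAT)
    return start, end, match.group("site")


start, end, site = parse_metrics_filename(
    "A20190701.122500+0000-20190701.123000+0000_netco.csv"
)
interval_seconds = (end - start).total_seconds()
print(site, start.isoformat(), interval_seconds)
```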
For more details, see the Transaction Data Record (TDR) sections in the CSD User Guide and the Installation
and Upgrade Operations Guide for Cloud Deployments.
For more details, refer to the Analytics section in the CSD User Guide.
Dashboard Analytics
7.2.17 Improper disk utilization of all the VMs during OAM node reboot
This section provides information on common errors or issues encountered during reboot of OAM
nodes using the following command:
reboot -f
Problem: The counters related to disk utilization in the 3GPP system counter file for all the VMs
remain improper for the duration of the OAM VM reboot.
Solution: Update the values of collect_every and time_threshold of disk.total to 180 seconds in
the following file for all the VMs:
/etc/ganglia/gmond.conf
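The change described above can be scripted when it has to be applied on many VMs. The following sketch rewrites collect_every and time_threshold in every collection_group block that contains a given metric. It is illustrative only: the sample text is a simplified stand-in for gmond.conf, the metric is spelled disk_total in the sample as an assumption (this guide writes disk.total, so check the spelling in the actual file), braces inside comments or strings are not handled, and the function name set_group_intervals is hypothetical.

```python
import re

# Simplified stand-in for part of a gmond.conf file (assumption).
SAMPLE = """
collection_group {
  collect_every = 40
  time_threshold = 300
  metric {
    name = "disk_total"
    value_threshold = 1.0
  }
}
collection_group {
  collect_every = 20
  time_threshold = 90
  metric {
    name = "cpu_user"
  }
}
"""


def set_group_intervals(conf_text, metric_name, seconds):
    """Set collect_every and time_threshold in groups that contain metric_name."""
    result = []
    pos = 0
    for match in re.finditer(r"collection_group\s*\{", conf_text):
        if match.start() < pos:
            continue
        # Find the brace that closes this collection_group block.
        depth = 1
        end = match.end()
        while end < len(conf_text) and depth > 0:
            if conf_text[end] == "{":
                depth += 1
            elif conf_text[end] == "}":
                depth -= 1
            end += 1
        block = conf_text[match.start():end]
        if re.search(r'name\s*=\s*"' + re.escape(metric_name) + '"', block):
            for key in ("collect_every", "time_threshold"):
                block = re.sub(
                    r"(\b" + key + r"\s*=\s*)\d+", r"\g<1>" + str(seconds), block
                )
        result.append(conf_text[pos:match.start()])
        result.append(block)
        pos = end
    result.append(conf_text[pos:])
    return "".join(result)


updated = set_group_intervals(SAMPLE, "disk_total", 180)
print(updated)
```

Review the resulting text before writing it back. The Ganglia monitoring daemon normally needs a restart before changed intervals take effect.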
SACK Time-out:
For more details, see the To create an SCTP association profile section in the CSD User Guide.
Flush All
– Block
– Unblock
– Disconnect
For more details, see the Peer Management Dashboard section in the CSD User Guide.
• Port 22 - with the transport type TCP of SSH service on IO nodes.
• Port 8443 - with transport type TCP provides HTTPD service to OAM node.
• Port 8444 - with transport type TCP provides prometheus service to OAM node.
• Port 8888 - with the transport type TCP of XinetD is removed.
• Port 8890 - with the transport type TCP of XinetD is removed.
• Port 5775 - with transport type UDP provides Jaeger service to OAM node.
• Port 6831 - with transport type UDP provides Jaeger service to OAM node.
• Port 6832 - with transport type UDP provides Jaeger service to OAM node.
• Port 14250 - with transport type TCP provides Jaeger service to OAM node.
• Port 14267 - with transport type TCP provides Jaeger service to OAM node.
• Port 14268 - with transport type TCP provides Jaeger service to OAM node.
• Port 14269 - with transport type TCP provides Jaeger service to OAM node.
• Port 16686 - with transport type TCP provides Jaeger service to OAM node.
• Port 9093 - with transport type TCP provides alert manager service to OAM node.
• No change.
• Port 3306 - with the transport type TCP of MariaDB is removed.
• Port 10050 - with the transport type TCP of Zabbix is removed.
• Port 10051 - with the transport type TCP of Zabbix is removed.
• Port 15672 - with transport type TCP provides RabbitMQ service to OAM node.
• Port 15692 - with transport type TCP provides RabbitMQ service to OAM node.
• No change.
By default, the OVF templates consist of a root disk. Any other disks are known as independent
disks.
With this enhancement in VMware, data can be restored after performing heal (rebuild) operation.
Previous behavior: Total number of connected peers was not displayed.
Current behavior: Connections - Displays the total number of connections.
Previous behavior: Clicking the peer name displays the peers from the diameter peer form.
Current behavior: Clicking the peer name displays the peer name and the profiles attached to the peer.
Previous behavior: Status - The criteria in Status supports only AND criteria.
Current behavior: Status - Enables you to select the combination of the following actions.
• Provisioned
• Partial Provision
CSD Diameter Routing Advanced panel CSD Diameter Routing Advanced panel
For a SCTP multihomed peer connection, For a SCTP multihomed peer connection, both
single IP address was displayed under the configured IP addresses are displayed in
the output of listDiameterPeers.sh, if the
Previous behavior: The alarm logs are stored in the following path:
/var/log/alarms
Current behavior: The alarm logs are stored in the following paths:
/var/log/calm/alma.log
/var/log/calm/history.log
Note: From release 20.5 onwards, the Peer disconnect detected alarm with the severity
Minor is deprecated.
Previous behavior: Brevity control - Whenever a peer bounce is observed for the same peer more
than two times within five minutes, only two Peer disconnect detected alarms and their
corresponding clear alarms are triggered. Subsequent alarms for the same peer are
suppressed due to brevity control.
Current behavior: Brevity control - Whenever a peer bounce is observed for the same peer more
than ten times within two minutes, only ten Peer disconnect detected alarms and their
corresponding clear alarms are triggered. Subsequent alarms for the same peer are
suppressed for the duration of two minutes due to brevity control.
Table 77: Changes in Brevity control for Peer disconnect detected alarm
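The newer brevity control rule (at most ten Peer disconnect detected alarms per peer within two minutes, with further alarms suppressed) can be sketched as a per-peer counter over a time window. This is an illustration, not the product implementation: the use of a fixed two-minute window that starts at the first bounce is an assumption, and the class name BrevityControl is hypothetical.

```python
class BrevityControl:
    """Limit the number of alarms raised per peer within a time window."""

    def __init__(self, max_alarms=10, window_seconds=120):
        self.max_alarms = max_alarms
        self.window_seconds = window_seconds
        self._windows = {}  # peer name -> (window start time, bounce count)

    def should_raise(self, peer, now):
        """Return True if an alarm for this bounce should be raised."""
        start, count = self._windows.get(peer, (now, 0))
        if now - start >= self.window_seconds:
            # The previous window has expired, so start a new one.
            start, count = now, 0
        count += 1
        self._windows[peer] = (start, count)
        return count <= self.max_alarms


control = BrevityControl()
# Twelve bounces of the same peer, one per second (times in seconds).
decisions = [control.should_raise("peer", t) for t in range(12)]
# A bounce after the two-minute window has passed is reported again.
later_decision = control.should_raise("peer", 130)
print(decisions.count(True), decisions.count(False), later_decision)
```

Keeping the state per peer means that a flapping peer does not suppress alarms for other peers.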
Command Description
Command Description
Command: Log in to the DB node as asadm and get the output of info
Description: Enables you to verify the database cinder and memory usage.
Command Description
Command Description