As a customer who uses the LwM2M agent quite extensively, we have observed across agent upgrades/versions that changes in LwM2M agent logging surface exception and warning messages that were previously not visible or reported to customers. This can confuse and mislead operational users, especially during monitoring, and can impact health reporting if dashboards are tuned to alert on these exceptions. One example occurred during the LwM2M agent instability period and hotfix deployment: the agent started showing a number of 'Error' and 'Warn' messages (see examples below).
Only after a conversation with R&D did we understand that these messages were not a cause for concern.
This request covers the need for greater clarity and documentation of the expected logging messages and their definitions for the LwM2M agent (e.g. which messages are expected, which Error messages should actually raise concern, and which can be disregarded). We would also appreciate a summary of any expected logging changes between version upgrades that operational users should be aware of. Ideally this would be shared prior to LwM2M agent deployment so that downstream monitoring can be adjusted accordingly.
Thanks,
Tiana.
........
ERROR
ERROR 22232 --- [CoapServer(main)#1929] o.e.l.c.californium.LwM2mCoapResource : Exception while handling request [CON-POST MID= 50, Token=B5, OptionSet={"Uri-Path":"rd", "Content-Format":"application/link-format", "Uri-Query":["lt=115200","lwm2m=1.0","ep=urn:imei:867997033089086","b=UQ"]}, "</4/0>,</10262/0>,</1026".. 236 bytes] on the resource /rd from Identity /172.29.0.143:43781[unsecure]
java.lang.IllegalMonitorStateException: Current thread is not owner of the lock! -> <not-locked>
ERROR 123182 --- [UDP-Sender-0.0.0.0/0.0.0.0:5683[0]] o.e.californium.elements.UDPConnector : Exception in network stage thread [UDP-Sender-0.0.0.0/0.0.0.0:5683[0]]:
java.lang.NullPointerException: null
WARN - should we be concerned about the WARN messages below? They all seem to be related to partition migration.
Client is offline, exception: org.eclipse.leshan.core.request.exception.TimeoutException
[192.168.3.77]:5900 [dev] [3.12.7] Retrying local execution: com.hazelcast.internal.util.LocalRetryableExecution@1326dcb, Reason: com.hazelcast.spi.exception.PartitionMigratingException: Partition is migrating! this: [192.168.3.77]:5900, partitionId: 25, operation: com.hazelcast.map.impl.operation.MapPartitionDestroyOperation, service: null
WARN 9796 --- [hz._hzInstance_1_dev.partition-operation.thread-8] c.h.i.util.LocalRetryableExecution : [192.168.3.77]:5900 [dev] [3.12.7] Retrying local execution: com.hazelcast.internal.util.LocalRetryableExecution@1f1c46f0, Reason: com.hazelcast.spi.exception.PartitionMigratingException: Partition is migrating! this: [192.168.3.77]:5900, partitionId: 8, operation: com.hazelcast.map.impl.operation.MapPartitionDestroyOperation, service: null
WARN 22232 --- [hz._hzInstance_1_dev.partition-operation.thread-25] c.h.s.i.o.impl.Invocation : [192.168.3.77]:5900 [dev] [3.12.7] Retrying invocation: Invocation{op=com.hazelcast.concurrent.lock.operations.LockOperation{serviceName='hz:impl:lockService', identityHash=585607434, partitionId=25, replicaIndex=0, callId=20910675, invocationTime=1663696772045 (2022-09-21 03:59:32.045), waitTimeout=-1, callTimeout=60000, namespace=InternalLockNamespace{service='hz:impl:lockService', objectName=cPMD9iCjIy}, threadId=2043}, tryCount=250, tryPauseMillis=500, invokeCount=210, callTimeoutMillis=60000, firstInvocationTimeMs=1663695449048, firstInvocationTime='2022-09-21 03:37:29.048', lastHeartbeatMillis=0, lastHeartbeatTime='1970-01-01 10:00:00.000', target=[192.168.3.77]:5900, pendingResponse={VOID}, backupsAcksExpected=0, backupsAcksReceived=0, connection=null}, Reason: com.hazelcast.spi.exception.PartitionMigratingException: Partition is migrating! this: [192.168.3.77]:5900, partitionId: 25, operation: com.hazelcast.concurrent.lock.operations.LockOperation, service: hz:impl:lockService
.......
Hi Tiana,
Thanks a lot for your feedback.
When it comes to logging, we distinguish between:
Agent Log files: These are the LWM2M agent logs, optimised not for end users but for consumption by the Expert Operations and Development teams, who need to fully understand what the program is doing and - in the LWM2M case - how it interacts with the devices. The agent log entries are not stable: the development and operations teams are constantly learning and improving or changing the log messages to help them troubleshoot issues. Quite a bit of work has gone into improving the error logging in past releases. (If your dashboards alert on raw log lines, see the interim filtering sketch below.)
Alarm Messages: By contrast, alarm messages are meant for consumption by end users. They have to be phrased so that they are understandable without knowing how the system works internally. We are grateful for any feedback on alarm messages that have caused confusion for end users.
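On the filtering side: until better documentation of expected messages is available, one interim option is to whitelist, in your own monitoring pipeline, the messages that R&D confirms as benign (such as those from the instability period) so that they do not trigger alerts. Below is a minimal sketch in Java; the class name is hypothetical, and the patterns, taken from your excerpts above, are illustrative rather than an official list:

import java.util.List;
import java.util.regex.Pattern;

// Illustrative dashboard-side filter. Extend the pattern list only as R&D
// confirms further messages as benign.
public class BenignAgentLogFilter {

    // Patterns taken from the log excerpts quoted in this ticket.
    private static final List<Pattern> KNOWN_BENIGN = List.of(
            Pattern.compile("PartitionMigratingException: Partition is migrating"),
            Pattern.compile("Retrying (local execution|invocation)"),
            Pattern.compile("Client is offline.*TimeoutException"));

    // True if the line matches a message confirmed as harmless, i.e. it
    // should be excluded from alerting.
    public static boolean isKnownBenign(String logLine) {
        return KNOWN_BENIGN.stream().anyMatch(p -> p.matcher(logLine).find());
    }

    public static void main(String[] args) {
        String line = "WARN ... Reason: com.hazelcast.spi.exception."
                + "PartitionMigratingException: Partition is migrating! ...";
        System.out.println(isKnownBenign(line)); // prints: true
    }
}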
In addition to this, we are working on a monitoring endpoint built directly into the LWM2M agent that, alongside general health metrics, also reports LWM2M-specific metrics (LWM2M registrations per minute, max. concurrent active devices...). This would allow you to monitor the health of the agent without having to look into the log files. We will make sure that the metrics are properly documented in the operations guide. We hope this endpoint will remove the need for log-file analysis while still giving you confidence that the agent is working properly.
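As a rough illustration of how polling that endpoint could look from your side - the host, port, path and metric names below are placeholders, not the final API; the real ones will be documented in the operations guide:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch of a one-shot health check against the planned agent metrics
// endpoint. Everything endpoint-specific here is an assumption.
public class AgentHealthPoller {

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://lwm2m-agent:8080/metrics")) // placeholder URL
                .GET()
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        // Hypothetical payload, e.g.:
        //   lwm2m_registrations_per_minute 420
        //   lwm2m_active_devices_max 15000
        System.out.println(response.statusCode() + "\n" + response.body());
    }
}

A dashboard could then alert on the absence of registrations or on the endpoint being unreachable, rather than on individual log lines.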
Have a great Christmas break and a successful start to the new year!
Nikolaus