Are you seeing any Data Lost events?
I have not seen Data Lost events. However, since I've had to kill Mango to complete the last several restarts, if there were a data loss it would not have generated an event at that time.
If so these should enlighten us as to why this is happening, they get raised whenever a batch fails to write.
Thanks, that's reassuring. I had not remembered that there was such an event.
I would make sure that event level is set to something other than Do No Log too.
I had apparently set that event to critical. I did not see any such events, and in the last month I re-enabled critical events emails forwarding through my mobile phone network's email to SMS gateway. My phone sounds an attention getting submarine dive alarm when receiving Mango text messages.
Do you have any Alphanumeric points that would be saving very large strings of text?
Only one. I tried listing ALPHANUMERIC points in the table on the data_sources page, but for some reason it would not show a full query response (there were hourglasses at the bottom of the list). In any case, the JSONata command-line utility jfq enables me to specify more detailed reporting of the Mango configuration.
# Show name and loggingType and context update of all enabled alphanumeric points that are configured to log their values: $ jfq 'dataPoints[pointLocator.dataType="ALPHANUMERIC"][enabled=true][loggingType!="NONE"].[name,enabled,loggingType,pointLocator.context]' Mango-Configuration-Jul-17-2019_000500.json
I generate a large JSON structure to predict the tides based on a pressure sensor data. The table is displayed in a Mango 2.x graphical view via a server-side script graphical object that converts the JSON to an HTML table. The table is apparently using the Alphanumeric_Default template, which saves "When point value changes." However, it only updates context when the tide direction changes, which is only a handful of times per day. The only other alphanumeric point being logged has a context that triggers only twice a year.
The NoSQL database has a limit to the size of each entry, which is large but its possible this could cause these symptoms.
Point values to be writtenrises by about 1,700 values per day, which far exceeds the number of times the tide changes direction in one day.
I see there is now a Failed Login event. That's wonderful, thanks.
I just changed
the point clean interval (ms) and the stale data period (ms) to 60000
I will report back whether or not I see a change in the trend of the
Point values to be written. It should take an hour or two to see a change.
Thanks for your help.
If it's just the count getting off and not values accruing in memory, it's not a big problem.
Thanks. Hopefully that's the case, but I think there are accrued values hiding somewhere, as that would explain the prolonged shutdowns I experienced when I restarted Mango, whereby I eventually had to kill Mango. Accrued values aside, I need to be able to generate history after creating or changing metadata points, as they are of limited value without a history. I'm thinking that whatever is underlying this symptom is also affecting my ability to generate metadata value histories.
Sometimes when I generate history I'm wondering if I'm going to push Mango over the edge, where excessive resource demands cause a positive feedback loop of error messages that demand more resources, eventually requiring a restart for Mango to catch up. This occurred a year or two ago, but I can't recall whether it was a history generation or an Internet outage (and inability to send email alarms) that kicked off the positive feedback loop.
It would be nice to have automatic ad-hoc metadata points as a feature enhancement: if the metadata value is not found in the TSDB point log (e.g. because it was not logged), then it should be calculated at the time it is retrieved. This would result in calculating historical values only when the data is being retrieved, thus reducing computational load and the storage demand, yet still making it available for review or download on the occasion where the data is needed for analysis. Since the value would be calculated upon retrieval, values that are never retrieved would not have to be calculated or recorded. The values could optionally be logged when they are calculated, as they are now. Live values would still be displayed. Should I submit this feature request on Github?
Have you checked your log file and events for any errors in writing?
I did not find any ERRORs, but when I searched for tsdb I found WARNings:
07-14-2019-1.ma.log:WARN 2019-07-14T13:46:10,746 (com.infiniteautomation.tsdb.impl.Backdates$BackdatePoster.runImpl:187) - The backdate poster ran without inserting, queue size: 1 07-12-2019-1.ma.log:WARN 2019-07-12T05:05:10,802 (com.infiniteautomation.tsdb.impl.Backdates$BackdatePoster.runImpl:187) - The backdate poster ran without inserting, queue size: 1 07-12-2019-1.ma.log:WARN 2019-07-12T06:35:10,970 (com.infiniteautomation.tsdb.impl.Backdates$BackdatePoster.runImpl:187) - The backdate poster ran without inserting, queue size: 1 07-12-2019-1.ma.log:WARN 2019-07-12T12:13:51,286 (com.infiniteautomation.tsdb.impl.Backdates$BackdatePoster.runImpl:187) - The backdate poster ran without inserting, queue size: 1 07-12-2019-1.ma.log:WARN 2019-07-12T12:31:48,767 (com.infiniteautomation.tsdb.impl.Backdates$BackdatePoster.runImpl:187) - The backdate poster ran without inserting, queue size: 1 07-12-2019-1.ma.log:WARN 2019-07-12T13:20:10,772 (com.infiniteautomation.tsdb.impl.Backdates$BackdatePoster.runImpl:187) - The backdate poster ran without inserting, queue size: 1 07-12-2019-1.ma.log:WARN 2019-07-12T16:35:10,747 (com.infiniteautomation.tsdb.impl.Backdates$BackdatePoster.runImpl:187) - The backdate poster ran without inserting, queue size: 1 07-12-2019-1.ma.log:WARN 2019-07-12T17:35:11,002 (com.infiniteautomation.tsdb.impl.Backdates$BackdatePoster.runImpl:187) - The backdate poster ran without inserting, queue size: 1 07-13-2019-1.ma.log:WARN 2019-07-13T12:53:03,262 (com.infiniteautomation.tsdb.impl.Backdates$BackdatePoster.runImpl:187) - The backdate poster ran without inserting, queue size: 1 07-13-2019-1.ma.log:WARN 2019-07-13T16:04:19,245 (com.infiniteautomation.tsdb.impl.Backdates$BackdatePoster.runImpl:187) - The backdate poster ran without inserting, queue size: 1 07-14-2019-1.ma.log:WARN 2019-07-14T04:35:10,769 (com.infiniteautomation.tsdb.impl.Backdates$BackdatePoster.runImpl:187) - The backdate poster ran without inserting, queue size: 1 07-14-2019-1.ma.log:WARN 2019-07-14T06:20:10,757 (com.infiniteautomation.tsdb.impl.Backdates$BackdatePoster.runImpl:187) - The backdate poster ran without inserting, queue size: 1 07-14-2019-1.ma.log:WARN 2019-07-14T13:46:10,746 (com.infiniteautomation.tsdb.impl.Backdates$BackdatePoster.runImpl:187) - The backdate poster ran without inserting, queue size: 1
I had noticed that message a few weeks ago, after running a Generate History on a metadata point, but I had thought that was related to a large meta point history generation and that it would resolve itself after the task queues caught up when the CPU utilization fell back down. By the time I was noticed the rising values to be written, I had forgotten about the backdate issue. I think the Generate History function may be triggering or exacerbating the problem. However, the values to be written was rising daily long before I started the Generate History. Either way, I don't know what to do about it.
About how fast is it rising?
The rate is irregular when viewed on an hour timescale: the samples to be written may go down every few minutes, but the increases exceed the decreases and a linear trend is seen in the graph. It has been averaging out to an additional 1,700 values per day. Day/1700 ≈ 0.847 min/value ≈ 50.8 sec/value, so not your typical cron job interval. Viewed on a 1-day timescale, the values to be written plot looks like a straight line with a steady rising slope.
System uptime is 3,500 hours (almost 146 days). 3,500h/272,458values = 46.24 seconds/value in the last 146 days. 146 days ago the values to be written was 445,000, rising at 14,000values/day. At restart that day the values to be written reset to zero, then rose at roughly 1,800values/day. That must be the day I last upgraded Mango. If I remember correctly, I had to kill mango many minutes after initiating shutdown because it was taking too long to complete and I was concerned about prolonging a Mango outage.
$ date -d 'now - 146 days' Tue Feb 19 11:44:35 EST 2019
I poked around for large changes in internal data source readings before and after the update, and today, to see if they correlate with the rate of change in values to be written:
- Backdates running: usually 0 then and now.
- Backdates queued: Usually close to 0; 1 today. Spiked above 100 on June 20. The values to be written rose at a much faster rate around that time, but then the slope resumed to 1,700values/day. This was probably the day I ran the Generate History on a metadata point.
- Consecutive run polls: currently 0; spiked from 0 to 205,000 on June 15, then suddenly down to 0 on June 27.
- Currently scheduled high priority tasks is typically 12,000-18,000. Does not seem to correlate with other issues.
- Point value write threads is currently 3. It was 5 before the Feb 19 restart, and appears to be going up in proportion to the number of values to be written.
- Total backdates is currently 216,293. It had climbed slowly to 1,200 at which time it was reset to 0 at the Feb 19 restart. Since then it climbed slowly to around 3,000 on May 1, at which point it shot up suddenly to almost 60,000 that day, then resumed its slower rate of increase. On June 21 it shot up suddenly from about 60,000 to 215,000, then resumed its slower increase rate.
I believe the total backdates shot up when I ran a Generate History, and the system has not caught up with the backlog. This is odd because CPU utilization is usually more than 80% idle and CPU % IO wait is currently 0. If I had not ran the Generate History, it seems that these numbers would still be increasing indefinitely, but not at such a fast rate.
I graphed the
Point values to be writtenthree hours before and after the change in TSDB settings. It continues to rise at the same rate, and is now at 267,341 points. I saw no visible difference in the graph.
Database batch writes per secondvaries between 85 and 100, as it did before.
Thanks for your help. Yes, the values to be written are still climbing at a steady rate, as they have been at least for months, probably since last time I restarted Mango. I presume there were a lot of accumulated points to be written during my upgrade to 3.5.6, as that would explain the long shutdown time.
The TSDB setting values I posted yesterday were either the default when I converted to TSDB years ago, or they were changed to new values recommended by IA, probably around the day I converted to TSDB. I doubt I would have the courage to change them on my own without understanding their meaning. Since I have not seen documentation regarding the meaning of each of these settings, I have been afraid to change them. Before noticing that the
values to be writtenwas consistently increasing, I also did not see a need to change the settings, so I left them alone.
Per your suggestion I just changed only the following settings (old ==> new):
max write behind tasks: 100 ==> 10 increase the minimum time before flushing a small batch: 1 ==> 100 increase the maximum batch size: 1 ==> 10000 increase the minimum batch size: 1 ==> 1000
The default settings are geared toward quality of service, not necessarily throughput. I would only expect persistent TCP syncs or extreme numbers of points / poll rates to encounter that
There are 58 data sources (of which 15 are metadata sources). Most of the data sources are Modbus/TCP with a typical polling rate of either 1, 2, or 5 seconds.
There are about 1300 enabled data points, though many are from metadata sources.
(or slow disks, which are aided by batching writes, generally).
I don't think the issue is slow disks because iowait is low.
It's too soon to see if the trend of increasing
values to be writtenchanged. I'll check the
values to be writtenlater, and see if it started to decrease, then report back.
This is MA core 3.5.6 with NoSqlDtabase 3.5.1. Sorry the version I provided earlier made no sense.
The Point values to be written is now at 265,275 points. I don't usually look at this value, so it piled up at a steady rate before I noticed. Doesn't a low iowait indicate that a faster hard drive or lowering the write rate should not make a difference?
If this Mango version is new enough, what are the NoSQL fiddling recommendations?
I have never changed them from the default:
Batch process manager period (ms) 500 Batch write behind spawn threshold 100000 Max batch write behind tasks 100 Maximum batch size 1 Minimum batch size 1 Delay for writing backate batches (ms) 5000 Minimum time to wait before flushing a small batch (ms) 1 Point clean interval (ms) 1000 Status update period (ms) 5000 Stale data period (ms) 1000 Corruption scan task threads 100
Would changing these settings require a restart?
The Mango Internal data source attribute "Point values to be written" is climbing at a steady rate. This rate increased when I generated a metadata point history, but then resumed to the regular rate of an additional 1,700 values per day.
Where are these points being stored? In RAM or in a disk cache? If I do not take action, what can eventually happen?
During a Mango shutdown, are these points written to disk (causing a shutdown delay), or are they discarded?
More importantly, what are my options to reduce the "Point values to be written?"
- Reduce point logging rate?
- Install a faster hard drive?
- Will setting a non-zero Default cache size on certain points have any effect on this?
- Any other options?
Possibly relevant data from Mango sources:
- Mango 2.5 CPU idle is usually > 80%
- CPU % IO wait = usually 0%, with a few infrequent daily spikes to 12%.
- Free Memory appears to be steady around 200,000 bytes.
- JVM free memory = 3,431MB
- JVM maximum available memory 4,993MB
- Virtual swap memory is ~ 1MB
- Scheduled work items = 12,976 items
- Nothing relevant noticed in ma.log
I know I can query Mango to determine which points are taking up the most space in the database, but that is not reflective of the rate at which they are currently being written because:
- the purge age may be different
- the data may change, triggering more or fewer tolerance threshold value logging events
- the logging settings may have changed since the point was created
- recently created points may have a higher logging rate even with a much smaller total number of points
Is there a way I can find out which points are being written most often? For instance, obtain a ranked list of the number of values written for each point in the last 24 hours?
@phildunlap Thanks. After upgrading to Graphical Views 3.5.3, the grey brick now appears during mouseover on graphicalView objects, like it used to in earlier Mango versions. And more importantly, it links to the relevant point details page.
I have not changed the table annotation yet. Do I understand correctly that after I edit
web/WEB-INF/snippet/tableChart.jspI can delete the /opt/Mango/work/jsp directory while my Mango instance is running, and then just refresh the page?
Will the above functionality be restored? Specifically:
- Graphical object mouseover: link to corresponding point detail view
- mouseover value/time table: display login name that changed the point value
These features used to work, and were very handy. Thanks.
core Version 3.5.6
graphicalViews Version 3.5.2