Your comments all make sense. Responses in kind below:
- Do you have a device that replicates it? It sounds like your test device is just garbage collecting normally.
Based on your information, I would agree with your assessment that the issues have not been fully replicated. I have no Java programming experience, and (incorrectly) assumed gc operations were similar to Python's reference counting.
So, it seems that the memory usage chart from my last post is "normal" for the JVM, and I have misinterpreted the large upward spikes of memory as a restart when it is, in fact, normal gc behavior. Would you agree with this assessment? In my time on site (chart from original post) I only saw those large blocks of memory freed when I had restarted the mango service, and, by extension, the underlying Java process.
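One way I could check this myself, assuming the JVMfreemem history can be exported as a plain list of numbers: in a healthy sawtooth, the free memory right after each large collection returns to roughly the same level, while with a leak those post-collection peaks drift downward. The sketch below is only an illustration of that idea. The 50 MB jump threshold and the sample values are made up, and the series alone cannot tell a collection from a service restart.

```python
def post_gc_peaks(free_mb, min_jump_mb=50.0):
    """Return the free-memory value right after each large upward jump.

    A rise of at least min_jump_mb between consecutive samples is treated as a
    major collection (or a restart; the series alone cannot tell them apart).
    """
    peaks = []
    for prev, cur in zip(free_mb, free_mb[1:]):
        if cur - prev >= min_jump_mb:
            peaks.append(cur)
    return peaks


def peak_trend(peaks):
    """Average change between successive post-collection peaks (negative = shrinking)."""
    if len(peaks) < 2:
        return 0.0
    return (peaks[-1] - peaks[0]) / (len(peaks) - 1)


# Made-up samples in MB, not data from either device.
samples = [300, 220, 140, 60, 290, 210, 130, 50, 280, 200, 120, 40, 270]
peaks = post_gc_peaks(samples)
print(peaks, peak_trend(peaks))
```

A trend near zero would support the "normal gc" reading; a steadily negative trend would suggest memory that is never reclaimed.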
- Usually the OOM killer would only get involved if Java was trying to claim more memory from the system, or if there was another task running on the machine that was attempting to claim too much memory.
Java by itself will grind away for quite a long while if there is a lower memory ceiling before it crashes, unless it tries to do a sudden large allocation (doesn't appear to be happening from what you've said).
These are MangoES devices, and all configuration was done through the Mango UI. I am unsure what other processes could be affecting OS memory usage that would not have been captured in a database copy. What OS information would be useful in troubleshooting other processes? Only the /opt/mango/hs_err* logs come to mind.
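If per-process memory would help, something like the following could be run on the device, assuming Python 3 is installed there (I have not confirmed that on a MangoES). It reads the standard Linux /proc/<pid>/status files and lists the processes with the largest resident memory. It is a rough sketch, not a replacement for top or ps.

```python
import os


def parse_vmrss_kb(status_text):
    """Extract VmRSS (resident memory, kB) from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None  # kernel threads have no VmRSS line


def top_by_rss(limit=10, proc_root="/proc"):
    """Return (rss_kb, name, pid) tuples for the largest processes, biggest first."""
    if not os.path.isdir(proc_root):
        return []
    rows = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "status")) as fh:
                text = fh.read()
        except OSError:
            continue  # process exited while we were scanning
        rss = parse_vmrss_kb(text)
        if rss is None:
            continue
        # The first line of the status file is "Name:\t<command>".
        name = text.splitlines()[0].split(":", 1)[1].strip()
        rows.append((rss, name, int(entry)))
    return sorted(rows, reverse=True)[:limit]


if __name__ == "__main__":
    for rss, name, pid in top_by_rss():
        print("{:>10} kB  {} (pid {})".format(rss, name, pid))
```

Running it a few times over a day and comparing the lists would show whether anything besides the Java process is growing.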
I've seen this with an Excel Report that had the named ranges defined out to 1 million for instance - there was a since-removed behavior in the Excel reports that tried to load and blank out extra cells in named ranges.
There is a single report configured (non-Excel), but it runs only once a month. The last run took < 500 ms, so this does not seem to be the issue.
In the original post, it looks like there was a reset at about 5 AM on January 21st when Mango had a lot of memory still allocated to it and unused by Java. Either a massive sudden allocation occurred, or something else on the device is demanding the memory.
Would it be a reasonable assessment to attribute the Jan 21, 5 AM spike in free memory (original post) to gc operations? There was no one working on the system at that time, and memory use continued in the same pattern afterwards.
- I'm not convinced your test device had an OOM since you said there were no service interruptions. From this chart, I would think you could try decreasing the memory allocated to Java in /opt/mango/bin/ext-enabled/memory-small.sh such that Java never irritates the operating system. But, without having changed that, the default is pretty conservative to avoid OOM kills. This would make me question if something else running on the device is the issue.
Based on your information above, neither am I. If the issue had been fully replicated, I should see the same html client behavior and log generation as you mentioned in (1). If the issue has not been replicated, the only other step in troubleshooting I can consider is getting a MangoES v3 to load the backup databases. I'm at a loss for ideas here at this point because we are running on a device pre-configured for the mango service. Do you have any ideas on what else could be running?
You are unlikely to be using the same JDK 8 revision on both devices.
I can get the client to check JDK version overnight. Based on your last post, the Java process itself does not appear to be the root cause for the behavior. Does that make sense or am I jumping to conclusions?
Edited to include replies in context.
Quick recap: I am attempting to replicate, on a local test MangoES, the memory leak and OOM kill of the mango service seen on a MangoES in a production environment overseas. The service is killed by the OS, requiring a hard device restart or service restart commands via terminal. A hotfix was implemented as a cron job that restarts the service at 1 AM daily, but this has side effects that are unacceptable to the client. The goal is to either eliminate the memory leak or handle the error silently.
Current Problem Status
I have the production SQL and NoSQL databases loaded onto the test MangoES and the memory consumption behavior is the same as seen on the production MangoES.
However, the response of the mango service on the test device is exactly what is desired in the face of memory exhaustion: the mango service restarts itself without any apparent service interruption. Memory is freed and system operation continues like nothing happened.
The graph below shows JVMfreemem on the test MangoES. I am unable to find anything unusual in the archived ma.log at the points of major memory deallocation. The system simply chugs along as if nothing happened; no OS kill notice in dmesg, no hs_err log generated.
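For reference, this is roughly how I searched the kernel log, sketched in Python so it can be run against a saved dmesg or syslog dump. The phrases in the pattern are ones the Linux OOM killer has used, but the exact wording varies between kernel versions, so the list is an assumption. The sample text is hand-written for illustration and is not output captured from either MangoES.

```python
import re

# Phrases associated with the Linux OOM killer; wording differs between
# kernel versions, so treat this pattern as an assumption.
OOM_PATTERN = re.compile(r"out of memory|oom-killer|killed process", re.IGNORECASE)


def find_oom_lines(log_text):
    """Return the lines of a dmesg or syslog dump that look like OOM-killer activity."""
    return [line for line in log_text.splitlines() if OOM_PATTERN.search(line)]


# Hand-written illustrative lines only.
sample = (
    "eth0: link up\n"
    "java invoked oom-killer\n"
    "Out of memory: Kill process 1234 (java)\n"
    "Killed process 1234 (java)\n"
)
for line in find_oom_lines(sample):
    print(line)
```

An empty result on the test device's log is what I mean by "no OS kill notice".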
If the production device acted as elegantly as this test one, the original memory leak would not have been an issue.
Test MangoES info (software stock except mango core upgrade):
- Hardware: MangoES v2
- $ uname -a --> Linux mangoES2227 18.104.22.168
- $ cat /etc/issue --> Mango ES version 1.0
- $ cat /etc/debian_version --> 8.1
- $ java -version --> 1.8.0_33
- mango core --> 3.2.2+20171009170034
- Running backup database but no other network devices available (i.e. no TSDB data being added)
Production device (what info I have):
- Hardware: MangoES v3, s/n 3436
- OS and mango modules stock from factory
- Fully connected to field devices over Modbus, logging times ranging from 15 seconds to 1 minute.
Questions to move forward:
(1) Now that I have a database that replicates the problem, can we find and rectify the root cause of the memory leak? I can make these files available as required.
(2) Why did I not see the same memory error handling behavior between the production MangoES and test MangoES?
(3) If we are unable to find a root cause, how can I get the production device to handle OOM like the test device?
Edits made for clarity and additional differences between production and test devices.
That reads like you moved a .zip 'backup' from the Mango/backup directory into the Mango/databases directory on another Mango. The zip files in Mango/backup for core-database-....zip are SQL dumps and need to be restored.
To clarify: I used the USB backup utility to create a database backup while on site. From the backup folder on the USB stick I copied the mah2.h2.db file to a local machine and used scp to copy it to /opt/mango/databases/ on the test MangoES.
As the client site is basically half way around the world I will have to request them to find and forward the env.properties file for me to find the appropriate
Where is the env.properties file located? I think I am looking at /opt/mango/databases/reports/db.properties (i.e. the wrong thing). Edit: found it in both <mango>/classes/env.properties and <mango>/overrides/properties/env.properties
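Since there are two copies, a small sketch like this could show which database settings actually apply once both files are in hand. It assumes the file under overrides takes precedence over the one under classes, and that the database settings use keys beginning with `db.`; both are my assumptions, and the file contents shown are placeholders rather than values from a real MangoES.

```python
def load_properties(text):
    """Parse simple key=value lines from a Java .properties file.

    Minimal parser: skips blank lines and # or ! comments, and does not
    handle line continuations or escape sequences.
    """
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#!":
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def effective_db_settings(base_text, override_text):
    """Merge two property sets, letting the override file win, and keep db.* keys."""
    merged = load_properties(base_text)
    merged.update(load_properties(override_text))
    return {k: v for k, v in merged.items() if k.startswith("db.")}


# Placeholder contents; substitute the text of the two real files.
base = "db.type=h2\ndb.username=example_user\nweb.port=8080\n"
override = "# site overrides\ndb.username=other_user\n"
print(effective_db_settings(base, override))
```

Comparing the result from the production files against the test device should show whether the credentials differ.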
I am moving forward with the troubleshooting process. I have some new information but some of this brings more questions than answers.
JVMfreemem data update
The chart below shows the JVMfreemem datapoint of the production system over a 24 hour period sent by the client, from approximately 4:30 PM yesterday. What I believe we are seeing here is the rotating slide deck in operation until 1 AM, when a cron job restarts the mango service. When the service is restarted, the connection to the computer displaying the slides is lost until about 8:30 AM, when a staff member comes in and refreshes the page. From there the slides continue to rotate and the previous memory usage pattern (and downward trend) resumes.
Any other insights or observations that can be gained from this data from anyone with more experience?
There is only one machine on the network that connects to the Mango UI or otherwise can make HTML requests. Slides are rotated with the <ma-now> method found in the docs. Is there a method that could force a continual retry of page rotation in the event of disconnection? This may be an acceptable solution to the client in the absence of memory leak root cause analysis.
I am still trying to replicate this memory usage pattern on a test MangoES device on my bench but have been thus far unable to load the H2 SQL database from above (see below).
In my reply yesterday I neglected to mention that the H2 database file made from an on-site backup is ~200 MB. I do not know if this constitutes a "large" database, but for comparison the database that was on the test device is only ~27 MB. Is 200 MB "large", and where could this extra data have come from? The device underwent a number of configuration changes over approximately a week before going into production.
To attempt to replicate the issue, I am trying to load the large backup SQL database onto a local testing MangoES by direct copy of the mah2.h2.db file. However, when I copy the backed up file into the /opt/mango/databases/ directory, I get the following error on mango service restart:
Could not get JDBC Connection; nested exception is org.h2.jdbc.JdbcSQLException: Wrong user name or password [28000-181]
How can I find and set the H2 database user/passwords when transferring databases between MangoES devices?
Thanks for the info. I'll try to keep the various troubleshooting topics together under headings in my reply below. TLDR: I'm going to track down hs_err logs and am still waiting on on-site JVMfreemem data, but am attempting to replicate conditions on a test MangoES in the meantime.
I would assume Java 8 as it is running mango v3. This is a stock MangoES, so the Java version is unchanged from what ships from the factory. Serial number is 3436.
I pulled a number of logs from the MangoES before leaving site, but it looks like hs_err* were not among them. If the hs_err* log would be placed in the Mango home folder (/opt/mango/logs/) this means it should be available in the UI Administration > System Status > Logging Console. I will ask for a downloaded copy.
Interesting article, but I agree this is not the correct solution for production.
Device is certainly running NoSQL database, the size was <20MB when I left site. This was the first week post commissioning, so there is minimal data storage yet. For my own understanding, does the H2 SQL database store device and data point configuration while NoSQL is for time series data?
The USB backup utility saved the following to the 'backup' folder: mah2.h2.db files and folders for mangoTSDB and mangoTSDBAux. Good to know I can just stop the mango service, overwrite the files in /opt/mango/databases/mangoTSDB, then restart. I am assuming the same is true of the mah2.h2.db blob. Would I just have to zip this file if I wanted to use the UI for SQL migration?
Is there any advantage to loading NoSQL database through the UI as opposed to overwriting the files directly?
Thanks for this. If I don't have to use it now, it's going in the toolbox for later. I was wondering what best practices were in terms of bulk system edits -- this is a straightforward solution.
I am digging a little bit deeper into this issue as some of the side effects of nightly mango service restarts are not acceptable to the client (loss of connection with slide display, requirement to manually log in as admin every morning).
Unfortunately I no longer have access to the system but am attempting to replicate the issues on a test MangoES with the same software version. Currently, I have loaded the configuration from the field device via JSON import, but have not been able to replicate the memory leak issue.
To me this actually helps to reduce the troubleshooting space. The slides (sans images) and data points are now configured but memory allocation and garbage collection seem to be rock steady. Slide deck started at about 0630 in this test run:
In my mind, this leaves the potential leaky culprits as one of:
(a) Modbus polling to data points on field devices.
(b) Meta data point summation/averaging on existing data.
(c) Large images loading into the rotating slide deck [though my intuition says probably not this].
I will be getting actual field data of JVMfreemem from the site in the next few days to check if the issue has persisted now that configuration changes have stopped.
In the meantime, I would like to chase down the meta data point lead, but need to load in past data or some numbers for those points to crunch. I made a backup of the database with the USB stick before leaving site but had to leave the USB utility at the facility. Is there any way to manually load the database backup on to a running Mango? Or, is there a way I could fabricate data for the meta point's contextual references (that are "concrete" data points)?
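If fabricating data turns out to be the way to go, this is the kind of thing I had in mind: a sketch that generates one value per minute for a day as generic CSV text. The `timestamp,value` column layout is hypothetical and is not a Mango import format, and the start date and sine-wave shape are arbitrary choices for illustration.

```python
import csv
import io
import math
from datetime import datetime, timedelta


def synthetic_series(start, minutes, base=50.0, swing=10.0):
    """Yield (timestamp, value) pairs, one per minute, following a daily sine wave."""
    for i in range(minutes):
        ts = start + timedelta(minutes=i)
        value = base + swing * math.sin(2 * math.pi * i / 1440)
        yield ts, round(value, 3)


def to_csv(rows):
    """Render rows as CSV text with hypothetical 'timestamp,value' columns."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["timestamp", "value"])
    for ts, value in rows:
        writer.writerow([ts.isoformat(), value])
    return buf.getvalue()


# Arbitrary start date; one day of one-minute values.
rows = list(synthetic_series(datetime(2018, 1, 1), minutes=1440))
print(to_csv(rows[:3]))
```

The output would still need to be reshaped into whatever format Mango can actually ingest.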
(1) I did not think anything in there should be too severe - mostly 12 hour histories of analog values polled at 15 second intervals, rolled up to 15 minute averages. Worst case is a meta point script that uses the <point>.past(DAY,1).sum method on a point logged every minute, so ~1400 values to calculate on each context update (i.e. a running totalizer calculated every minute).
(2) Event logs for dropped connections show:
(a) TCP with keep-alive configured for data source:
Exception from modbus master: No recipient was found waiting for response for key com.serotonin.modbus4j.ip.xa.XaWaitingRoomKeyFactory$XaWaitingRoomKey@851f
(b) TCP configured for data source:
com.serotonin.modbus4j.exception.ModbusTransportException: java.net.SocketTimeoutException: connect timed out
Data source with xid: pm2 and name: Power Meter 2 aborted a poll, check its settings.
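To put rough numbers on the workload described in (1), here is the arithmetic as a short sketch. It assumes the totalizer meta point really does re-read a full day of one-minute values on every one-minute update, which is how I understand the script to behave.

```python
MINUTES_PER_DAY = 24 * 60

# One logged value per minute over past(DAY, 1).
values_per_update = MINUTES_PER_DAY
# The meta point is recalculated once per minute.
updates_per_day = MINUTES_PER_DAY
values_summed_per_day = values_per_update * updates_per_day

# For comparison: samples in a 12 hour history polled every 15 seconds.
samples_12h_at_15s = 12 * 3600 // 15

print(values_per_update, values_summed_per_day, samples_12h_at_15s)
```

So the "~1400 values" per update is 1440 exactly, or about two million values summed per day for that one point.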
Overall, thanks for the succinct info! Between adding 300MB in the //Mango/bin/ext-enabled/memory-small.sh script and adding a RESTART file to the mango parent directory, everything should run smoothly from the user's point of view. I will have to spend some time benchmarking individual display slides at a later date.
I am using a MangoES v3 with software version 3.2.2 for a modest system of ~200 Modbus TCP data points across a dozen data sources. Since installing the device on site and connecting field devices I have experienced the JVM being killed by the OS for lack of memory, as reported in dmesg. The mango process is not automatically restarted when this occurs and the board must be either power cycled or the mango service manually restarted via ssh.
There are two confounding issues here:
(1) Memory leakage seems to occur only when a set of rotating HTML pages is displayed in a lobby area. The pages contain a few point values, serial charts, and images (nothing exotic). Pages rotate with the <ma-now> method recommended in the docs. See image below of JVMfreemem data point, restarts are apparent:
(2) There are sporadic network outages with dropped connections between the MangoES, a PC running the display, and field devices (approx once every 10 min).
(1) Would the need to re-establish those connections contribute to memory consumption?
(2) What could be the underlying causes of the memory leak and why does the mango service not restart itself?
(3) I could create a cron job to restart the mango service every night but that seems like a hotfix at best. Can I create a cron job with sudo for user 'mango' rather than root?