#zincobserve

Troubleshooting High CPU and Memory Usage in Zinc Service

TLDR Zygimantas experienced high CPU and memory usage in zinc service while querying large datasets. Hengfei offered suggestions to optimize and test for local disk and S3 usage. Gaby and Prabhat discussed SIMD tag performance.

May 19, 2023 (1 week ago)
Zygimantas
08:19 AM
cpu was at 97%
Hengfei
08:20 AM
When querying, we try to use all CPU cores, so CPU usage should reach 100%.
08:20
Hengfei
08:20 AM
With more data, it needs more time: maybe seconds, maybe minutes, depending on your CPU speed and your data scale.
08:21
Hengfei
08:21 AM
How much data do you have for about 3 days? 10GB?
Zygimantas
08:21 AM
i see the issue, OOM killed the zinc service 😄
Hengfei
08:21 AM
The OOM should be because too much memory was used. By default we try to load all files (in the time range, e.g. 3 days) into memory to search.
08:22
Hengfei
08:22 AM
So, what is your data size? You can see it in the UI, on the streams page.
Zygimantas
08:22 AM
Ingested Data 552157.6 MB
08:22
Zygimantas
08:22 AM
2023-05-15T09:50:52:50+03:00
to
2023-05-19T11:22:25:08+03:00
Hengfei
08:22 AM
How much after compression?
Zygimantas
08:23 AM
Compressed Size 38,210 MB
Hengfei
08:23 AM
4GB
08:23
Hengfei
08:23 AM
When you query, do you use functions?
Zygimantas
08:23 AM
not yet, i was just hoping to query past few days
Hengfei
08:24 AM
We suggest having 8GB of memory to query this data scale, or try reducing the queried time range.
08:25
Hengfei
08:25 AM
How much memory do you give to the zincobserve pod?
Zygimantas
08:25 AM
😄 it has 64gb of ram
08:25
Zygimantas
08:25 AM
query topped up all ram and cpu
08:26
Zygimantas
08:26 AM
[screenshot attached]
✅ 1
Hengfei
08:26 AM
With 64GB, querying this data scale shouldn't use all of the memory.
Zygimantas
08:27 AM
AMD Ryzen™ 5 3600
RAM 64 GB DDR4
Disk:2 x 512 GB NVMe S
08:27
Zygimantas
08:27 AM
running from local ssds for now
Hengfei
08:28 AM
Sorry, I misread it as 4GB. It is 38GB: Compressed Size 38,210 MB.
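As a quick sanity check on the sizes quoted above, the following sketch converts the two figures from the streams page and derives the compression ratio. It assumes 1 GB = 1000 MB; the function names are illustrative only and not part of ZincObserve.

```python
# Figures quoted in the thread (taken from the streams page in the UI).
INGESTED_MB = 552_157.6
COMPRESSED_MB = 38_210.0


def mb_to_gb(size_mb: float) -> float:
    """Convert MB to GB, assuming 1 GB = 1000 MB."""
    return size_mb / 1000


def compression_ratio(ingested_mb: float, compressed_mb: float) -> float:
    """Return how many times smaller the compressed data is."""
    return ingested_mb / compressed_mb


if __name__ == "__main__":
    print(f"ingested:   {mb_to_gb(INGESTED_MB):.1f} GB")
    print(f"compressed: {mb_to_gb(COMPRESSED_MB):.1f} GB")
    print(f"ratio:      {compression_ratio(INGESTED_MB, COMPRESSED_MB):.1f}x")
```

The compressed size comes out at roughly 38 GB, not 4 GB, and the data shrinks by a factor of about 14.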
Zygimantas
08:31 AM
And this is only 1/3 of our logs, i think we might just split into 3 different servers
Hengfei
08:33 AM
Can you try setting:
ZO_MEMORY_CACHE_MAX_SIZE=20480

This sets the memory cache limit to 20GB. Then query all the data again to see if it works.
Zygimantas
08:35 AM
Sure, after the lunch
Hengfei
08:36 AM
Thank you
Zygimantas
09:21 AM
2 days and then 3 days query
[screenshot attached]
Hengfei
09:22 AM
Killed again?
Zygimantas
09:25 AM
not yet, but all swap is gone
Hengfei
09:26 AM
Do you use docker or k8s?
Zygimantas
09:51 AM
binary on physical server
09:52
Zygimantas
09:52 AM
70% of ram reserved after the query
10:01
Zygimantas
10:01 AM
one more query for 3 days and ZO totally dead
[screenshot attached]
10:03
Zygimantas
10:03 AM
query "bytes_count='4958'" for 3 days also kills zinc
Hengfei
10:04 AM
Actually, in your scenario you can disable the memory cache; it should still work.
Set:
ZO_MEMORY_CACHE_ENABLED=false

or, set a small size:
ZO_MEMORY_CACHE_MAX_SIZE=10240
10:05
Hengfei
10:05 AM
With a small memory cache, a query over just 2 hours can still use the cache, but a query over the full data needs to load files from disk every time.
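The two options above can be sketched as a small helper that builds the environment overrides before the binary is started. This is only an illustration: the helper names are hypothetical, and the MB unit is inferred from the thread, where 20480 was described as 20GB.

```python
import os
from typing import Dict, Optional


def memory_cache_env(limit_gb: Optional[int]) -> Dict[str, str]:
    """Build the environment overrides suggested in the thread.

    limit_gb=None disables the memory cache. Otherwise the limit is
    converted to MB (assumed unit, since 20480 was described as 20GB).
    """
    if limit_gb is None:
        return {"ZO_MEMORY_CACHE_ENABLED": "false"}
    return {"ZO_MEMORY_CACHE_MAX_SIZE": str(limit_gb * 1024)}


def launch_env(limit_gb: Optional[int]) -> Dict[str, str]:
    """Merge the overrides into a copy of the current environment."""
    env = dict(os.environ)
    env.update(memory_cache_env(limit_gb))
    return env


if __name__ == "__main__":
    print(memory_cache_env(10))    # the smaller cache size from the thread
    print(memory_cache_env(None))  # memory cache disabled
    # The result of launch_env(...) could be passed as env= to
    # subprocess.run when starting the binary; the binary path is
    # deployment-specific and is left out here.
```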
Zygimantas
10:09 AM
i do not get any results with 3 day range 😕
10:09
Zygimantas
10:09 AM
with any query, for example backend='test'
Hengfei
10:09 AM
With the cache disabled?
Zygimantas
10:09 AM
both ways, with enabled cache and disabled
Hengfei
10:11 AM
Open the Chrome developer console; you should see some errors.
Or on the server side, you should also see some errors.
Zygimantas
10:12 AM
[screenshot attached]
Hengfei
10:13 AM
504 is a timeout.
10:14
Hengfei
10:14 AM
Can you check the network request to see how long it takes to time out? Our default is 600s.
Zygimantas
10:17 AM
yes, 600s
Hengfei
10:19 AM
okay...
10:20
Hengfei
10:20 AM
It means the query is tooooo slow. I will optimize local disk queries for cases without enough memory, such as when the memory cache is disabled.
Zygimantas
10:22 AM
and it's just 3 days, we were planning to store one month of logs
Hengfei
10:24 AM
Yes. In my earlier test, the query speed was 2.4GB/s per CPU core. Your CPU is a 6-core, 12-thread processor, so it should reach at least 10GB/s; 500GB should be done in about 2 minutes.
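The estimate above can be worked through with a few lines of arithmetic. This is a back-of-the-envelope sketch using only the figures quoted in the thread (2.4GB/s per core, 6 cores, a 10GB/s lower bound, 500GB of data, a 600s timeout); it assumes throughput scales linearly with core count, which real queries may not achieve.

```python
PER_CORE_GB_S = 2.4    # per-core scan speed quoted in the thread
CORES = 6              # physical cores of the CPU in question
LOWER_BOUND_GB_S = 10  # the "at least" figure quoted in the thread
DATA_GB = 500          # approximate uncompressed data size
TIMEOUT_S = 600        # default query timeout mentioned in the thread


def scan_seconds(data_gb: float, throughput_gb_s: float) -> float:
    """Time to scan data_gb at a constant throughput."""
    return data_gb / throughput_gb_s


def ideal_seconds() -> float:
    """Assumes perfect linear scaling across all cores."""
    return scan_seconds(DATA_GB, PER_CORE_GB_S * CORES)


def conservative_seconds() -> float:
    """Uses the lower-bound throughput instead of the ideal one."""
    return scan_seconds(DATA_GB, LOWER_BOUND_GB_S)


if __name__ == "__main__":
    print(f"ideal:        {ideal_seconds():.0f} s")
    print(f"conservative: {conservative_seconds():.0f} s")
    print(f"timeout:      {TIMEOUT_S} s")
```

Under these assumptions both estimates land under a minute, comfortably inside the "about 2 minutes" quoted and far below the 600s timeout, which is why a query that never returns points to something else going wrong.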
10:25
Hengfei
10:25 AM
10 minutes and still no result. Maybe something is not working as we expect.
10:25
Hengfei
10:25 AM
Yes, my test was with everything cached in memory.
10:38
Hengfei
10:38 AM
I will do some tests again and try to improve it.
👌 1
Gaby
11:11 AM
Why not use a MemMap?
Hengfei
11:11 AM
Yes, we will try mmap.
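For readers unfamiliar with the idea, here is a minimal sketch of searching a file through a memory map instead of reading it fully into memory first. It uses Python's standard `mmap` module and a small plain-text file as a stand-in; it is not how ZincObserve reads its data files, and the function names are illustrative.

```python
import mmap
import os
import tempfile


def count_matches_mmap(path: str, needle: bytes) -> int:
    """Count non-overlapping occurrences of needle via a read-only mmap.

    The OS pages file contents in on demand, so the process does not
    need to hold the whole file in its own heap.
    """
    count = 0
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            pos = mm.find(needle)
            while pos != -1:
                count += 1
                pos = mm.find(needle, pos + len(needle))
    return count


def demo() -> int:
    """Write a tiny sample file and search it through the memory map."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"backend=test\nbackend=prod\nbackend=test\n")
        path = tmp.name
    try:
        return count_matches_mmap(path, b"backend=test")
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```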
11:12
Hengfei
11:12 AM
We mainly optimized for S3. For local disk, I earlier assumed it would be just a single instance for small users, who would not have too much data.
Zygimantas
11:17 AM
I can migrate to s3 if it will solve the issue
Hengfei
11:18 AM
Please wait while I do some more tests.
🙌 1
Gaby
11:22 AM
Yesterday I was trying to query around 100 million events, and it takes around 10-11secs with ZO
Hengfei
11:23 AM
Yep, I will do more performance tests.
Gaby
11:26 AM
I'm curious how much performance the SIMD tag has compared to the regular one
Hengfei
11:29 AM
AMD CPU does not support it; for Intel CPUs, AVX512 is better.
🙌 1
11:29
Hengfei
11:29 AM
Or ARM CPUs, which support NEON.
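One way to see which of these SIMD extensions a Linux host reports is to look at the flags in `/proc/cpuinfo`. The sketch below parses that text; the helper name and the selection of flags are illustrative, and it says nothing about which extensions a given ZincObserve build actually uses.

```python
# Flag names as they appear in /proc/cpuinfo on Linux:
# x86 lists them under "flags", ARM under "Features"
# (64-bit ARM reports NEON as "asimd").
SIMD_FLAGS = {"avx2", "avx512f", "neon", "asimd"}


def simd_flags(cpuinfo_text: str) -> set:
    """Return the SIMD-related flags found in /proc/cpuinfo text."""
    found = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() in ("flags", "features"):
            found.update(SIMD_FLAGS.intersection(value.split()))
    return found


if __name__ == "__main__":
    # Linux only: other platforms do not provide this file.
    with open("/proc/cpuinfo") as f:
        print(sorted(simd_flags(f.read())))
```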
Gaby
11:32 AM
Is the zincobserve-dev image available with SIMD? I have been using that image since the current latest release doesn't have the fix for the charts from days ago 😂
😅 1
11:32
Gaby
11:32 AM
Thanks
Prabhat
01:29 PM
SIMD builds are unlikely to help in search. They should help in aggregation queries (average, sum, etc). Think everything on dashboards.
01:40
Prabhat
01:40 PM
We will be making a new release soon
Gaby
02:08 PM
Good to know, thanks!