#zincobserve

Vector Crashes with Timeouts during ZO Ingestion

TL;DR: Gaby experienced Vector crashes and timeouts during ZO ingestion. Prabhat suggested disabling alerts, and Gaby found much better performance without alerts and after adjusting Vector's batch and buffer settings.

Solved
May 20, 2023 (1 week ago)
Gaby
02:02 PM
I'm testing sending 100 million events to ZO from Vector, but Vector crashes with timeouts talking to ZO. The WebUI also becomes very unresponsive. It might be worth adding batch and buffer settings to the Vector config created by the ZO UI.
Hengfei
02:04 PM
Are there any ZO logs, like errors or crashes? And are your events logs or metrics?
Gaby
02:06 PM
Zero errors on the ZO side, just Vector crashing with a timeout talking to ZO. I will put together a setup this weekend to replicate this and share it here.
Hengfei
02:07 PM
Okay, for 100 million logs, what was the data size in ZO at the end?
Hengfei
02:08 PM
And what are your node resources, CPU and memory?
Gaby
02:08 PM
Compressed, around 800MB; ingested, around 50GB.
Gaby
02:08 PM
Intel 12-core, with 48GB of RAM.
Gaby
02:09 PM
I had several instances where RAM usage went to 100% during ingestion.
Prabhat
02:10 PM
In how much time did you send all this data?
Prabhat
02:11 PM
E.g. Over a period of 10 minutes
Gaby
02:12 PM
Probably 20-30 mins; it was pushing data as fast as Vector could. It would crash, restart, and keep pushing data.
Prabhat
02:12 PM
Also, did you have any ingest functions?
Gaby
02:13 PM
I did a test sending 100GB at once, and that never succeeded.
Gaby
02:13 PM
Zero functions, only 1 alert.
Hengfei
02:13 PM
Prabhat, it's only 50GB of data; it shouldn't use that much memory.
Prabhat
02:13 PM
Realtime alert?
Gaby
02:13 PM
Yes
Hengfei
02:14 PM
Gaby, how do you generate the logs, and what is the generation speed?
Gaby
02:15 PM
A mix of a few hosts with verbose syslog plus Vector demo logs.
Prabhat
02:17 PM
It should not use this much memory. We have added way too many things in the ingest path. We had anyway planned to review it starting next week.
Gaby
02:17 PM
You can use multiple instances of this to generate a ton of logs:

https://vector.dev/docs/reference/configuration/sources/demo_logs/
Prabhat
02:18 PM
Yeah, have used it once
Gaby
02:18 PM
Just set interval to 0.0 and add 5-10 instances of it.
Gaby
02:18 PM
And one sink to ZO.
Hengfei
02:19 PM
Do you run it in k8s?
Gaby
02:20 PM
Just plain Docker.
Prabhat
02:58 PM
Also, where are you running it? AWS, GCP?
Gaby
02:58 PM
Local server
Gaby
02:58 PM
No cloud
May 21, 2023 (1 week ago)
Prabhat
12:36 AM
From a compute perspective (let's make some reasonable assumptions): we have 10 CPU cores (leaving 2 cores for other work) ingesting 50 GB in 20 minutes. That gives us ~4 MB/sec/core ingestion speed with alerts enabled. Our last tests showed we could do 10-15 MB/sec/core when we did not have alerts. Could alerts have reduced the performance by 50%?
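Prabhat's back-of-envelope math can be checked with a quick script. The only inputs are the figures from the thread (50 GB, 20 minutes, 10 usable cores); everything else is arithmetic:

```python
# Back-of-envelope check of the ingestion rate discussed above.
# Inputs come from the thread: 50 GB ingested in ~20 minutes on ~10 usable cores.
GB_INGESTED = 50
MINUTES = 20
USABLE_CORES = 10

mb_per_sec_per_core = GB_INGESTED * 1024 / (MINUTES * 60) / USABLE_CORES
print(f"{mb_per_sec_per_core:.1f} MB/sec/core")  # 4.3 MB/sec/core, i.e. "~4"

# Daily capacity at this rate across the 10 usable cores
tb_per_day = mb_per_sec_per_core * USABLE_CORES * 86400 / (1024 * 1024)
print(f"{tb_per_day:.1f} TB/day")  # 3.5 TB/day, in line with the ~4 TB estimate
```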
Prabhat
12:36 AM
Gaby, is it possible for you to try disabling alerts while doing this ingestion?
Prabhat
12:37 AM
I think it might be the alerts that are causing the extra CPU utilization.
Prabhat
12:37 AM
We will still need to investigate memory utilization.
Prabhat
12:41 AM
Gaby, was Vector running on the same machine generating the test data?
Prabhat
12:52 AM
To get a better perspective on ZincObserve ingestion performance, we need to have Vector running on a different machine.
Prabhat
12:57 AM
If ZincObserve was getting only 6 CPUs instead of 10 and also had alerts enabled, then we may still be in the green and performance is not really bad. But let's get some details before confirming one way or the other.
Prabhat
04:06 AM
At 4 MB/sec/core we should be able to ingest 4 TB on this machine in 24 hours in the current situation, and 10 TB if we can get to 10 MB/sec/core.
Prabhat
04:25 AM
Let's try to smooth out the issues.
Gaby
05:40 AM
Yes, Vector is running on the same machine.
Gaby
05:41 AM
It makes sense for the realtime alerts to be causing the issue if that query has to run on each batch of ingested events.
May 22, 2023 (1 week ago)
Prabhat
02:48 AM
Yeah, alerts can be expensive.
May 23, 2023 (1 week ago)
Gaby
03:24 AM
I turned off alerts in one of my ZO instances, and performance seems way better. Another thing I did was set the batch in Vector to 15k events, and the buffer to 5 million events with a timeout of 300 secs. Instead of alerts, I built a simple dashboard with the same query that can be checked once in a while. I may move to hourly scheduled alerts in the future.
Gaby
03:25 AM
Without the batch setting, you can run into a case where Vector accumulates, for example, 100MB and tries to send all 100MB at once to ZO. Given that ZO has a JSON size limit, this will fail. I have increased my limit to 2x the default.
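For reference, the tuning Gaby describes maps onto Vector's HTTP sink options roughly like this. This is a sketch, not Gaby's actual config: the URI, credentials, and sink/input names are placeholders, while the batch and buffer numbers are the ones mentioned above.

```toml
# Sketch of the batch/buffer tuning described above (numbers from the thread;
# URI, credentials, and names are placeholders).
[sinks.zincobserve]
type = "http"
inputs = ["remap_syslog"]
uri = "http://localhost:5080/api/default/default/_json"  # placeholder endpoint
method = "post"
auth.strategy = "basic"
auth.user = "root@example.com"  # placeholder
auth.password = "CHANGE_ME"     # placeholder
encoding.codec = "json"

[sinks.zincobserve.batch]
max_events = 15000   # cap each request at 15k events so one payload stays small
timeout_secs = 300   # flush a partial batch at least every 300s

[sinks.zincobserve.buffer]
type = "memory"
max_events = 5000000   # ~5 million events buffered in memory
when_full = "block"    # apply backpressure rather than dropping events
```

Bounding the batch size keeps each request under the ingestion payload limit, and a large buffer with `when_full = "block"` absorbs bursts instead of crashing the pipeline.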
Hengfei
03:27 AM
Thanks, let us know your progress.
Ashish
11:26 AM
Hi Gaby
Ashish
11:27 AM
I tested with 7 Vector instances running on a single c6i.2xlarge
Ashish
11:27 AM
(image attached)
Ashish
11:28 AM
I have 2 ingestors running on two separate c6i.2xlarge instances
Ashish
11:28 AM
The ingestors haven't restarted or been killed because of the load
Ashish
11:29 AM
However, Vector has restarted quite a few times
Ashish
11:29 AM
with OOM
Gaby
12:00 PM
I had that issue happen too; Vector started restarting every once in a while.
May 24, 2023 (1 week ago)
Hengfei
07:05 AM
Gaby, what is your config for demo logs? What log format is set?
Gaby
12:23 PM
Hengfei

[sources.generate_syslog]
type = "demo_logs"
format = "syslog"
count = 100

[transforms.remap_syslog]
inputs = [ "generate_syslog"]
type = "remap"
source = '''
  structured = parse_syslog!(.message)
  . = merge(., structured)
'''
Gaby
12:25 PM
Change the count to a higher number, and add the same source multiple times with different names.
Gaby
12:26 PM
And set interval = 0.0.
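Putting Gaby's suggestions together, a load-generation config might look like the following sketch. The source names and counts are illustrative; the pattern is several `demo_logs` sources at full speed (`interval = 0.0`) feeding one remap transform.

```toml
# Sketch: multiple demo_logs sources running flat out, all feeding the same
# remap transform. Source names and counts are illustrative.
[sources.generate_syslog_1]
type = "demo_logs"
format = "syslog"
count = 10000000
interval = 0.0   # emit as fast as possible

[sources.generate_syslog_2]
type = "demo_logs"
format = "syslog"
count = 10000000
interval = 0.0

[transforms.remap_syslog]
inputs = ["generate_syslog_1", "generate_syslog_2"]
type = "remap"
source = '''
  structured = parse_syslog!(.message)
  . = merge(., structured)
'''
```

Add more `generate_syslog_N` sources (5-10, as suggested above) to multiply the load, with a single sink to ZO downstream of the transform.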
Hengfei
12:27 PM
Okay, per Ashish's test, ZincObserve has no problem. The first problem is that when we create 10 Vector instances, Vector uses too much memory.
Gaby
Photo of md5-540a8e08ce1c199c4efaeb0388742259
Gaby
12:39 PM
Fair enough, I haven't seen the issue since using batch and buffer settings with Vector's HTTP sink.