Rudderstack becomes extremely slow when you have one destination down #4953
Comments
Can you share what response do you see for |
@fackyhigh can you please also share the config value for |
Hello @gitcommitshow! Here is:
BTW I forgot to send you a config:
maxProcess: 12
gwDBRetention: 0h
routerDBRetention: 0h
enableProcessor: true
enableRouter: true
enableStats: true
statsTagsFormat: influxdb
Http:
  ReadTimeout: 0s
  ReadHeaderTimeout: 0s
  WriteTimeout: 10s
  IdleTimeout: 720s
  MaxHeaderBytes: 524288
RateLimit:
  eventLimit: 1000
  rateLimitWindow: 60m
  noOfBucketsInWindow: 12
Gateway:
  webPort: 8080
  maxUserWebRequestWorkerProcess: 128
  maxDBWriterProcess: 512
  CustomVal: GW
  maxUserRequestBatchSize: 128
  maxDBBatchSize: 128
  userWebRequestBatchTimeout: 15ms
  dbBatchWriteTimeout: 5ms
  maxReqSizeInKB: 4000
  enableRateLimit: false
  enableSuppressUserFeature: true
  allowPartialWriteWithErrors: true
  allowReqsWithoutUserIDAndAnonymousID: false
  webhook:
    batchTimeout: 20ms
    maxBatchSize: 32
    maxTransformerProcess: 64
    maxRetry: 5
    maxRetryTime: 10s
    sourceListForParsingParams:
      - shopify
EventSchemas:
  enableEventSchemasFeature: false
  syncInterval: 240s
  noOfWorkers: 128
Debugger:
  maxBatchSize: 32
  maxESQueueSize: 1024
  maxRetry: 3
  batchTimeout: 2s
  retrySleep: 100ms
SourceDebugger:
  disableEventUploads: false
DestinationDebugger:
  disableEventDeliveryStatusUploads: false
TransformationDebugger:
  disableTransformationStatusUploads: false
Archiver:
  backupRowsBatchSize: 100
JobsDB:
  jobDoneMigrateThres: 0.8
  jobStatusMigrateThres: 5
  maxDSSize: 100000
  maxMigrateOnce: 10
  maxMigrateDSProbe: 10
  maxTableSizeInMB: 300
  migrateDSLoopSleepDuration: 30s
  addNewDSLoopSleepDuration: 5s
  refreshDSListLoopSleepDuration: 5s
  backupCheckSleepDuration: 5s
  backupRowsBatchSize: 1000
  archivalTimeInDays: 3
  archiverTickerTime: 1440m
  backup:
    enabled: false
    gw:
      enabled: true
      pathPrefix: ""
    rt:
      enabled: true
      failedOnly: true
    batch_rt:
      enabled: false
      failedOnly: false
Router:
  jobQueryBatchSize: 10000
  updateStatusBatchSize: 1000
  readSleep: 1000ms
  fixedLoopSleep: 0ms
  noOfJobsPerChannel: 1000
  noOfJobsToBatchInAWorker: 20
  jobsBatchTimeout: 5s
  maxSleep: 60s
  minSleep: 0s
  maxStatusUpdateWait: 5s
  useTestSink: false
  guaranteeUserEventOrder: true
  kafkaWriteTimeout: 2s
  kafkaDialTimeout: 10s
  minRetryBackoff: 10s
  maxRetryBackoff: 300s
  noOfWorkers: 64
  allowAbortedUserJobsCountForProcessing: 1
  maxFailedCountForJob: 3
  retryTimeWindow: 180m
  failedKeysEnabled: false
  saveDestinationResponseOverride: false
  responseTransform: false
  MARKETO:
    noOfWorkers: 4
  throttler:
    MARKETO:
      limit: 45
      timeWindow: 20s
  BRAZE:
    forceHTTP1: true
    httpTimeout: 120s
    httpMaxIdleConnsPerHost: 32
BatchRouter:
  mainLoopSleep: 2s
  jobQueryBatchSize: 100000
  uploadFreq: 30s
  warehouseServiceMaxRetryTime: 3h
  noOfWorkers: 8
  maxFailedCountForJob: 128
  retryTimeWindow: 180m
  datePrefixOverride: "YYYY-MM-DD"
Warehouse:
  mode: embedded
  webPort: 8082
  uploadFreq: 1800s
  noOfWorkers: 8
  noOfSlaveWorkerRoutines: 4
  mainLoopSleep: 5s
  minRetryAttempts: 3
  retryTimeWindow: 180m
  minUploadBackoff: 60s
  maxUploadBackoff: 1800s
  warehouseSyncPreFetchCount: 10
  warehouseSyncFreqIgnore: false
  stagingFilesBatchSize: 960
  enableIDResolution: false
  populateHistoricIdentities: false
  redshift:
    maxParallelLoads: 3
    setVarCharMax: false
  snowflake:
    maxParallelLoads: 3
  bigquery:
    maxParallelLoads: 20
  postgres:
    maxParallelLoads: 3
  mssql:
    maxParallelLoads: 3
  azure_synapse:
    maxParallelLoads: 3
  clickhouse:
    maxParallelLoads: 3
    queryDebugLogs: false
    blockSize: 200000
    poolSize: 10
    disableNullable: true
    enableArraySupport: false
Processor:
  webPort: 8086
  loopSleep: 10ms
  maxLoopSleep: 5000ms
  fixedLoopSleep: 0ms
  maxLoopProcessEvents: 10000
  transformBatchSize: 100
  userTransformBatchSize: 200
  maxConcurrency: 200
  maxHTTPConnections: 100
  maxHTTPIdleConnections: 50
  maxRetry: 30
  retrySleep: 100ms
  timeoutDuration: 30s
  errReadLoopSleep: 30s
  errDBReadBatchSize: 1000
  noOfErrStashWorkers: 2
  maxFailedCountForErrJob: 3
  Stats:
    # have event name as label in prometheus metrics
    captureEventName: true
  dbReadBatchSize: 50000
  maxChanSize: 8192
Dedup:
  enableDedup: false
  dedupWindow: 900s
BackendConfig:
  configFromFile: false
  configJSONPath: /etc/rudderstack/workspaceconfig/workspaceConfig.json
  pollInterval: 5s
  regulationsPollInterval: 300s
  maxRegulationsPerRequest: 1000
recovery:
  enabled: true
  errorStorePath: /tmp/error_store.json
  storagePath: /tmp/recovery_data.json
  normal:
    crashThreshold: 5
    duration: 300s
Logger:
  enableConsole: true
  enableFile: false
  consoleJsonFormat: false
  fileJsonFormat: false
  logFileLocation: /tmp/rudder_log.log
  logFileSize: 100
  enableTimestamp: true
  enableFileNameInLog: true
  enableStackTrace: false
Diagnostics:
  enableDiagnostics: true
  gatewayTimePeriod: 60s
  routerTimePeriod: 60s
  batchRouterTimePeriod: 6l
  enableServerStartMetric: true
  enableConfigIdentifyMetric: true
  enableServerStartedMetric: true
  enableConfigProcessedMetric: true
  enableGatewayMetric: true
  enableRouterMetric: true
  enableBatchRouterMetric: true
  enableDestinationFailuresMetric: true
RuntimeStats:
  enabled: true
  statsCollectionInterval: 10
  enableCPUStats: true
  enableMemStats: true
  enableGCStats: true
PgNotifier:
  retriggerInterval: 2s
  retriggerCount: 500
  trackBatchInterval: 2s
Regarding |
Config seems correct. For event sync lag time, @fackyhigh can you please check the sync lag for different tl;dr: Aggregate the sync lag time data by |
Hello @gitcommitshow. I'm not sure that I've understood you. I have a couple of questions:
All graph lines are the same. Are there any other metrics? I've experimented with
To be clear, we have 7 destinations of type WEBHOOK, and those destinations belong to only 3 services. 2 of the 3 services have 3 API endpoints each, for Android, iOS and Web events. So that is 2 services with 3 endpoints each and 1 service with one endpoint.
The incoming events are queued, and removed from the queue when they are delivered to the destination. The pending events that still need to be delivered make up the backlog.
Both the graph/screenshots you shared in the original post |
@gitcommitshow, hello. Is there any news about the problem? I've provided the information you requested. You also haven't answered these questions:
Allow me some more time to get back to you on the pending questions.
I received the following input from the team:
All 429, 408 and 5xx status codes are retried; all other failed jobs are aborted. 5xx errors are retried for 180 minutes and 3 attempts (as per the config shared), and the events are aborted after that. 429 and 408 are retried until a terminal status code is returned.
The webhook destination might be returning 429, 408 or 5xx status codes. Since these status codes are retried, new events are not processed (to maintain event ordering), which leaves them in the waiting state.
we can check
We can't do much if the destination is slow, because we have to keep storing events until the destination is up again so that we can deliver them.
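To make the retry rules described above concrete, here is a minimal Python sketch of how a single failed or successful delivery attempt could be classified. It is not rudder-server's actual code: the function name `next_job_state` and the state strings are hypothetical, the two constants mirror `Router.maxFailedCountForJob: 3` and `Router.retryTimeWindow: 180m` from the shared config, and it assumes a 5xx job is aborted only once both limits are exhausted (one reading of "180 mins and 3 attempts").

```python
from datetime import timedelta

# Values taken from the Router section of the config shared in this thread.
MAX_FAILED_COUNT_FOR_JOB = 3                 # Router.maxFailedCountForJob
RETRY_TIME_WINDOW = timedelta(minutes=180)   # Router.retryTimeWindow


def next_job_state(status_code, attempts, elapsed):
    """Classify one delivery attempt (hypothetical helper, not the real router code).

    status_code: HTTP status returned by the destination
    attempts:    number of delivery attempts made so far, including this one
    elapsed:     timedelta since the first attempt for this job
    """
    if 200 <= status_code < 300:
        return "succeeded"
    if status_code in (429, 408):
        # Retried until the destination returns a terminal status code.
        return "waiting"
    if 500 <= status_code < 600:
        # Assumption: the job is aborted only when BOTH limits are exhausted.
        if attempts >= MAX_FAILED_COUNT_FOR_JOB and elapsed > RETRY_TIME_WINDOW:
            return "aborted"
        return "waiting"
    # Every other status code is treated as terminal.
    return "aborted"
```

With `guaranteeUserEventOrder: true`, a job that stays in the retry state would hold back later events for the same user, which matches the waiting state described above.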
@gitcommitshow We still see the RT tables growing every day:
The strange thing is that we have 80 rudder-server pods, but the tables grow only on specific pods:
We always see this RT growth on only a small subset of the whole pod pool.
Hey there @gitcommitshow. Are there any new hypotheses? I have a couple more questions:
Hi @fackyhigh, thank you for sharing specific questions. I got some more input from the team, as follows.
Can you please share the last statuses per job in the first
It doesn't matter, since you said there have been no errors in the last 30 days.
It's possible there's a heavy skew in the userIDs of your events, which could also lead to some slowdown, as you noted in another point about event ordering.
It could, but only if there are a lot of events from a few userIDs. You could also try increasing the router workers (more workers pushing out events concurrently) - increase also
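One way to check for the userID skew mentioned above is to count events per user in a sample of your traffic. The sketch below is only illustrative: `user_id_skew` is a hypothetical helper, and it assumes each sampled event is a dict carrying a `userId` or `anonymousId` field, as in the event spec.

```python
from collections import Counter


def user_id_skew(events, top_n=3):
    """Return the top_n (user key, share of all events) pairs, largest first.

    Hypothetical helper: 'events' is assumed to be a list of dicts that carry
    either a 'userId' or an 'anonymousId' field.
    """
    keys = [e.get("userId") or e.get("anonymousId") or "<missing>" for e in events]
    counts = Counter(keys)
    total = len(keys)
    # most_common() returns an empty list for empty input, so no division by zero.
    return [(key, count / total) for key, count in counts.most_common(top_n)]


# Example with made-up data: one user produces most of the traffic.
sample = [{"userId": "u1"}] * 8 + [{"anonymousId": "a1"}, {"userId": "u2"}]
top_users = user_id_skew(sample, top_n=2)
```

If a handful of keys account for most of the events, per-user ordering would funnel that traffic through few workers, so adding workers alone may not help much.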
It's evening now, so the traffic spike has subsided and the RT table count has gone down as well. For now we have only executing and succeeded statuses, but we'll see what happens tomorrow. Can you clarify just one thing? What is the UserID? Is it part of the event? Is it something that we can handle or manage? I'll experiment with the rest of your hints and get back to you. Have a great day and thank you for all the useful information!!
Yes, it is part of the event data. To be more specific, it is either the
Reference - https://www.rudderstack.com/docs/event-spec/standard-events/identify/#user-id-vs-anonymous-id
Describe the bug
We have the data plane running in k8s with 60 pods. When Grafana shows that one of our Webhook destinations is down, Rudderstack becomes extremely slow: webhook delivery time increases dramatically, the rt table count grows from 2 to ~30 per PostgreSQL pod, and webhook event sync lag time goes from 10 seconds to almost one hour.
Steps to reproduce the bug
Enter the steps to reproduce the behavior.
Expected behavior
When a destination is down, the system should keep running fast, and the outage should not affect other destinations.
Screenshots
Any additional context
Rudderstack version is 1.28.1
Please tell us what to tweak so that Rudderstack keeps working as usual when one destination goes down. It would also be appreciated if you could explain how the retry logic actually works and why it affects other destinations.