Description
Hello, I've tried to debug why scheduling SegmentGenerationAndPushTask Minion jobs takes so long, and I've narrowed the problem down to this part of the code.
Lines 297 to 309 in 78308da:

```java
JobConfig.Builder jobBuilder =
    new JobConfig.Builder().addTaskConfigs(helixTaskConfigs).setInstanceGroupTag(minionInstanceTag)
        .setTimeoutPerTask(taskTimeoutMs).setNumConcurrentTasksPerInstance(numConcurrentTasksPerInstance)
        .setIgnoreDependentJobFailure(true).setMaxAttemptsPerTask(1).setFailureThreshold(Integer.MAX_VALUE)
        .setExpiry(_taskExpireTimeMs);
_taskDriver.enqueueJob(getHelixJobQueueName(taskType), parentTaskName, jobBuilder);
// Wait until task state is available
while (getTaskState(parentTaskName) == null) {
  Uninterruptibles.sleepUninterruptibly(100, TimeUnit.MILLISECONDS);
}
return parentTaskName;
```
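As far as I understand, `getTaskState` reads the job state out of the Helix workflow context in ZooKeeper, so this loop is effectively waiting, 100 ms at a time and once per enqueued job, for the Helix controller to process the queue and write that context back. Even if there is no callback hook, a bounded wait would already help. Helix's `TaskDriver` seems to ship a blocking helper along these lines; a sketch, where `queueName` and `namespacedJobName` are placeholders for whatever `getHelixJobQueueName(taskType)` and the namespaced parent task name resolve to:

```java
import java.util.concurrent.TimeUnit;

import org.apache.helix.task.TaskDriver;
import org.apache.helix.task.TaskState;

// Sketch only: TaskDriver#pollForJobState still polls ZK under the hood,
// but it takes an explicit timeout, so a stuck job cannot block scheduling forever.
class BoundedTaskStateWait {
  static TaskState waitForState(TaskDriver taskDriver, String queueName, String namespacedJobName)
      throws InterruptedException {
    // Helix namespaces job names inside a queue, typically "<queueName>_<jobName>"
    return taskDriver.pollForJobState(queueName, namespacedJobName,
        TimeUnit.SECONDS.toMillis(30), // assumption: 30 s is an acceptable upper bound
        TaskState.IN_PROGRESS, TaskState.COMPLETED, TaskState.FAILED);
  }
}
```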
I'm currently using the POST /tasks/execute API to schedule the job.
The culprit seems to be the while loop waiting for the task to get a state. I'm not familiar with how Helix handles this in the background. Do you think it would be possible to avoid polling the synchronized getTaskState() and instead implement a callback to get the result of job scheduling? A sketch of what I have in mind is below.
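Everything here (the `AsyncTaskScheduler` name, the backoff constants, the `getTaskState` stub) is hypothetical and would need to be wired into the real task resource manager; the point is only that the confirmation wait can be moved off the scheduling thread:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.helix.task.TaskState;

// Hypothetical wrapper: enqueue the job, then confirm its state off the request
// thread instead of blocking the scheduler in a fixed 100 ms sleep loop.
public class AsyncTaskScheduler {
  private final ScheduledExecutorService _poller = Executors.newSingleThreadScheduledExecutor();

  /** Completes once Helix reports a non-null state for the task, without blocking the caller. */
  public CompletableFuture<TaskState> waitForTaskState(String parentTaskName) {
    CompletableFuture<TaskState> future = new CompletableFuture<>();
    pollOnce(parentTaskName, future, 100); // start with the current 100 ms delay
    return future;
  }

  private void pollOnce(String parentTaskName, CompletableFuture<TaskState> future, long delayMs) {
    _poller.schedule(() -> {
      TaskState state = getTaskState(parentTaskName);
      if (state != null) {
        future.complete(state);
      } else {
        // Exponential backoff, capped at 1 s, so ZK is not hammered under load
        pollOnce(parentTaskName, future, Math.min(delayMs * 2, 1000));
      }
    }, delayMs, TimeUnit.MILLISECONDS);
  }

  private TaskState getTaskState(String parentTaskName) {
    // Stub: stands in for PinotHelixTaskResourceManager#getTaskState
    throw new UnsupportedOperationException("wire to the real task resource manager");
  }
}
```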
This is a big deal for us, since scheduling takes longer than ingestion itself and prevents us from keeping up with new data and scaling. It might also be a misconfiguration problem, but in that case I will need your help to find it.
Current configuration:
- GKE
- Pinot version 0.12.1
- GCS for deep storage
- 3 ZK: 8 CPU, 18 GB RAM
- 6 Servers: 16 CPU, 32–64 GB RAM, 1.45 TB SSD
- 2 Controllers: 16 CPU, 32 GB RAM
- 2 Brokers: 5 CPU, 16.25 GB RAM
- 32 Minions: 2 CPU, 2 GB RAM
- 1M segments, 4 TB of data