Minion Batch ingestion scheduling bottleneck #11282

@t0mpere

Hello, I've tried to debug why SegmentGenerationAndPushTask Minion jobs take so long to schedule, and I've narrowed the problem down to this part of the code.

    JobConfig.Builder jobBuilder =
        new JobConfig.Builder().addTaskConfigs(helixTaskConfigs).setInstanceGroupTag(minionInstanceTag)
            .setTimeoutPerTask(taskTimeoutMs).setNumConcurrentTasksPerInstance(numConcurrentTasksPerInstance)
            .setIgnoreDependentJobFailure(true).setMaxAttemptsPerTask(1).setFailureThreshold(Integer.MAX_VALUE)
            .setExpiry(_taskExpireTimeMs);
    _taskDriver.enqueueJob(getHelixJobQueueName(taskType), parentTaskName, jobBuilder);
    // Wait until task state is available
    while (getTaskState(parentTaskName) == null) {
      Uninterruptibles.sleepUninterruptibly(100, TimeUnit.MILLISECONDS);
    }
    return parentTaskName;

I'm currently using the POST /tasks/execute API to schedule the job.
The culprit seems to be the while loop that waits for the task to get a state. I'm not familiar with how Helix handles this in the background. Do you think it would be possible to avoid looping on the synchronized getTaskState() and instead implement a callback to get the result of scheduling a job?
This is a big deal for us, since scheduling takes longer than the ingestion itself, which prevents us from keeping up with new data and from scaling.
It might also be a misconfiguration problem, but in that case I will need your help to find it.
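
To make the callback idea concrete, here is a minimal sketch of what I have in mind. It is not Pinot or Helix code: `TaskStateWaiter` and `awaitState` are hypothetical names, and the `Supplier` stands in for a call such as `() -> getTaskState(parentTaskName)`. It still polls (I don't know whether Helix offers a notification for this), but after the first check the polling runs on a scheduler thread rather than the calling thread, and it gives up after a timeout instead of looping forever.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

/** Hypothetical helper for illustration only; not part of Pinot or Helix. */
final class TaskStateWaiter {
  private TaskStateWaiter() {
  }

  /**
   * Returns a future that completes with the first non-null value produced by
   * stateSupplier, or fails with a TimeoutException once timeoutMs has elapsed.
   * The first check runs on the calling thread; later checks run on the scheduler.
   */
  static <T> CompletableFuture<T> awaitState(Supplier<T> stateSupplier, ScheduledExecutorService scheduler,
      long pollIntervalMs, long timeoutMs) {
    CompletableFuture<T> result = new CompletableFuture<>();
    long deadlineNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
    poll(stateSupplier, scheduler, pollIntervalMs, deadlineNanos, result);
    return result;
  }

  private static <T> void poll(Supplier<T> stateSupplier, ScheduledExecutorService scheduler,
      long pollIntervalMs, long deadlineNanos, CompletableFuture<T> result) {
    if (result.isDone()) {
      return; // the caller cancelled or completed the future
    }
    try {
      T state = stateSupplier.get();
      if (state != null) {
        result.complete(state);
        return;
      }
      if (System.nanoTime() - deadlineNanos >= 0) {
        result.completeExceptionally(new TimeoutException("task state not available before timeout"));
        return;
      }
      // Check again later without blocking the current thread
      scheduler.schedule(() -> poll(stateSupplier, scheduler, pollIntervalMs, deadlineNanos, result),
          pollIntervalMs, TimeUnit.MILLISECONDS);
    } catch (RuntimeException e) {
      // Supplier failure or a rejected schedule (for example, scheduler shut down)
      result.completeExceptionally(e);
    }
  }
}
```

With something like this, the scheduling method could return `parentTaskName` right after `enqueueJob` and let the caller decide whether to wait on the future. I have not verified that the synchronized getTaskState() is the actual source of contention, so this only addresses the blocking wait, not how quickly Helix makes the state available.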

Current configuration:
GKE
version 0.12.1
GCS for deep storage
3 ZK - 8 CPU and 18GB ram
6 Servers - 16CPU and 32 64GB ram 1.45TB SSD
2 Controllers - 16 CPU and 32GB ram
2 Brokers - 5 CPU 16.25GB ram
32 Minions - 2 CPU and 2GB of ram

1M Segments 4TB of data
