feat: provide a new AQE-friendly ConnectedComponents impl without breaking the old one #759

@ericsun95

Describe the bug
Hey, I saw a significant performance drop in connected components (cc) computation after updating from 0.9.4 to the latest version. The first iteration of cc is a little slower, but for subsequent iterations the time can increase from several minutes to around half an hour. In the Spark UI, CPU usage on each executor is almost 0 while it is very high on the driver. Within a single stage, all executors finish their tasks in seconds, yet the total time can be half an hour. This might be caused by the algorithm updates, or by the switch from writing to Parquet to using checkpoints.
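Below is a rough sketch of how the slowdown could be timed, not the exact job from this report. It assumes PySpark and the `graphframes` Python package are installed. The tiny graph, the `/tmp/cc-checkpoints` path, and the helper names `time_runs`, `run_connected_components` and `VARIANTS` are all hypothetical placeholders. The AQE toggle is there only because the title mentions AQE. I have not confirmed that it is the cause.

```python
import time
from typing import Callable, Dict, List


def time_runs(run: Callable[[Dict[str, object]], object],
              variants: List[Dict[str, object]]) -> List[Dict[str, object]]:
    """Call `run` once per variant and record the wall-clock seconds of each call."""
    results = []
    for variant in variants:
        start = time.perf_counter()
        run(variant)
        results.append({"variant": variant,
                        "seconds": time.perf_counter() - start})
    return results


def run_connected_components(variant: Dict[str, object]) -> int:
    """Hypothetical reproduction step; needs pyspark and graphframes at call time."""
    from pyspark.sql import SparkSession
    from graphframes import GraphFrame

    spark = SparkSession.builder.getOrCreate()
    # Placeholder checkpoint location; the "graphframes" algorithm requires one.
    spark.sparkContext.setCheckpointDir("/tmp/cc-checkpoints")
    # Toggle adaptive query execution to compare both settings.
    spark.conf.set("spark.sql.adaptive.enabled", str(variant["aqe"]).lower())

    # Toy graph (assumption); replace with the real vertices and edges.
    vertices = spark.createDataFrame([(i,) for i in range(6)], ["id"])
    edges = spark.createDataFrame([(0, 1), (1, 2), (3, 4)], ["src", "dst"])
    graph = GraphFrame(vertices, edges)

    # count() forces the full computation so the timing covers every iteration.
    return graph.connectedComponents(**variant["cc_kwargs"]).count()


# Variants to compare; the values are illustrative, not taken from the report.
VARIANTS = [
    {"aqe": True, "cc_kwargs": {"algorithm": "graphframes", "checkpointInterval": 2}},
    {"aqe": False, "cc_kwargs": {"algorithm": "graphframes", "checkpointInterval": 2}},
    {"aqe": True, "cc_kwargs": {"algorithm": "graphx"}},
]
```

Calling `time_runs(run_connected_components, VARIANTS)` once with 0.9.4 installed and once with the latest version, then comparing the `seconds` values, should show whether the regression follows the algorithm, the checkpointing, or the AQE setting.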

To Reproduce

Steps to reproduce the behavior:

  1. ...
  2. ...
  3. ...

Expected behavior

System [please complete the following information]:

  • OS: e.g. [Ubuntu 18.04]
  • Python Version (if applied): [e.g. Python 3.8]
  • Spark / PySpark version: Spark 3.5.4
  • GraphFrames version: [e.g. graphframes-0.9.0]

Component

  • Scala Core Internal
  • Scala API
  • Spark Connect Plugin
  • PySpark Classic
  • PySpark Connect

Additional context

Are you planning on creating a PR?

  • I'm willing to make a pull-request
