@HaiyangPeng
Contributor

@HaiyangPeng HaiyangPeng commented Jun 4, 2025

Important

  1. Make sure you have read our contribution guidelines
  2. Ensure there is an associated issue and you have been assigned to it
  3. Use the correct syntax to link this PR: Fixes #<issue number>.

Summary

As described in #20602, the current document extractor for Excel and CSV files emits excessive whitespace and renders multi-line content within a single cell as a layered layout. This output is hard to feed into an LLM for downstream analysis and processing, and it wastes a large number of tokens.
To address this, I rewrote the _extract_text_from_excel and _extract_text_from_csv functions in api/core/workflow/nodes/document_extractor/node.py so that Excel and CSV files are parsed more concisely and effectively. In addition, the two modified functions now produce identical text for the same content. This PR fixes #20602.
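
As a rough illustration of the idea (not the code in this PR), the sketch below collapses multi-line cell content and extra whitespace before rendering rows as a compact Markdown table. The helper names rows_to_markdown and csv_to_markdown are hypothetical, and only the standard library csv module is used; rows read from an Excel sheet could be passed to the same helper, which is one way both paths could end up producing identical text.

```python
import csv
import io


def rows_to_markdown(rows):
    """Render rows (first row is the header) as a compact Markdown table.

    Hypothetical helper for illustration only; pipe characters inside
    cells are not escaped here.
    """
    if not rows:
        return ""

    def clean(cell):
        # Collapse newlines and runs of whitespace inside a cell into single spaces.
        return " ".join(str(cell).split()) if cell is not None else ""

    header, *body = [[clean(c) for c in row] for row in rows]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


def csv_to_markdown(text):
    """Parse CSV text (quoted multi-line cells allowed) and render it."""
    return rows_to_markdown(list(csv.reader(io.StringIO(text))))


if __name__ == "__main__":
    print(csv_to_markdown('name,notes\nAlice,"line one\nline two"\n'))
```

Because every cell is flattened to one line before the table is built, a multi-line cell no longer spreads across several output rows, which is where most of the wasted tokens came from.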

Screenshots

Below are the optimized results for Excel and CSV files. The original files are:
md_example.xlsx
md_example.csv

Optimized results:
Screenshot from 2025-06-04 12-28-37

mypy check

Passed

Checklist

  • This change requires a documentation update, included: Dify Document
  • I understand that this PR may be closed in case there was no previous discussion or issues. (This doesn't apply to typos!)
  • I've added a test for each change that was introduced, and I tried as much as possible to make a single atomic change.
  • I've updated the documentation accordingly.
  • I ran dev/reformat(backend) and cd web && npx lint-staged(frontend) to appease the lint gods

@dosubot (bot) added the size:M (this PR changes 30-99 lines, ignoring generated files) and 📚 documentation (improvements or additions to documentation) labels on Jun 4, 2025
@HaiyangPeng
Contributor Author

Hi @crazywoola, since I changed the Markdown output format of the Excel parsing to a more concise one, the API pytest run no longer passes. In this situation, what do I need to do to make it pass, or should I just remove all the Excel-related tests? Looking forward to your reply.

@crazywoola
Member

You need to fix the broken tests and add some new tests for the newly added feature.

For details, please refer to https://github.com/langgenius/dify/actions/runs/15434006394/job/43438681091?pr=20625

@HaiyangPeng
Contributor Author

Thanks for your help. I will add new tests for _extract_text_from_excel.

@dosubot (bot) added the size:L (this PR changes 100-499 lines, ignoring generated files) label and removed the size:M label on Jun 5, 2025
@HaiyangPeng
Contributor Author

@crazywoola I have added tests for _extract_text_from_excel, based on the previous ones, and pytest now passes.

@crazywoola
Member

https://github.com/langgenius/dify/actions/runs/15458127948/job/43514165654?pr=20625

Still not working.

@HaiyangPeng
Contributor Author

@crazywoola This was caused by erroneous test cases, and I have corrected them in the latest changes.

@dosubot (bot) added the lgtm (this PR has been approved by a maintainer) label on Jun 5, 2025
@crazywoola crazywoola merged commit 3fb9b41 into langgenius:main Jun 5, 2025
6 checks passed
jsincorporated pushed a commit to jsincorporated/asaAi that referenced this pull request Jul 8, 2025
Development

Successfully merging this pull request may close these issues.

Request a more concise and effective extractor for excel and csv files