A more concise and effective extractor for excel and csv files #20625

HaiyangPeng · 2025-06-04T05:01:44Z

Important

Make sure you have read our contribution guidelines
Ensure there is an associated issue and you have been assigned to it
Use the correct syntax to link this PR: Fixes #<issue number>.

Summary

As stated in #20602, the current document extractor for Excel and CSV files outputs excessive whitespace and renders multi-line content within a single cell as a layered layout. This output is difficult to feed into an LLM for subsequent analysis and processing, and it consumes a large number of unnecessary tokens.
To address this, I rewrote the _extract_text_from_excel and _extract_text_from_csv functions in api/core/workflow/nodes/document_extractor/node.py to parse Excel and CSV files more concisely and effectively. In addition, the two modified functions can produce identical text. This PR fixes #20602.
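
For readers who want a concrete picture of the idea, here is a minimal sketch. It is not the code merged in this PR: the helper name csv_to_markdown and its cleaning rules are assumptions made for illustration, and it uses only the standard library rather than whatever node.py relies on. It shows one way to collapse multi-line cells and extra whitespace into a compact Markdown table.

```python
import csv
import io


def csv_to_markdown(text: str) -> str:
    """Render CSV text as a compact Markdown table.

    Hypothetical helper for illustration only; not the PR's implementation.
    """
    # newline="" lets the csv module handle line breaks inside quoted cells.
    reader = csv.reader(io.StringIO(text, newline=""))
    rows = [row for row in reader if row]  # skip blank lines
    if not rows:
        return ""

    def clean(cell: str) -> str:
        # Collapse line breaks and runs of whitespace into single spaces,
        # and escape pipes so they do not break the table layout.
        return " ".join(cell.split()).replace("|", "\\|")

    width = max(len(row) for row in rows)
    # Pad short rows so every line has the same number of columns.
    table = [[clean(c) for c in row] + [""] * (width - len(row)) for row in rows]

    def render(cells: list[str]) -> str:
        return "| " + " | ".join(cells) + " |"

    lines = [render(table[0]), render(["---"] * width)]
    lines += [render(row) for row in table[1:]]
    return "\n".join(lines)


if __name__ == "__main__":
    sample = 'name,notes\nAlice,"line one\nline two"\nBob,  spaced   out  \n'
    print(csv_to_markdown(sample))
```

Each record becomes exactly one table row, so a cell that spans several lines in the source file no longer spreads across the output.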

Screenshots

Here are the optimized results for Excel and CSV files. The original files are shown below:
md_example.xlsx
md_example.csv

Optimized results:

mypy check

Passed

Checklist

This change requires a documentation update, included: Dify Document
I understand that this PR may be closed in case there was no previous discussion or issues. (This doesn't apply to typos!)
I've added a test for each change that was introduced, and I tried as much as possible to make a single atomic change.
I've updated the documentation accordingly.
I ran dev/reformat(backend) and cd web && npx lint-staged(frontend) to appease the lint gods

a more concise and effective extractor for excel and csv files.

HaiyangPeng · 2025-06-04T09:44:19Z

Hi @crazywoola, since I have changed the current Markdown output format of Excel parsing to a more concise one, the API pytest suite no longer passes. In this situation, what do I need to do to make it pass? Or should I just remove all the Excel-related tests? Looking forward to your reply.

crazywoola · 2025-06-04T11:40:38Z

You need to fix the broken tests and add some new tests for the newly added feature.

For details, please refer to https://github.com/langgenius/dify/actions/runs/15434006394/job/43438681091?pr=20625

HaiyangPeng · 2025-06-05T01:43:48Z

> You need to fix the broken tests and add some new tests for the newly added feature.
>
> For details, please refer to https://github.com/langgenius/dify/actions/runs/15434006394/job/43438681091?pr=20625

Thanks for your help. I will add new test scripts for _extract_text_from_excel.

add new testing scripts for _extract_text_from_excel.

HaiyangPeng · 2025-06-05T04:10:12Z

@crazywoola I have added test scripts for _extract_text_from_excel, based on the previous ones, and they pass pytest.

crazywoola · 2025-06-05T04:12:25Z

https://github.com/langgenius/dify/actions/runs/15458127948/job/43514165654?pr=20625

Still not working.

HaiyangPeng · 2025-06-05T05:24:04Z

> https://github.com/langgenius/dify/actions/runs/15458127948/job/43514165654?pr=20625
>
> Still not working.

@crazywoola This was caused by erroneous test cases. I have corrected them in the latest changes.

haiyangpengai added 3 commits June 4, 2025 10:03

a more concise and effective extractor for excel and csv files.

ce0910d

Merge branch 'main' of github.com:HaiyangPeng/dify into main

fed0568

a more concise and effective extractor for excel and csv files.

fix bugs generated by mypy.

fd2b9fe

dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. 📚 documentation Improvements or additions to documentation labels Jun 4, 2025

haiyangpengai and others added 4 commits June 5, 2025 11:48

add testing scripts for extract_text_from_excel.

9da5a2a

Merge branch 'langgenius:main' into main

3edb235

add testing scripts for extract_text_from_excel.

f2a45af

Merge branch 'main' of github.com:HaiyangPeng/dify into main

869c5d4

add new testing scripts for _extract_text_from_excel.

dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels Jun 5, 2025

add testing scripts for extract_text_from_excel.

7ea880d

crazywoola approved these changes Jun 5, 2025

dosubot bot added the lgtm This PR has been approved by a maintainer label Jun 5, 2025

crazywoola merged commit 3fb9b41 into langgenius:main Jun 5, 2025
6 checks passed

laipz8200 mentioned this pull request Jun 11, 2025

chore(package): Bump version to 1.4.2 #20897

Merged

dosubot bot mentioned this pull request Jun 24, 2025

Doc Extractor, xlsx files get an empty result #21397

Closed

5 tasks

quicksandznzn mentioned this pull request Jun 24, 2025

fix(document_extractor): xlsx file column int type error #21408

Merged

5 tasks

jsincorporated pushed a commit to jsincorporated/asaAi that referenced this pull request Jul 8, 2025

A more concise and effective extractor for excel and csv files (langg…

48ba827

…enius#20625) Co-authored-by: haiyangpengai <xxxx>
