A more concise and effective extractor for excel and csv files #20625
a more concise and effective extractor for excel and csv files.
Hi @crazywoola, since I have changed the Markdown output format of Excel parsing to a more concise one, the API pytest suite no longer passes. What do I need to do to get it passing? Should I update the Excel-related tests, or just remove them? Looking forward to your reply.
You need to fix the broken tests and add some new tests for the newly added feature. For details, please refer to https://github.com/langgenius/dify/actions/runs/15434006394/job/43438681091?pr=20625
Thanks for your help. I will add new tests for _extract_text_from_excel.
add new testing scripts for _extract_text_from_excel.
@crazywoola I have added tests for _extract_text_from_excel, based on the previous ones, and pytest now passes.
@crazywoola This was caused by erroneous test cases, which I have corrected in the latest changes.
…enius#20625) Co-authored-by: haiyangpengai <xxxx>
Summary
As stated in #20602, the current document extractor for Excel and CSV files emits excessive whitespace and renders multi-line content within a single cell as a layered layout. Output like this is difficult to feed into an LLM for subsequent analysis and processing, and it consumes a large number of unnecessary tokens.
To address this, I rewrote the _extract_text_from_excel and _extract_text_from_csv functions in api/core/workflow/nodes/document_extractor/node.py to parse Excel and CSV files in a more concise and effective manner. In addition, the two modified functions now produce identical text for the same content. This PR fixes #20602.
Screenshots
Here are the optimized results for Excel and CSV files. The original files are shown below:
md_example.xlsx
md_example.csv
Optimized results:

mypy check
Passed
Checklist
I ran dev/reformat (backend) and cd web && npx lint-staged (frontend) to appease the lint gods