Add a very important comment for SentenceSplitter #14257

Open: wencan wants to merge 1 commit into base: main

wencan
Contributor

@wencan wencan commented Jun 20, 2024

Description

Add a very important comment for SentenceSplitter

Fixes # (issue)

New Package?

Did I fill in the tool.llamahub section in the pyproject.toml and provide a detailed README.md for my new integration or package?

  • Yes
  • No

Version Bump?

Did I bump the version in the pyproject.toml file of the package I am updating? (Except for the llama-index-core package)

  • Yes
  • No

Type of Change

Please delete options that are not relevant.

  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration

  • Added new unit/integration tests
  • Added new notebook (that tests end-to-end)
  • I stared at the code and made sure it makes sense

Suggested Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added Google Colab support for the newly added notebooks.
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I ran make format; make lint to appease the lint gods

@dosubot (bot) added the size:XS label (This PR changes 0-9 lines, ignoring generated files.) on Jun 20, 2024

@logan-markewich
Collaborator

Shouldn't we just fix the logic? Lol

@wencan
Contributor Author

wencan commented Jun 20, 2024

@logan-markewich
I can't think of a good solution on short notice either. However, I believe this bug comes from several small defects compounding, so we could start at the source and tighten up the related logic. In the meantime, maybe I should add a warning in the code?

@nerdai
Contributor

nerdai commented Jun 20, 2024

@wencan maybe for now we can add a check to see if the problem did occur, and if it did then raise the warning?
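
A minimal sketch of the kind of check suggested here, not code from this PR or from llama-index. It assumes the problem shows up as a chunk whose token count exceeds `chunk_size`; the function name `find_oversized_chunks` and its `tokenizer` parameter are hypothetical, and a whitespace tokenizer stands in for whatever tokenizer the splitter actually uses.

```python
import warnings
from typing import Callable, List, Sequence


def find_oversized_chunks(
    chunks: Sequence[str],
    chunk_size: int,
    tokenizer: Callable[[str], list],
) -> List[int]:
    """Return the indices of chunks whose token count exceeds chunk_size.

    Emits one warning per oversized chunk instead of raising, so callers
    keep their results but are told that the size limit was not honored.
    """
    oversized = []
    for index, chunk in enumerate(chunks):
        token_count = len(tokenizer(chunk))
        if token_count > chunk_size:
            warnings.warn(
                f"Chunk {index} has {token_count} tokens, "
                f"exceeding chunk_size={chunk_size}.",
                stacklevel=2,
            )
            oversized.append(index)
    return oversized


# Toy usage: str.split is a stand-in tokenizer (an assumption for this sketch).
chunks = ["a b c", "a b c d e f"]
oversized = find_oversized_chunks(chunks, chunk_size=4, tokenizer=str.split)
```

A check like this only reports the symptom. It would not replace fixing the splitting logic, but it would make the failure visible until a fix lands.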

@logan-markewich
Collaborator

@wencan actually, if you have a test case where this happens, I can probably just work backwards from that

@nerdai
Contributor

nerdai commented Jun 20, 2024

> @wencan actually, if you have a test case where this happens, I can probably just work backwards from that

Yeah this makes sense to me!

@wencan
Contributor Author

wencan commented Jun 20, 2024

@logan-markewich @nerdai

from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.schema import Document, MetadataMode

text = """你所描述的情况可能与身体健康有关,尤其是与压力、疲劳和动机相关的身体状态。长时间的工作压力和疲劳可能导致身体功能下降,包括记忆力、注意力和决策能力。此外,焦虑和压力可能会影响你的情绪状态和工作表现,从而形成一个恶性循环。
以下是一些可能与你的情况相关的健康概念:
1\. **慢性疲劳**:长时间的工作和缺乏休息可能导致身体的疲劳,这种慢性疲劳可能会影响你的肌肉恢复和整体健康。
2\. **营养不足**:你提到的对工作的忽视可能导致饮食不规律和营养不足,这可能会影响你的体力和精力。
3\. **体能和耐力**:如果你的工作不再给你提供足够的体能锻炼,或者你感觉自己的体能有所下降,这可能会影响你的工作表现。
4\. **自我照顾**:如果你忽视了对身体的照顾,比如不按时吃饭、不运动,可能会导致身体机能的下降。
5\. **应对策略**:你可能会采取一些应对策略来处理工作压力,比如依赖咖啡或能量饮料来提神,或者熬夜来完成工作。
为了应对这些挑战,你可以尝试以下策略:
\- **休息和恢复**:确保你有足够的休息时间,这对于恢复体力和精神状态至关重要。
\- **时间管理和优先级设定**:尝试合理规划你的时间,优先处理最重要的任务。
\- **寻求支持**:和家人、朋友或同事交流你的感受,或者寻求专业的健康咨询。
\- **自我反思**:思考你的生活方式和工作习惯,以及它们是否对你的健康有益。
\- **健康规划**:考虑你的长期健康规划,是否需要调整你的生活方式或寻求更健康的习惯。
\- **身体保健**:如果可能,尝试一些提高身体机能的活动,如瑜伽、太极或其他健身课程。
记住,你的身体健康是生活的基础。如果工作压力和疲劳影响了你的生活质量,那么采取行动来改变这种状况是至关重要的。专业的健康支持可能会对你有所帮助。"""

doc = Document(text=text, extra_info={
    'title': '教育的主要性 教育是人类社会发展的基石',
    'keywords': '教育、 文化、 学习、 人才、 成长、 创造、 未来、 资源、 关注、 才华和潜力'
})

# magic: parser = SentenceSplitter(chunk_size=512, chunk_overlap=64, paragraph_separator='\n')
parser = SentenceSplitter(chunk_size=512, chunk_overlap=64)
nodes = parser.get_nodes_from_documents([doc])
print([len(node.get_content(MetadataMode.ALL)) for node in nodes])
# output: [441, 537]

Labels
size:XS This PR changes 0-9 lines, ignoring generated files.
3 participants