How to store tabular data with a context #16630
-
To store tabular data with context in HTML documents for accurate representation in LlamaIndex, you can use the `UnstructuredElementNodeParser`. Here's a basic example of how you can use it:

```python
from llama_index.legacy.node_parser.relational.unstructured_element import (
    UnstructuredElementNodeParser,
)
from llama_index.legacy.schema import Document, IndexNode, TextNode

# Sample HTML document with tables and text
test_data = Document(
    text="""
<!DOCTYPE html>
<html>
<body>
<table>
    <tr>
        <td>Year</td>
        <td>Benefits</td>
    </tr>
    <tr>
        <td>2020</td>
        <td>12,000</td>
    </tr>
</table>
<p>Test paragraph</p>
</body>
</html>
"""
)

# Initialize the node parser
node_parser = UnstructuredElementNodeParser()

# Extract nodes from the document
nodes = node_parser.get_nodes_from_documents([test_data])

# Check the types of nodes extracted (TextNode for text, IndexNode for tables)
for node in nodes:
    print(type(node))
```

This code will parse the HTML content and extract nodes, ensuring that both tabular and textual data are accurately represented.
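As a minimal sketch of storing and querying the extracted nodes afterwards (the `llama_index.core` import assumes a recent package layout, an embedding model and LLM are assumed to be configured already, and this skips the recursive-retrieval setup the relational parsers are usually paired with):

```python
from llama_index.core import VectorStoreIndex

# Build a vector index over the nodes produced above and query it.
index = VectorStoreIndex(nodes=nodes)
query_engine = index.as_query_engine()
response = query_engine.query("What were the benefits in 2020?")
print(response)
```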
-
Looking briefly into the UnstructuredElementNodeParser code, it detects tables and creates summaries for them, but I am not sure it keeps the original cell values. If someone asks for a specific parameter, e.g. a cell value identified by its row and column titles, can my RAG system answer?
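One rough way to check what actually survives (reusing the `nodes` from the snippet above) is to print each node's content and metadata:

```python
from llama_index.legacy.schema import IndexNode, TextNode

# Do the table nodes still contain the raw cells ("2020", "12,000"),
# or only an LLM-generated summary?
for node in nodes:
    print(type(node).__name__)
    print(node.get_content())  # text stored on the node
    print(node.metadata)       # summaries/mappings attached to the node, if any
    print("-" * 40)
```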
-
According to the example notebook for ChainOfTableQueryEngine, it can be used when the source is purely tabular, not a mix of text and tables. It looks like my tables would have to be extracted into separate storage and queried differently from the main content, with the resulting nodes somehow combined afterward. It is also unclear why both HTMLNodeParser and UnstructuredElementNodeParser exist, when the latter handles HTML tables better.
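If it does come to that, a rough sketch of splitting the tables out of the HTML before indexing might look like this (assuming pandas with lxml for `read_html` and BeautifulSoup; how the two stores are queried and recombined afterward is the open question):

```python
import io

import pandas as pd
from bs4 import BeautifulSoup

html = test_data.text  # the HTML document from the first snippet

# Pull every <table> into a DataFrame for separate, table-aware storage/querying.
tables = pd.read_html(io.StringIO(html), header=0)

# Remove the tables from the HTML and keep the remaining text
# for the regular vector index.
soup = BeautifulSoup(html, "html.parser")
for table in soup.find_all("table"):
    table.decompose()
plain_text = soup.get_text(separator="\n", strip=True)

print(tables[0])
print(plain_text)
```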
-
My HTML documents consist of text and tables. Currently, all the text is extracted and the resulting plain text is loaded into LlamaIndex. This means cell data are merged column by column, row by row, ending up as total nonsense and losing the original context. Would it make sense to wrap the tabular data in markdown markup instead (and import it as plain text into LlamaIndex)? I hope vector search would still work, and when the top nodes are passed to the LLM to synthesize the answer, this context could be understood better.
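Roughly, this is what I mean by the markdown wrapping, as a sketch (assuming pandas plus tabulate for `to_markdown`, and BeautifulSoup for the HTML handling):

```python
import io

import pandas as pd
from bs4 import BeautifulSoup


def html_tables_to_markdown(html: str) -> str:
    """Replace each <table> with a markdown rendering, then return plain text."""
    soup = BeautifulSoup(html, "html.parser")
    for table in soup.find_all("table"):
        # header=0: treat the first row as the column titles.
        df = pd.read_html(io.StringIO(str(table)), header=0)[0]
        table.replace_with(df.to_markdown(index=False))
    return soup.get_text(separator="\n", strip=True)


# The resulting text (paragraphs plus markdown tables) would then be loaded
# into LlamaIndex as a plain-text Document.
```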
Are there better ways to handle the mix of text and tabular data?