How to store tabular data with a context #16630
-
To store tabular data with context in HTML documents for accurate representation in LlamaIndex, you can use the `UnstructuredElementNodeParser`. Here's a basic example of how you can use it:

```python
from llama_index.legacy.node_parser.relational.unstructured_element import (
    UnstructuredElementNodeParser,
)
from llama_index.legacy.schema import Document, IndexNode, TextNode

# Sample HTML document with tables and text
test_data = Document(
    text="""
<!DOCTYPE html>
<html>
<body>
<table>
    <tr>
        <td>Year</td>
        <td>Benefits</td>
    </tr>
    <tr>
        <td>2020</td>
        <td>12,000</td>
    </tr>
</table>
<p>Test paragraph</p>
</body>
</html>
"""
)

# Initialize the node parser
node_parser = UnstructuredElementNodeParser()

# Extract nodes from the document
nodes = node_parser.get_nodes_from_documents([test_data])

# Check the types of nodes extracted (TextNode for text, IndexNode for tables)
for node in nodes:
    print(type(node))
```

This code will parse the HTML content and extract nodes, ensuring that both tabular and textual data are accurately represented.
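As a minimal sketch of storing and querying the extracted nodes afterwards (the `llama_index.core` import assumes a recent package layout, an embedding model and LLM are assumed to be configured already, and this skips the recursive-retrieval setup the relational parsers are usually paired with):

```python
from llama_index.core import VectorStoreIndex

# Build a vector index over the nodes produced above and query it.
index = VectorStoreIndex(nodes=nodes)
query_engine = index.as_query_engine()
response = query_engine.query("What were the benefits in 2020?")
print(response)
```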
-
Looking briefly into the UnstructuredElementNodeParser code, it detects tables and creates summaries for them, but I am not sure it keeps the original cell values. If someone asks for a specific parameter, e.g. a cell value identified by its row and column titles, can my RAG system answer?
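One rough way to check what actually survives (reusing the `nodes` from the snippet above) is to print each node's content and metadata:

```python
from llama_index.legacy.schema import IndexNode, TextNode

# Do the table nodes still contain the raw cells ("2020", "12,000"),
# or only an LLM-generated summary?
for node in nodes:
    print(type(node).__name__)
    print(node.get_content())  # text stored on the node
    print(node.metadata)       # summaries/mappings attached to the node, if any
    print("-" * 40)
```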
-
According to the example notebook for ChainOfTableQueryEngine, it can be used when the source is purely tabular, not a mix of text and tables. It looks like my tables would have to be extracted into separate storage and queried differently from the main content, with the resulting nodes somehow combined afterward. It is also unclear why both HTMLNodeParser and UnstructuredElementNodeParser exist, when the latter handles HTML tables better.
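If it does come to that, a rough sketch of splitting the tables out of the HTML before indexing might look like this (assuming pandas with lxml for `read_html` and BeautifulSoup; how the two stores are queried and recombined afterward is the open question):

```python
import io

import pandas as pd
from bs4 import BeautifulSoup

html = test_data.text  # the HTML document from the first snippet

# Pull every <table> into a DataFrame for separate, table-aware storage/querying.
tables = pd.read_html(io.StringIO(html), header=0)

# Remove the tables from the HTML and keep the remaining text
# for the regular vector index.
soup = BeautifulSoup(html, "html.parser")
for table in soup.find_all("table"):
    table.decompose()
plain_text = soup.get_text(separator="\n", strip=True)

print(tables[0])
print(plain_text)
```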
-
My HTML documents consist of text and tables. Currently, all the text is extracted and the resulting plain text is loaded into LlamaIndex. This means cell data are merged column by column, row by row, ending up as total nonsense and losing the original context. Would it make sense to wrap the tabular data in markdown markup instead (and import it as plain text into LlamaIndex)? I hope vector search would still work, and when the top nodes are passed to the LLM to synthesize the answer, this context could be understood better.
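Roughly, this is what I mean by the markdown wrapping, as a sketch (assuming pandas plus tabulate for `to_markdown`, and BeautifulSoup for the HTML handling):

```python
import io

import pandas as pd
from bs4 import BeautifulSoup


def html_tables_to_markdown(html: str) -> str:
    """Replace each <table> with a markdown rendering, then return plain text."""
    soup = BeautifulSoup(html, "html.parser")
    for table in soup.find_all("table"):
        # header=0: treat the first row as the column titles.
        df = pd.read_html(io.StringIO(str(table)), header=0)[0]
        table.replace_with(df.to_markdown(index=False))
    return soup.get_text(separator="\n", strip=True)


# The resulting text (paragraphs plus markdown tables) would then be loaded
# into LlamaIndex as a plain-text Document.
```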
Are there better ways to handle the mix of text and tabular data?