feat: add support for nested fields #1

johnie · 2024-10-26T15:11:13Z

Pull Request Type

Feature
Bug Fix
Refactor
General Change

Summary

This pull request adds nested field support for HTML data extraction: the createScraper function can now handle nested structures in the data schema. The README has also been reformatted for readability and now documents the new nested field feature.

Changes Made

README Enhancements:
- Improved formatting of the feature list for readability.
- Added examples of the newly supported nested field extraction to demonstrate the feature.
Nested Field Support in Types:
- Expanded the FieldDefinition type to support nested field definitions through a new NestedFieldDefinition type. This update allows fields to be defined with sub-fields, enabling the extraction of complex HTML structures.
Helper Function extractData:
- Created a new helper function extractData within createScraper.ts to manage nested field extraction recursively. This function processes each field in the schema, distinguishing between simple and nested fields, and extracting values accordingly.
Update to createScraper:
- Modified createScraper to incorporate extractData, enhancing its capability to parse nested structures while maintaining backward compatibility.
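
The recursive handling described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the type names mirror the description, but their shapes (`selector`, `transform`, `fields`), the `Lookup` abstraction over the HTML document, and the sample values are all assumptions made for the example.

```typescript
// Hypothetical shapes: the names follow the PR description, the fields are assumed.
type Transform = (value: string) => unknown;

interface SimpleFieldDefinition {
  selector: string;
  transform?: Transform;
}

interface NestedFieldDefinition {
  fields: Record<string, FieldDefinition>;
}

type FieldDefinition = SimpleFieldDefinition | NestedFieldDefinition;

// Assumed abstraction over the parsed HTML: returns the raw value for a selector.
type Lookup = (selector: string) => string | undefined;

// A field is treated as nested when it carries its own `fields` map.
function isNested(field: FieldDefinition): field is NestedFieldDefinition {
  return "fields" in field;
}

// Walk the schema; recurse into nested fields, read simple fields directly.
function extractData(
  fields: Record<string, FieldDefinition>,
  lookup: Lookup,
): Record<string, unknown> {
  const result: Record<string, unknown> = {};
  for (const [key, field] of Object.entries(fields)) {
    if (isNested(field)) {
      result[key] = extractData(field.fields, lookup);
    } else {
      const raw = lookup(field.selector);
      // Apply the transform only when a value was actually found.
      result[key] = raw !== undefined && field.transform ? field.transform(raw) : raw;
    }
  }
  return result;
}

// Illustrative input: a flat map standing in for meta tags in a document.
const meta: Record<string, string> = {
  "og:title": "Example Title",
  "og:image": "https://example.com/image.png",
  "og:image:width": "1372",
  "og:image:height": "708",
};

const data = extractData(
  {
    title: { selector: "og:title" },
    image: {
      fields: {
        url: { selector: "og:image" },
        width: { selector: "og:image:width", transform: Number },
        height: { selector: "og:image:height", transform: Number },
      },
    },
  },
  (selector) => meta[selector],
);
```

Because simple fields take the non-recursive branch unchanged, a flat schema produces the same flat object it would without nesting, which is the sense in which a design like this stays backward compatible.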

How to Test

Use a sample HTML file containing meta tags for nested fields, such as og:image, og:image:width, and og:image:height.
Define a schema with nested fields in the scraper configuration.
Run the scraper on the HTML content and verify that the nested fields (e.g., image URL, width, height) are accurately extracted and structured according to the schema.
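
A fixture for these steps might look like the sketch below. The HTML, the image URL, and the `readMeta` helper are hypothetical: `readMeta` is a regex stand-in used only to keep the example self-contained, and in a real test the nested object would come from the scraper itself.

```typescript
// Sample HTML fixture with Open Graph image tags (values are illustrative).
const html = `
<html>
  <head>
    <meta property="og:image" content="https://example.com/cover.png" />
    <meta property="og:image:width" content="1372" />
    <meta property="og:image:height" content="708" />
  </head>
</html>`;

// Hypothetical stand-in for the scraper: reads one meta tag's content by property name.
// The closing quote after the property name keeps "og:image" from matching "og:image:width".
function readMeta(source: string, property: string): string | undefined {
  const pattern = new RegExp(
    `<meta\\s+property="${property}"\\s+content="([^"]*)"`,
  );
  return source.match(pattern)?.[1];
}

// The nested shape the schema is expected to produce for the image field.
const image = {
  url: readMeta(html, "og:image"),
  width: Number(readMeta(html, "og:image:width")),
  height: Number(readMeta(html, "og:image:height")),
};

const expected = {
  url: "https://example.com/cover.png",
  width: 1372,
  height: 708,
};

// Structural comparison; key order matches, so JSON serialization is sufficient here.
const matches = JSON.stringify(image) === JSON.stringify(expected);
```

Checking `width` and `height` as numbers rather than strings is worth doing explicitly, since meta tag content is always text and the conversion is where a nested schema is most likely to go wrong.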

Possible Regressions

None identified; the changes are backward compatible. Even so, scrapers that define deeply nested fields should be tested to confirm expected behavior.

Screenshots/Logs

// Expected Output
{
  title: 'Example Title',
  description: 'An example description.',
  keywords: ['typescript', 'html', 'parsing'],
  views: 1234,
  image: {
    url: 'https://example.se/images/...',
    width: 1372,
    height: 708
  }
}

Additional Notes

These changes make the scraper more flexible when parsing structured HTML content with nested attributes, extending it to more complex data extraction tasks.

johnie added 4 commits October 26, 2024 17:09

feat: enhance README with nested field support for image extraction and improved feature list formatting

694bc16

feat: expand FieldDefinition type to support nested field definitions in scrape configuration

2268a8c

feat: add extractData helper to support nested field definitions

55dabad

feat: implement support for nested schemas in scraper tests and add validations for nested image data extraction

e64439e

johnie merged commit a74dcb7 into main Oct 26, 2024
1 check passed

johnie deleted the feature/nested-fields branch October 26, 2024 15:17
