feat: add support for nested fields #1

Merged
johnie merged 4 commits into main from feature/nested-fields
Oct 26, 2024

@johnie (Owner) commented Oct 26, 2024

Pull Request Type

  • Feature
  • Bug Fix
  • Refactor
  • General Change

Summary

This pull request adds nested field support for HTML data extraction: the createScraper function can now handle nested structures within the data schema. The README was also reformatted for readability and now documents the nested field feature.

Changes Made

  1. README Enhancements:

    • Improved formatting of the feature list for readability.
    • Added examples of the newly supported nested field extraction to demonstrate the feature.
  2. Nested Field Support in Types:

    • Expanded the FieldDefinition type to support nested field definitions through a new NestedFieldDefinition type. This update allows fields to be defined with sub-fields, enabling the extraction of complex HTML structures.
  3. Helper Function extractData:

    • Created a new helper function extractData within createScraper.ts to manage nested field extraction recursively. This function processes each field in the schema, distinguishing between simple and nested fields, and extracting values accordingly.
  4. Update to createScraper:

    • Modified createScraper to incorporate extractData, enhancing its capability to parse nested structures while maintaining backward compatibility.
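The recursive approach described in items 2 and 3 can be sketched as follows. This is a minimal illustration, not the code merged in this PR: the type names FieldDefinition, NestedFieldDefinition, and extractData come from the description above, but their shapes here (a `selector` key, an optional `transform`, a `fields` map) and the `Lookup` callback standing in for a real DOM query are assumptions.

```typescript
// Hypothetical shapes: names follow the PR description, bodies are assumptions.
type SimpleFieldDefinition = {
  selector: string; // key passed to the lookup callback
  transform?: (raw: string) => unknown; // optional post-processing of the raw string
};

type NestedFieldDefinition = {
  fields: Record<string, FieldDefinition>; // sub-fields, possibly nested again
};

type FieldDefinition = SimpleFieldDefinition | NestedFieldDefinition;

// Stand-in for a real DOM query (assumption): returns the raw value or undefined.
type Lookup = (selector: string) => string | undefined;

function isNested(def: FieldDefinition): def is NestedFieldDefinition {
  return "fields" in def;
}

// Walks the schema; recurses on nested definitions, looks up simple ones.
function extractData(
  fields: Record<string, FieldDefinition>,
  lookup: Lookup,
): Record<string, unknown> {
  const result: Record<string, unknown> = {};
  for (const [key, def] of Object.entries(fields)) {
    if (isNested(def)) {
      result[key] = extractData(def.fields, lookup);
    } else {
      const raw = lookup(def.selector);
      result[key] =
        raw !== undefined && def.transform ? def.transform(raw) : raw;
    }
  }
  return result;
}

// Usage with an in-memory map in place of parsed HTML.
const meta = new Map<string, string>([
  ["og:title", "Example Title"],
  ["og:image", "https://example.com/image.png"],
  ["og:image:width", "1372"],
  ["og:image:height", "708"],
]);

const data = extractData(
  {
    title: { selector: "og:title" },
    image: {
      fields: {
        url: { selector: "og:image" },
        width: { selector: "og:image:width", transform: Number },
        height: { selector: "og:image:height", transform: Number },
      },
    },
  },
  (selector) => meta.get(selector),
);

console.log(data);
```

Because a schema without any `fields` entry never reaches the recursive branch, flat schemas behave as before, which is one way the backward compatibility mentioned in item 4 could hold.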

How to Test

  1. Use a sample HTML file containing meta tags for nested fields, such as og:image, og:image:width, and og:image:height.
  2. Define a schema with nested fields in the scraper configuration.
  3. Run the scraper on the HTML content and verify that the nested fields (e.g., image URL, width, height) are accurately extracted and structured according to the schema.
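A fixture for step 1 might look like the sketch below. The HTML string and the `metaContent` helper are hypothetical: the helper is a small regex-based stand-in used only to make the example self-contained, and it is not the scraper's API. In a real test, the schema from step 2 would be passed to createScraper instead.

```typescript
// Fixture for step 1: Open Graph image tags that share the og:image prefix.
const html = `
<html>
  <head>
    <meta property="og:image" content="https://example.com/cover.png" />
    <meta property="og:image:width" content="1372" />
    <meta property="og:image:height" content="708" />
  </head>
</html>`;

// Stand-in for the scraper (assumption): returns the content attribute of the
// first <meta property="..."> tag whose property matches exactly.
function metaContent(source: string, property: string): string | undefined {
  const pattern = /<meta\s+property="([^"]*)"\s+content="([^"]*)"/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(source)) !== null) {
    if (match[1] === property) {
      return match[2];
    }
  }
  return undefined;
}

// Step 3: the three flat meta tags end up grouped under one nested object.
const image = {
  url: metaContent(html, "og:image"),
  width: Number(metaContent(html, "og:image:width")),
  height: Number(metaContent(html, "og:image:height")),
};

console.log(image);
```

The exact-match comparison matters here: a prefix match on "og:image" would also hit the width and height tags, which is an easy mistake when verifying nested fields by hand.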

Possible Regressions

  • None identified; the changes are backward compatible. Scrapers with deeply nested schemas should still be tested to confirm expected behavior.

Screenshots/Logs

// Expected Output
{
  title: 'Example Title',
  description: 'An example description.',
  keywords: ['typescript', 'html', 'parsing'],
  views: 1234,
  image: {
    url: 'https://example.se/images/...',
    width: 1372,
    height: 708
  }
}

Additional Notes

These changes make it possible to parse structured HTML content with nested attributes, which extends the scraper to more complex data extraction tasks.

@johnie merged commit a74dcb7 into main on Oct 26, 2024
1 check passed
@johnie deleted the feature/nested-fields branch on October 26, 2024 15:17