All scrapers should normalize whitespace in the 'title' and 'description' fields. #75

Open
onyxfish opened this issue Sep 5, 2009 · 5 comments

Comments

@onyxfish
Owner

onyxfish commented Sep 5, 2009

There are some erroneous line returns and tabs getting stored. I haven't seen this with any other CSPAN scrapers, but it's possible it is affecting others.

EDIT

A definition: a normalized field will be one where all whitespace between characters has been reduced to a single space.
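
For illustration, the definition above could be sketched in Python roughly as follows. The function name `normalize_whitespace` is hypothetical and not taken from the project, and trimming leading and trailing whitespace is an extra assumption beyond the definition, which only covers whitespace between characters.

```python
import re


def normalize_whitespace(value):
    """Collapse each run of whitespace (spaces, tabs, line returns) to one space.

    Assumption: leading and trailing whitespace is dropped as well.
    """
    return re.sub(r"\s+", " ", value).strip()


print(normalize_whitespace("House  Session,\n\tPart 1 "))
```

The same result could likely be had with `" ".join(value.split())`; the regular expression just makes the "reduce to a single space" rule explicit.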

@chaunceyt
Collaborator

will check this.
Is this all fields?

@chaunceyt
Collaborator

can you provide an example?

@chaunceyt
Collaborator

let me know if I fixed this.

@onyxfish
Owner Author

onyxfish commented Sep 8, 2009

Don't have time to check the fix before work, but I wanted to follow up anyway. I had to sleep on this one, because I'm worried about a sort of slippery-slope issue if we start performing all sorts of post-processing to "clean up" the data. However, the more I thought about it, the more reasonable it seemed. After all, we are generating some of these fields from scratch anyway, so why not make sure they are consistent?

So I think we should go ahead and add checks to the Scraper engines to make sure that the 'title' and 'description' fields are normalized. It seems to me the expected output formats for this data will expect that, and we don't want a stray tab or line return breaking someone's page formatting.
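
One way such a check might look, as a rough sketch only: it assumes each scraper hands the engine a plain dict per item, which may not match how the Scraper engines actually pass data around. The names `normalize_record` and `NORMALIZED_FIELDS` are hypothetical.

```python
# Hypothetical: the two fields named in this issue.
NORMALIZED_FIELDS = ("title", "description")


def normalize_record(record):
    """Return a copy of record with whitespace normalized in selected fields.

    Assumption: record is a dict; fields that are missing or not strings
    are left untouched.
    """
    cleaned = dict(record)
    for field in NORMALIZED_FIELDS:
        value = cleaned.get(field)
        if isinstance(value, str):
            # split() with no arguments breaks on any whitespace run
            cleaned[field] = " ".join(value.split())
    return cleaned


item = {"title": "Senate\tSession\n", "description": "Floor  debate", "url": "/a b"}
print(normalize_record(item))
```

Running this in one shared place, rather than inside each scraper, would keep the cleanup limited to the two named fields and leave everything else as scraped.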

I'm retitling the question.

@onyxfish
Owner Author

onyxfish commented Sep 8, 2009

I forgot to mention that we should strip HTML tags from both of these fields, too. I know you already did that for at least one scraper. Most of the links contained in the descriptions have relative URLs anyway.
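
A minimal sketch of tag stripping, assuming Python 3 and only the standard library's `html.parser`. The names `strip_tags` and `_TextExtractor` are hypothetical, and this is not necessarily how the existing scraper does it.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect only the text content, discarding every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(value):
    """Return value with HTML tags removed and their inner text kept."""
    parser = _TextExtractor()
    parser.feed(value)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


print(strip_tags('See <a href="/video/1">the video</a> for <b>details</b>.'))
```

Since removing a tag can leave doubled spaces behind, it probably makes sense to strip tags first and normalize whitespace afterwards.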
