There are some erroneous line returns and tabs getting stored. I haven't seen this with any other CSPAN scrapers, but it's possible it is affecting others.
EDIT
A definition: a normalized field will be one where all whitespace between characters has been reduced to a single space.
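For illustration, that definition could be sketched in Python roughly as follows. The function name `normalize_whitespace` is hypothetical, not something from the scraper code, and trimming leading and trailing whitespace is an extra assumption that the definition does not spell out.

```python
def normalize_whitespace(text):
    """Collapse every run of whitespace (spaces, tabs, line returns) to one space.

    Assumption: leading and trailing whitespace is dropped as well, which
    str.split() with no arguments does as a side effect.
    """
    # split() with no separator splits on any run of whitespace and
    # discards empty strings, so joining with " " leaves single spaces only.
    return " ".join(text.split())


if __name__ == "__main__":
    raw = "Senate\tSession\r\n  on  Budget "
    print(repr(normalize_whitespace(raw)))
```

A regular expression such as `re.sub(r"\s+", " ", text)` would do the same collapsing but would keep a single leading or trailing space, so which variant fits depends on how the fields are consumed.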
Don't have time to check the fix before work, but I wanted to follow up anyway. I had to sleep on this one, because I'm worried about a sort of slippery-slope issue if we start performing all sorts of post-processing to "clean up" the data. However, the more I thought about it, the more reasonable it seemed. After all, we are generating some of these fields from scratch anyway, so why not make sure they are consistent?
So I think that we should go ahead and add checks to the Scraper engines to make sure that the 'title' and 'description' fields are normalized. It seems to me that the expected output formats for this data will assume that, and we don't want a stray tab or line return breaking someone's page formatting.
I forgot to mention that we should strip HTML tags from both these fields, too. I know you already did that for at least one scraper. Most of the links contained in the descriptions have relative URLs anyway.
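A rough sketch of what that combined cleanup might look like, in Python using only the standard library. Everything here is an assumption for illustration: the names `strip_tags`, `clean_field`, and `clean_record` are hypothetical, and the record is assumed to be a plain dict with 'title' and 'description' keys, which may not match how the Scraper engines actually store results.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text content of a fragment and ignores the tags."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html_text):
    """Return html_text with HTML tags removed (link targets are discarded)."""
    parser = _TextExtractor()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


def clean_field(value):
    """Strip tags, then reduce every run of whitespace to a single space."""
    return " ".join(strip_tags(value).split())


def clean_record(record, fields=("title", "description")):
    """Return a copy of record with the named string fields cleaned.

    Assumption: record is a dict; other keys are left untouched.
    """
    cleaned = dict(record)
    for name in fields:
        if isinstance(cleaned.get(name), str):
            cleaned[name] = clean_field(cleaned[name])
    return cleaned


if __name__ == "__main__":
    item = {
        "title": "Budget\tHearing\n",
        "description": 'See <a href="/video/123">the\n\thearing</a> &amp; more.',
        "url": "/video/123",
    }
    print(clean_record(item))
```

One caveat with this approach: text from adjacent block elements is joined without a separator, so `<p>a</p><p>b</p>` would come out as `ab`. If the descriptions contain markup like that, inserting a space at tag boundaries before normalizing may be the safer choice.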