All scrapers should normalize whitespace in the 'title' and 'description' fields. #75

onyxfish · 2009-09-05T17:44:20Z

There are some erroneous line returns and tabs getting stored. I haven't seen this with any other CSPAN scrapers, but it's possible it is affecting others.

EDIT

A definition: a normalized field will be one where all whitespace between characters has been reduced to a single space.
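For illustration only, here is a minimal sketch of that definition in Python. The thread does not say what language the scrapers are written in, so the language and the function name `normalize_whitespace` are assumptions. Note that this version also trims leading and trailing whitespace, which goes slightly beyond the definition above.

```python
def normalize_whitespace(text):
    """Collapse every run of whitespace (spaces, tabs, line returns) to one space.

    Hypothetical helper, not taken from the project. str.split() with no
    arguments splits on any run of whitespace and drops empty pieces, so
    leading and trailing whitespace is removed as a side effect.
    """
    return " ".join(text.split())


if __name__ == "__main__":
    raw = "Senate\n\tHearing   on  Health  Care\r\n"
    print(repr(normalize_whitespace(raw)))
```

If leading or trailing whitespace should be preserved as a single space rather than dropped, a regular expression substitution without the trim would be closer to the literal wording.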

chaunceyt · 2009-09-07T20:11:11Z

will check this.
Is this all fields?

chaunceyt · 2009-09-07T23:05:09Z

can you provide an example?

chaunceyt · 2009-09-08T00:12:58Z

let me know if I fixed this.

onyxfish · 2009-09-08T13:48:19Z

Don't have time to check the fix before work, but I wanted to follow up anyway. I had to sleep on this one, because I'm worried about a sort of slippery-slope issue if we start performing all sorts of post-processing to "clean up" the data. However, the more I thought about it, the more reasonable it seemed. After all, we are generating some of these fields from scratch anyway, so why not make sure they are consistent?

So I think that we should go ahead and add checks to the Scraper engines to make sure that the 'title' and 'description' fields are normalized. It seems to me that the output formats for this data will expect that, and we don't want a stray tab or line return breaking someone's page formatting.
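A rough sketch of what such an engine-level check could look like, assuming each scraped item is a plain dictionary. The record shape, the function name `normalize_record`, and the constant `NORMALIZED_FIELDS` are all hypothetical and not taken from the project.

```python
import re

# Assumption: only these two fields are normalized; everything else is left alone.
NORMALIZED_FIELDS = ("title", "description")

# Matches any run of one or more whitespace characters.
_WHITESPACE_RUN = re.compile(r"\s+")


def normalize_record(record, fields=NORMALIZED_FIELDS):
    """Return a copy of `record` with whitespace normalized in the given fields.

    Missing fields and non-string values are passed through unchanged, and
    the input dictionary is not modified.
    """
    cleaned = dict(record)
    for name in fields:
        value = cleaned.get(name)
        if isinstance(value, str):
            cleaned[name] = _WHITESPACE_RUN.sub(" ", value).strip()
    return cleaned
```

Running this once in the shared engine, rather than in each scraper, would keep the post-processing limited to the two named fields, which may help with the slippery-slope concern.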

I'm retitling the question.

onyxfish · 2009-09-08T14:47:45Z

I forgot to mention that we should strip HTML tags from both these fields, too. I know you already did that for at least one scraper. Most of the links contained in the descriptions have relative URLs anyway.
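As a sketch of the tag stripping, here is one way to do it with Python's standard-library `html.parser`. The helper names `strip_tags` and `_TextExtractor` are made up for this example, and it is not the approach used in the existing scraper mentioned above, which the thread does not show.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text content of a document, discarding all tags."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def strip_tags(html_text):
    """Remove HTML tags, then collapse whitespace to single spaces."""
    parser = _TextExtractor()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered by the parser
    return " ".join(parser.text().split())
```

One caveat: text from adjacent elements is joined with nothing in between, so `<p>a</p><p>b</p>` comes out as `ab`. Link targets are dropped along with the tags, which seems acceptable here given that most of them are relative URLs.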
