Automated testing #23

Open: wants to merge 24 commits into base: master

Conversation

pR0Ps (Member) commented Nov 29, 2014

We have different Python versions (2.7, 3.3, 3.4, etc.), as well as different parsing libraries (lxml, html5) to support. It's pretty much impossible to manually test all the combinations, so we should create a test suite to do it for us.

I'm thinking that we could dump a few different examples of each page type (letter, expanded subject dropdown, course, term, etc), into local files then make some tests that open them and make sure that we can parse everything out of them properly using the scraper.
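A minimal sketch of what such a file-based test could look like, assuming each stored page `name.html` sits next to a `name.json` holding the data we expect to scrape from it. The directory layout, `parse_course`, and `check_fixtures` are all hypothetical names for illustration; the real scraper's parsing functions would be plugged in instead of the stand-in.

```python
import json
import re
from pathlib import Path


def parse_course(html):
    """Stand-in for the real scraper: pulls a course title out of an <h1>."""
    match = re.search(r"<h1>(.*?)</h1>", html, re.S)
    return {"title": match.group(1).strip() if match else None}


def check_fixtures(fixture_dir, parse=parse_course):
    """Parse every stored page and return the names whose output changed."""
    failures = []
    for html_path in sorted(Path(fixture_dir).glob("*.html")):
        expected_path = html_path.with_suffix(".json")
        expected = json.loads(expected_path.read_text(encoding="utf-8"))
        actual = parse(html_path.read_text(encoding="utf-8"))
        if actual != expected:
            failures.append(html_path.stem)
    return failures
```

A test runner would then only need to assert that `check_fixtures(...)` returns an empty list for each Python version and parser combination.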

I'm thinking we should have a file that specifies which subjects, courses, etc. we want to test, as well as an updater script to deal with updating them.

The updater script could use the scraper to grab the pages and store their HTML locally, as well as store the actual scraped data. Tests could then run the scraper against the stored HTML and check that its output matches the stored data. This way we can keep track of which pages we're testing against, as well as have an easy way to keep them all updated. We could then add problematic pages as we come across them (ex: the CISC subject, see #18) to prevent regressions.
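Roughly, the updater could be sketched like this. Everything here is an assumption for illustration: the `MANIFEST` structure, the `fetch(kind, name)` and `parse(kind, html)` callables (placeholders for whatever the scraper actually exposes), and the on-disk layout.

```python
import json
from pathlib import Path

# Hypothetical manifest: which pages we keep local copies of.
MANIFEST = {
    "subjects": ["ANAT", "CISC"],
}


def update_fixtures(manifest, fetch, parse, out_dir):
    """Re-download each listed page and store its HTML next to its scraped data.

    `fetch(kind, name)` must return HTML and `parse(kind, html)` must return
    JSON-serialisable data; both are placeholders for the real scraper.
    """
    out_dir = Path(out_dir)
    written = []
    for kind, names in sorted(manifest.items()):
        kind_dir = out_dir / kind
        kind_dir.mkdir(parents=True, exist_ok=True)
        for name in names:
            html = fetch(kind, name)
            data = parse(kind, html)
            (kind_dir / (name + ".html")).write_text(html, encoding="utf-8")
            (kind_dir / (name + ".json")).write_text(
                json.dumps(data, indent=2, sort_keys=True), encoding="utf-8"
            )
            written.append(kind + "/" + name)
    return written
```

Adding a problematic page then means adding one entry to the manifest and rerunning the updater.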

Some work will have to be done on the scraper so that we can pull out individual courses by name (not id). The other types (subject, section, etc.) already pull enough attributes out of the page that we can specify them by name, but the all_courses function only returns the _unique of the course, not anything non-volatile (like the course number). I think this was done just to avoid duplicating information that would be scraped out of the course page once it was loaded, not for any technical reason, so it should be relatively easy to change.

Obviously, if the SOLUS HTML changes and the scraper breaks, the updater script won't be able to update the stored pages and scraped data. In those cases we can manually update the files for local testing, then run the updater again once the problem is fixed.

I don't know if we want to actually store the HTML files in the repo or not as we might run into legal issues (putting private data in a public Git repo).

A huge benefit of this is that we won't need to actually touch SOLUS except when updating the local tests, fixing authentication issues, or actually scraping. This will speed up development, as well as allow us to not hammer SOLUS (and risk getting banned) while testing.

Anyway, this is just a huge ideas dump, let me know your thoughts on it.

mystor (Member) commented Nov 28, 2014

I think that directly storing the HTML from SOLUS in this repository is a bad idea, because of all of the landmines related to private information on SOLUS pages. For example, if I were to drop the HTML from my SOLUS onto the page, you would be able to see a lot of information about my schedule, how much money the university owes me/I owe the university, exam times, enrollment dates etc.

That being said, good test cases aren't always real-world test cases. I think that every time there is a change to the scraper, we could create artificial example pages which have a similar structure to the ones provided by SOLUS, and which allow us to ensure that there are no selector regressions. It'll be a lot more work than using the actual pages, but it'll give us much more control.
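As an illustration of the artificial-page idea, here is a minimal sketch using only the standard library's `html.parser`. The page markup, the `COURSE_TITLE$` id prefix, and the helper names are invented for this example and are not taken from SOLUS or from the scraper.

```python
from html.parser import HTMLParser

# Hand-written page: roughly the shape of a listing, but no real data.
# The id prefix "COURSE_TITLE$" is invented for this sketch.
ARTIFICIAL_PAGE = """
<html><body>
  <table>
    <tr><td><span id="COURSE_TITLE$0">TEST 100 - Example Course</span></td></tr>
    <tr><td><span id="COURSE_TITLE$1">TEST 200 - Another Course</span></td></tr>
    <tr><td><span id="UNRELATED$0">ignore me</span></td></tr>
  </table>
</body></html>
"""


class TitleCollector(HTMLParser):
    """Collects the text of every <span> whose id starts with a prefix."""

    def __init__(self, prefix):
        super().__init__()
        self.prefix = prefix
        self.titles = []
        self._capturing = False

    def handle_starttag(self, tag, attrs):
        span_id = dict(attrs).get("id") or ""
        if tag == "span" and span_id.startswith(self.prefix):
            self._capturing = True
            self.titles.append("")

    def handle_data(self, data):
        if self._capturing:
            self.titles[-1] += data

    def handle_endtag(self, tag):
        if tag == "span":
            self._capturing = False


def extract_titles(html, prefix="COURSE_TITLE$"):
    """Return the stripped text of all spans matching the id prefix."""
    collector = TitleCollector(prefix)
    collector.feed(html)
    return [title.strip() for title in collector.titles]
```

If a selector change stops matching the expected ids, `extract_titles(ARTIFICIAL_PAGE)` comes back short or empty and the regression test fails without any private data being involved.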

Unfortunately, I'm not convinced that the scraper code is modular enough right now for that to be practical.

(NB: I don't think that we should support all of those Python versions. I think we should only support Python 3.3+, and drop support for Python 2. It'll simplify testing & development with almost no drawbacks.)

@pR0Ps pR0Ps self-assigned this Dec 1, 2014
@pR0Ps pR0Ps force-pushed the automated_testing branch from 34a841d to 08f3398 Compare December 3, 2014 04:11
@pR0Ps pR0Ps force-pushed the automated_testing branch from 7fe5137 to f98b050 Compare December 6, 2014 20:27
pR0Ps (Member, Author) commented Jan 16, 2015

Currently the HTML and scraped data are being dumped into some files.

The remaining work is making a fake session that reads the files instead of requesting the page, and that knows how to transition from file to file (ex. from A -> ANAT -> ANAT 400) based on data scraped out of the downloaded files.
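One possible shape for that fake session, as a sketch only: `FileSession`, its methods, and the one-file-per-page layout are hypothetical and do not reflect the scraper's real session interface.

```python
from pathlib import Path


class FileSession:
    """Stand-in for a live session: serves stored pages instead of requesting them.

    Assumed layout (hypothetical): <root>/<page name>.html, e.g. "ANAT 400.html".
    """

    def __init__(self, root, start):
        self.root = Path(root)
        self.current = start
        self.history = [start]

    def page_source(self):
        """Return the HTML of the page the session is currently 'on'."""
        path = self.root / (self.current + ".html")
        return path.read_text(encoding="utf-8")

    def go_to(self, name, links):
        """Move to `name` if the current page links to it.

        `links` is the list of names the scraper pulled from the current page.
        """
        if name not in links:
            raise KeyError(
                "{!r} is not reachable from {!r}".format(name, self.current)
            )
        if not (self.root / (name + ".html")).exists():
            raise FileNotFoundError("no stored page for {!r}".format(name))
        self.current = name
        self.history.append(name)
        return self.page_source()
```

The transition check is driven by what was scraped from the current file, so a broken link extractor shows up as a failed transition rather than a silent skip.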

@pR0Ps pR0Ps mentioned this pull request Jun 16, 2015