forked from brighton36/libcraigscrape
== Change Log
=== Release 1.1.1 (January 20, 2013)
- A new posting format showed up (posting_hil_cto_011913.html); added it to the specs and wrote handling code
- Fixed a bug affecting the parsing of dates on some postings
- Listing specs were failing due to the year change. Resolved, and added a timecop dependency for development
=== Release 1.1 (December 25, 2012)
- ruby 1.9.3 support
- migrated from rails 2 gems to rails 3
- Replaced Net::HTTP with typhoeus
- Switched to the money gem for price storage
- fixed some new parsing bugs introduced by craigslist template changes
- added a tz parameter in the craigwatch report definition for overriding the
default timezone in your reports.
- added support for enable_starttls_auto and tls in craigwatch (Thanks olek!)
- added erb evaluation support to the craigwatch definition file (Thanks olek!)
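The erb evaluation noted for 1.1 can be pictured with a minimal sketch using only Ruby's standard library. This is an illustration under assumptions, not craigwatch's actual loading code: the definition text and the REPORT_OWNER and REPORT_TZ environment variables are hypothetical, and only the report_name and tz keys are taken from this change log.

```ruby
require 'erb'
require 'yaml'

# Hypothetical definition text. The environment variable names are
# assumptions made for this sketch.
definition_text = <<~DEFINITION
  report_name: Daily report for <%= ENV.fetch('REPORT_OWNER', 'nobody') %>
  tz: <%= ENV.fetch('REPORT_TZ', 'UTC') %>
DEFINITION

# Evaluate the ERB tags first, then parse the rendered text as plain YAML.
rendered   = ERB.new(definition_text).result
definition = YAML.safe_load(rendered)

puts definition['tz']
```

Rendering before parsing means the YAML parser never sees the ERB tags, so any value in the file can be computed at load time.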
=== Release 1.0 (February 13, 2011)
- Replaced hpricot dependency with Nokogiri. Nokogiri should be faster and more reliable. Whoo-hoo!
=== Release 0.9.1 (January 5, 2011)
- Added support for posting_has_expired? and expired post recognition
- Fixed a weird bug in craigwatch that would cause a scrape to abort if a flagged_for_removal? was encountered when using certain (minimal) filtering
=== Release 0.9 (Oct 01, 2010)
- Minor adjustments to craigwatch to fix deprecation warnings in new ActiveRecord and ActionMailer gems
- Added gem version specifiers to the Gem spec and to the require statements
- Moved repo to github
- Fixed an esoteric bug in craigwatch, affecting the last scraped post in a listing when that post was 'flagged for removal'.
- Took all those extra package-building tasks out of the Rakefile since this is 2010 and we only party with gemfiles
- Ruby 1.9 compatibility adjustments
=== Release 0.8.4 (Sep 6, 2010)
- Someone found a way to screw up hpricot's to_s method (posting1938291834-090610.html). Fixed by adding html_source to the craigslist Scraper object, which returns the body of the post without passing it through hpricot. It's a better way to go anyways, and I re-wrote a couple incidentals to use the html_source method...
- Adjusted the test cases a bit, since the user bodies being returned have less cleanup in their output than they had prior
=== Release 0.8.3 (August 2, 2010)
- Someone was posting really bad html that was screwing up Hpricot. Such is to be expected when you're soliciting html from the general public I suppose. Added test_bugs_found061710 posting test, and fixed by stripping out the user body before parsing with Hpricot.
- Added a MaxRedirectError and corresponding maximum_redirects_per_request cattr for the Craigscrape objects. This fixed a weird bug where craigslist was sending us in redirect circles around 06/10
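The redirect cap from 0.8.3 can be sketched as below. This is a standalone illustration, not the library's code: the resolve helper, the constant, and the hash that stands in for the network are all assumptions. Only the MaxRedirectError name and the idea of a per-request maximum come from the entry above.

```ruby
# Error name borrowed from the change log entry. The class body is an assumption.
class MaxRedirectError < StandardError; end

# Hypothetical limit, standing in for maximum_redirects_per_request.
MAXIMUM_REDIRECTS_PER_REQUEST = 10

# `redirects` simulates the network: it maps a url to the url it redirects to.
# A url with no entry is treated as a final page.
def resolve(url, redirects, limit = MAXIMUM_REDIRECTS_PER_REQUEST)
  hops = 0
  while redirects.key?(url)
    hops += 1
    raise MaxRedirectError, "gave up after #{limit} redirects at #{url}" if hops > limit
    url = redirects[url]
  end
  url
end

# Two urls redirecting to each other would otherwise loop forever.
circle = { '/a' => '/b', '/b' => '/a' }
begin
  resolve('/a', circle)
rescue MaxRedirectError => e
  puts e.message
end
```

Counting hops rather than remembering visited urls keeps the guard simple, and it also catches long redirect chains that never repeat.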
=== Release 0.8.2 (April 17, 2010)
- Found another odd parsing bug. Scrape sample is in 'listing_samples/mia_search_kitten.3.15.10.html'. Adjusted CraigScrape::Listings::HEADER_DATE to fix.
- Craigslist started adding <span> tags in its post summaries. Fixed. See sample in test_new_listing_span051710
=== Release 0.8.1 (Feb 10, 2010)
- Found an odd parsing bug that occurred for the first time today. Scrape sample is in 'listing_samples/mia_sss_kittens2.10.10.html'. Adjusted CraigScrape::Listings::LABEL to fix.
- Switched to require "active_support" per the deprecation notices
- Little adjustments to fix the rdoc readability
=== Release 0.8.0 (Oct 22, 2009)
- Lots of substantial changes to the API & craigwatch (though backwards compatibility is mostly there)
- Added :code_tests to the rakefile
- Report definitions don't need a full path to the :dbfile, the parameter here can now be relative to the yaml file itself
- Added a Listings::next_page
- craigwatch: When not specifying a regex in _has or _has_no, we now perform an insensitive search
- Created CraigScrape::GeoListings.find_sites, & CraigScrape::GeoListings.sites_in_path methods
- <b>Large API changes</b> Added a constructor to CraigScrape, and changed a number of ways that sites are scraped
- Changed the format of the craigwatch tracking db - you'll need to delete any db's you already have and let the migrations re-run
- craigwatch is *much* more efficient with memory. Feel free to scrape the whole world now!
- craigwatch's yml changed a bit - documented in craigwatch
- We'll more or less automatically figure out the tracking_database in craigwatch if none is specified (will default to sqlite and auto-generated filename)
- craigwatch report_name is optional too now and can largely figure itself out
- Added summary_or_full_post_has and summary_or_full_post_has_no as craigwatch report parameters
- If a craigwatch search comes up empty - we now indicate that no results were found...
- Added location_has, location_has_no to craigwatch
- Cleaned up the rdoc to clarify all the new syntax/features
- Added Scraper::retries_on_404_fail, Scraper::sleep_between_404_retries to help deal with some of the subtleties in handling connection reset errors differently than 404s
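The 0.8.0 note about _has and _has_no filters (a plain string now gets an insensitive search) can be sketched roughly as follows. The helper names filter_pattern, has? and has_no? are hypothetical and are not part of craigwatch. The sketch only assumes that a filter value is either a Regexp or something string-like.

```ruby
# Hypothetical helper: turn a filter value into a Regexp.
# A Regexp is used exactly as given. Anything else is treated as literal
# text and matched case-insensitively.
def filter_pattern(value)
  return value if value.is_a?(Regexp)

  Regexp.new(Regexp.escape(value.to_s), Regexp::IGNORECASE)
end

# Hypothetical _has / _has_no style checks built on the helper.
def has?(text, value)
  filter_pattern(value).match?(text)
end

def has_no?(text, value)
  !has?(text, value)
end

puts has?('Free KITTENS to a good home', 'kittens')   # plain string, case ignored
puts has?('Free KITTENS to a good home', /kittens/)   # explicit regex, case respected
```

Escaping the plain string keeps characters such as "." or "+" literal, so only an explicit Regexp opts in to pattern syntax and case sensitivity.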
=== Release 0.7.0 (Jul 5, 2009)
- A good bit of refactoring
- Eager-loading in the Post object without the need of the full_post method
- full_post is no longer needed or available
- Added a Base Scraper object. Maybe I'll make that its own gem...
- Post/Listing constructors now take either a Hash or a url (string) to use for its scrape. Regardless, either should always have a url object association now
- Removed the PostSummary object, now we just have Posting/ Posts include all the functionality of both PostFull and PostSummary. Be careful with Posting if you're trying to minimize bandwidth. The rdoc labels which methods won't/will/might cause a page load.
- Things should fail better (always outputting relevant html that was unable to be parsed)
- Removed Posting::date in favor of Posting::post_date
- Posting::full_url is now url
- Post::href is no longer needed. Use Post::uri.path instead
- Posting.images renamed to Posting.pics, and a new method Posting.images better reflects the craigslist designations
- Fixed a bug with some search listings, where the last page might not get scraped b/c craigslist doesn't correctly include the next page link. See 'test_nasty_search_listings' for an example
- On some listings (with the empty h4 tags) we were eager-loading when we could have made some safe assumptions instead. Fixed.
- Preliminary support for GeoListing scrapes.. (Not exactly sure where this will go - but I have ideas..)
- Adjusted Posting::images and Posting::pics to better match the craigslist media-dichotomy notation.
=== Release 0.6.5 (Jun 8, 2009)
- Added PostFull::deleted_by_author?, added test case for said condition
- Fixed a bug that caused the library to die in weird ways if there wasn't a title tag on a parsed page
- Apparently Craigslist started gzip-encoding *some* listings. Added gzip decoding support
- Found a bug when parsing the location field on some full_posts in the apa sections
- Added support for file:// uri's in the scrape_* functions, and revised the tests to use these uri's
- Fixed a bug that caused errors to be raised with legitimately empty listing pages
=== Release 0.6.0 (May 21, 2009)
- Added PostFull::flagged_for_removal?
- Fixed a couple small parse bugs found in production
- Added a ParseError object, which is raised on ... detected Parse errors. This raise will output the buggy parse, please send this over to me if you think CraigScrape is actually wrong ..
- Adjusted the examples in the readme, added a "require 'rubygems'" to the top of the listing so that they would actually work if you tried to run them verbatim (Thanks J T!)
- Restructured some of the parsing to be less lenient when scraped values aren't matching their regexp's in the PostSummary
- It seems like craigslist returns a 404 on pages that exist, for no good reason on occasion. Added a retry mechanism that won't take no for an answer, unless we get a definable number of them in a row
- Added CraigScrape cattr_accessors: retries_on_fetch_fail, sleep_between_fetch_retries.
- Adjusted craigwatch to not commit any database changes until the notification email goes out. This way if there's an error, the user won't miss any results on a re-run
- Added a FetchError for http requests that don't return 200 or redirect...
- Adjusted craigwatch to use scrape_until instead of scrape_since. This new approach cuts down on the url fetching by assuming that if we come across something we've already tracked, we don't need to keep going any further. NOTE: We still can't use a 'last_scraped_url' on the TrackedSearch model b/c sometimes posts get deleted.
- We're tracking all date-relevant posts now in craigwatch, not just the ones that matched the first time around. This should significantly speed up our scrapes
- Found an example of a screwy listing that was starting date h4's , but not actually listing the dates in them (see mia_fua_index8900.5.21.09.html test case). Added test case - handled appropriately. Seems like a bug in craigslist itself.
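The retry behaviour described for 0.6.0 (keep retrying a failed fetch until a configurable number of failures in a row) can be sketched as below. This is a minimal standalone illustration, not the library's implementation. The fetch_with_retries helper is hypothetical, its keyword names merely echo the retries_on_fetch_fail and sleep_between_fetch_retries accessors, and the default values are assumptions.

```ruby
# Error name borrowed from the change log. The class body is an assumption.
class FetchError < StandardError; end

# Hypothetical helper: run the block, retrying when it raises FetchError.
# Gives up and re-raises once more than `retries_on_fetch_fail` failures
# have happened in a row.
def fetch_with_retries(retries_on_fetch_fail: 4, sleep_between_fetch_retries: 0)
  failures = 0
  begin
    yield
  rescue FetchError
    failures += 1
    raise if failures > retries_on_fetch_fail

    sleep sleep_between_fetch_retries
    retry
  end
end

# Simulated flaky page: fails twice with a 404, then succeeds.
attempts = 0
body = fetch_with_retries(retries_on_fetch_fail: 3) do
  attempts += 1
  raise FetchError, '404 Not Found' if attempts < 3

  '<html>listing</html>'
end
puts "fetched after #{attempts} attempts"
```

Keeping the failure counter outside the begin block is what lets retry re-run the fetch without resetting the count.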
=== Release 0.5.0 (April 30, 2009)
- First release. Not much to say - hopefully someone else finds this useful.