Get older data #1
Dear Jan, It's nice hearing from you. If I don't get it wrong, you are trying to … 8. get_media_from_tag(self, tag_name): The input of this method is the tag name you are interested in. The output of this method is two lists. The first one contains all the media codes that belong to top_post under this tag, while the second one is the full list of all the media codes under this tag. Maybe you can look into this function to see if it solves your problem? The longer list should consist of all the media under the specific hashtag. If anything doesn't work or you still have a question about that, feel free … Best Regards, Hongming
|
Thanks for the help! That works fine now. I modified the function that gets all the media codes, and it now gets me the media codes, captions and dates 👍 |
Hey Hongming, do you have a guess what is going wrong there? Thanks so much for your help! |
Hi Jan, I looked into your problem. The error here is a KeyError, which indicates that something is wrong with the data you get from the URL. There are several possible reasons for this problem: 1. the server may have found out that you are a bot; 2. the server may have deleted several pages while the previous page still reports that there are more. In my past experience, the second reason is the most likely one. Anyway, I added a check to the program so that if a similar problem occurs, it will let you know and keep running. Thank you for letting me know. Best Regards, |
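A check of the kind described above could look roughly like the following. This is only a minimal sketch, not the code that was actually added to the project: the helper name `get_page_info` is hypothetical, and the `'nodes'` key is an assumption about the response layout (the `['media']['page_info']['end_cursor']` path is the one that shows up in the tracebacks in this thread).

```python
def get_page_info(data):
    """Return (nodes, end_cursor), or None if the expected keys are missing.

    Assumed layout: data['media']['nodes'] and
    data['media']['page_info']['end_cursor'].
    """
    try:
        media = data['media']
        return media['nodes'], media['page_info']['end_cursor']
    except (KeyError, TypeError):
        # KeyError: a key is absent; TypeError: data is None or not a dict.
        return None


page = get_page_info({'status': 'ok'})
if page is None:
    print('Unexpected response, skipping this page and continuing')
```

Returning None instead of raising lets the caller log the problem and keep running, which is the behaviour described in the comment above.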
Hey Hongming. Thanks for your help!! Regards, Edit: It is not a tag-specific problem; other tags also hit the error after around 4500 tags... |
You can try to build a loop that keeps connecting to the server until you get the right response. Here is an example I used before in another program.
Maybe you can change the response['status_code'] == 200 check into some code that checks whether there is a 'media' entry or not. Best Regards, |
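The original example did not survive in this thread, so the following is only a sketch of the retry idea under stated assumptions. The names `fetch_until_valid` and `has_media` are hypothetical, `fetch` stands for whatever function performs the request and returns the decoded response, and the attempt limit is an addition so the loop cannot run forever.

```python
import time


def has_media(payload):
    """Validity check suggested above: the response must contain 'media'."""
    return isinstance(payload, dict) and 'media' in payload


def fetch_until_valid(fetch, is_valid, max_attempts=5, delay_seconds=1.0,
                      sleep=time.sleep):
    """Call fetch() until is_valid(result) is true or attempts run out.

    Returns the first valid result, or None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        result = fetch()
        if is_valid(result):
            return result
        if attempt < max_attempts:
            sleep(delay_seconds)  # wait before trying again
    return None
```

A caller would pass its own request function, for example `fetch_until_valid(lambda: get_page(cursor), has_media)`, where `get_page` is again a placeholder for the real request.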
I looked at the last post that causes the error: and it returned:
There is no information in this post to get. My problem is that there is also no 'end_cursor', so I cannot skip it and just go to the next post with
As I need this data for my studies, I would be very thankful if you could look at that problem and maybe provide a solution.... Thanks so much |
Hi Jan, Sorry for the delayed response, I just got up. I understand your problem here, but the thing is that if there is no appropriate response from the server, you can't skip it. There is a possible way to solve that problem, though. You may consider debugging the program to find out what the server responds. The response status can tell you why you can't get the appropriate response, and you can then fix the real cause. For instance, if you got blocked by the server, it may mean that you need to add a delay to the program. Btw, do you mind telling me which hashtag you are interested in? Maybe I can help find out where the problem is. Best Regards, |
Hey Hongming, |
Hi Jan, I think I found a way to solve your problem. When I tested the program, the response from the server was 429, which means that the bot has sent too many requests to the server. So I added a 0.5 second delay for each request. So far, the program runs well and I can get more than 100000 media. But there is another problem: when you go through too many loops (about 1000 recursive calls), you may hit the error: maximum recursion depth exceeded. This may depend on your computer. To solve that problem, you may consider changing the recursion into a loop structure. Btw, I used to analyze brand image, and a data size over 10k may cause a problem in computation power. Best Regards, |
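The two suggestions above (a delay between requests and a loop instead of recursion) can be combined as sketched below. This is an illustration only, not the project's `collect_media_list`: `fetch_page` is a hypothetical callable that takes a cursor and returns a dict with the assumed keys `'nodes'`, `'has_next_page'` and `'end_cursor'`.

```python
import time


def collect_all_media(fetch_page, delay_seconds=0.5, sleep=time.sleep):
    """Walk through all pages with a while loop instead of recursion.

    Because no function calls itself, the number of pages is not limited
    by Python's recursion depth.
    """
    media = []
    cursor = None  # None means "start at the first page"
    while True:
        page = fetch_page(cursor)
        media.extend(page.get('nodes', []))
        if not page.get('has_next_page') or not page.get('end_cursor'):
            break  # no further page reported
        cursor = page['end_cursor']
        sleep(delay_seconds)  # throttle to avoid 429 responses
    return media
```

The 0.5 second default mirrors the value mentioned above; a suitable delay probably depends on the network and on how the server throttles requests.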
Hi Jan, In my humble view, your problem here may be that your bot is detected by the Instagram server. There are two ways to solve this problem: 1. you can change the time delay according to your own network situation and actual purpose; 2. you can change your IP address from time to time. PS: the 0.5 second delay works fine for my situation, but you may need to change that according to your own situation. For the second solution, you may want to check the following website: Best Regards, |
Hi Jan, I was busy with my midterm exams, sorry about that. I tried to collect the 100k file for you, but the program finished collecting data at 53k. I have attached the data. I will try again later; if I have any success, I will let you know. Btw, it is a list stored in JSON format. Best Regards, |
Hi Jan, I tried one more time and the program stops at the same position, which is 53156. So my guess is that the total number of media under this hashtag is 53156, and all the others may have been deleted or stored on some older storage machine and can't be accessed easily. Btw, when the program stops, there is no error. It just shows that there is no next page. Best Regards, |
Hey Hongming, My biggest success was 50k as well. I am experimenting with VPN and IP resets. Good luck on your exams and cheers from Germany
|
Hi Hongming,
I have been doing pretty well on crawling Instagram, and my master thesis
about "brands on instagram" is nearly done. My professor even wants me to
do my PhD on this stream of research. Again, big thanks for your help to
achieve this!!!!
However, for some days now, accessing the Instagram feed has not worked
for me anymore.
I have tried different HTTP libraries for Python like requests, httplib2,
urllib2, but I always get a 404 error when I try to get data from:
"https://www.instagram.com/explore/tags/porsche/" or any other hashtag.
However, other URLs from Instagram work fine, so I guess they changed
something in the URL of their feed. I am confused because I can access the
URL in my browser but Python cannot find it.
I don't want to bother you, but do you have any idea what they changed so
that I cannot access their data anymore?
Many thanks for your advice and all the best,
Jan
|
Hi Jan, I just tried to reproduce your problem. I can't get information from that page before I log in, but once I am logged in, I can successfully get data from that page. Maybe you can try the login function first. If you still have any problem, let me know~ Hongming |
Hi,
thanks for your answer. That's what I figured out as well now: they changed
the login so that you now need to log in to see "recent uploads".
So now I am stuck with the login function. I created a dummy account for
crawling but somehow get this error:
M = InstagramSpider()
M.login('userbrand10001', 'passwortbrand10001')
…-->InvalidHeader: Header value 1 must be of type str or bytes, not <type 'int'>
Some months ago, I had no problem using the login function. Any ideas? See
attachment for the full error message.
Thank you so much! I definitely need to mention you if the paper gets
published :)
Cheers
Jan
---------------------------------------------------------------------------
InvalidHeader Traceback (most recent call last)
<ipython-input-4-1cabee21dd0d> in <module>()
1 M = InstagramSpider()
----> 2 M.login('userbrand10001', 'passwortbrand10001')
<ipython-input-3-db1115ab7f0e> in login(self, username, password)
53 }
54 data = {'username': username, 'password': password}
---> 55 self.s.post('https://www.instagram.com/accounts/login/ajax/', data=data, headers=headers)
56
57 def get_user_data(self, name):
C:\Users\Jankl\Anaconda2\lib\site-packages\requests\sessions.pyc in post(self, url, data, json, **kwargs)
533 """
534
--> 535 return self.request('POST', url, data=data, json=json, **kwargs)
536
537 def put(self, url, data=None, **kwargs):
C:\Users\Jankl\Anaconda2\lib\site-packages\requests\sessions.pyc in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
472 hooks = hooks,
473 )
--> 474 prep = self.prepare_request(req)
475
476 proxies = proxies or {}
C:\Users\Jankl\Anaconda2\lib\site-packages\requests\sessions.pyc in prepare_request(self, request)
405 auth=merge_setting(auth, self.auth),
406 cookies=merged_cookies,
--> 407 hooks=merge_hooks(request.hooks, self.hooks),
408 )
409 return p
C:\Users\Jankl\Anaconda2\lib\site-packages\requests\models.pyc in prepare(self, method, url, headers, files, data, params, auth, cookies, hooks, json)
301 self.prepare_method(method)
302 self.prepare_url(url, params)
--> 303 self.prepare_headers(headers)
304 self.prepare_cookies(cookies)
305 self.prepare_body(data, files, json)
C:\Users\Jankl\Anaconda2\lib\site-packages\requests\models.pyc in prepare_headers(self, headers)
441 for header in headers.items():
442 # Raise exception on invalid header value.
--> 443 check_header_validity(header)
444 name, value = header
445 self.headers[to_native_string(name)] = value
C:\Users\Jankl\Anaconda2\lib\site-packages\requests\utils.pyc in check_header_validity(header)
794 except TypeError:
795 raise InvalidHeader("Header value %s must be of type str or bytes, "
--> 796 "not %s" % (value, type(value)))
797
798
InvalidHeader: Header value 1 must be of type str or bytes, not <type 'int'>
|
Hi Jan, I think I just fixed the problem. In the headers, the value of x-instagram-ajax should be '1' rather than 1. Thanks for the bug report haha. You can pull it again and try it; if you have any other questions, let me know~ Good luck on your project, I'm glad that this little project helps you. Hongming |
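As the traceback earlier in the thread shows, requests rejects header values that are not str or bytes. One way to guard against this kind of bug is to convert every header value to a string before sending, as in the sketch below. The helper `stringify_headers` is hypothetical and not part of the project, and the code assumes Python 3.

```python
def stringify_headers(headers):
    """Return a copy of headers in which every name and value is a str."""
    return {str(name): str(value) for name, value in headers.items()}


# An int slipped in as a header value, as in the bug discussed above.
raw_headers = {'x-instagram-ajax': 1}
headers = stringify_headers(raw_headers)
print(headers)
```

The resulting dictionary can then be passed as the `headers=` argument of a requests call. Writing `'1'` directly in the header dictionary, as in the fix above, is of course the simpler option when the values are known in advance.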
Login works fine now! I will implement it in the rest of my bot later and
see if I can bring my crawler back to work :D
Thank you so much!
|
Hey Hongming,
I have another problem occurring with the functions that use cookies. Here
is an example error message:
_______________________________________________________________________________________________
Traceback (most recent call last):
  File "C:/Users/Jankl/PycharmProjects/Get_old_insta/Instagram_Spider.py", line 450, in <module>
    data = M.get_media_from_tag('droetker')
  File "C:/Users/Jankl/PycharmProjects/Get_old_insta/Instagram_Spider.py", line 335, in get_media_from_tag
    self.collect_media_list(tag_name, data['media']['page_info']['end_cursor'])
  File "C:/Users/Jankl/PycharmProjects/Get_old_insta/Instagram_Spider.py", line 313, in collect_media_list
    result = tmp_result.json()
  File "C:\Users\Jankl\Anaconda2\lib\site-packages\requests\models.py", line 866, in json
    return complexjson.loads(self.text, **kwargs)
  File "C:\Users\Jankl\Anaconda2\lib\site-packages\simplejson\__init__.py", line 501, in loads
    return _default_decoder.decode(s)
  File "C:\Users\Jankl\Anaconda2\lib\site-packages\simplejson\decoder.py", line 370, in decode
    obj, end = self.raw_decode(s)
  File "C:\Users\Jankl\Anaconda2\lib\site-packages\simplejson\decoder.py", line 400, in raw_decode
    return self.scan_once(s, idx=_w(s, idx).end())
simplejson.scanner.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
_______________________________________________________________________________________________
I also applied the "x-instagram-ajax should be '1' rather than 1" change, because
the error that I described earlier appeared here too.
If you think this is a simple problem to solve, I would be very happy if you
could help me with this one. If it is difficult, I need to look for other
solutions.
Thanks for your help!
Btw: We have submitted our first paper to a marketing journal. If it gets
published someday, I'll let you know :)
Big thanks and kind regards,
Jan
|
I found out that I get a <Response [405]>, so no data was received at line 119:
if not end_cursor:
    data = 'q=ig_user(' + user_id + \
           ')+%7B%0A++followed_by.first(10)+%7B%0A++++count%2C%0A++++page_info+%7B%0A++++++end_cursor%2C%0A+' \
           '+++++has_next_page%0A++++%7D%2C%0A++++nodes+%7B%0A++++++id%2C%0A++++++is_verified%2C%0A++++++fol' \
           'lowed_by_viewer%2C%0A++++++requested_by_viewer%2C%0A++++++full_name%2C%0A++++++profile_pic_url%2' \
           'C%0A++++++username%0A++++%7D%0A++%7D%0A%7D%0A&ref=relationships%3A%3Afollow_list'
    result = self.s.post('https://www.instagram.com/query/', data=data, headers=headers)
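A 405 response usually carries no JSON body, which would explain the "Expecting value: line 1 column 1 (char 0)" error when `.json()` is called on it. A defensive pattern is to check the status code before decoding, roughly as sketched below. The function `parse_json_response` is a hypothetical helper, not part of Instagram_Spider.py, and it works on the plain status code and body text (with requests these would be `response.status_code` and `response.text`).

```python
import json


def parse_json_response(status_code, text):
    """Return the decoded JSON body, or None if the response is unusable."""
    if status_code != 200:
        # e.g. 405 (method not allowed), 404, or 429 (too many requests)
        return None
    try:
        return json.loads(text)
    except ValueError:
        # The body was empty or not JSON, for example an HTML error page.
        return None


result = parse_json_response(405, '')
if result is None:
    print('No usable data received, check the status code and the endpoint')
```

This does not make the endpoint work again, but it turns the crash into a clear message, so the status code can be inspected first.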
|
Dear Jan,
Sorry for the late reply. I'm currently out of my office. I will look into the code tomorrow and see how I can help. As you know, Instagram may change their backend from time to time, so we may need to change our system from time to time haha.
Btw, congratulations on the paper^_^
Best regards,
Hongming
|
How can I get older posts for a given hashtag? The provided function only mines the 21 posts on the starting page.