Get older data #1

Open
Thicool opened this issue Oct 10, 2016 · 25 comments

@Thicool

Thicool commented Oct 10, 2016

How can I get older posts for a given hashtag? The provided function only mines the 21 posts on the starting page.

@panda0881
Owner

8.get_media_from_tag(self, tag_name):

The input of this method is the tag name you are interested in. The output of this method is two lists. The first one contains all the media codes belonging to top_post under this tag, while the second one is the full list of all the media codes under this tag.

Maybe you can look into this function to see if it solves your problem? The longer list should consist of all the media under the specific hashtag.
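A minimal usage sketch (the crawler class name and how it is constructed are my assumption; only the method name and its two return lists come from the description above):

# Hypothetical instantiation; adapt to how the crawler object is actually created in this repo.
crawler = InstagramCrawler()
top_codes, all_codes = crawler.get_media_from_tag('mcdonalds')
print(len(top_codes), len(all_codes))  # top_post codes vs. the full list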

Best Regards,
Hongming

@panda0881
Owner

Dear Jan,

It's nice hearing from you. If I understand correctly, you are trying to collect information about all the media under a specific hashtag, right? If that is the case, the 21 posts you mentioned are the content that comes with the HTML document; the others are loaded by a JavaScript function. Typically, you have to use dynamic crawling to solve this problem. As you can see in my code, you can try the 8th function:

8.get_media_from_tag(self, tag_name):

The input of this method is the tag name you are interested in. The output of this method is two lists. The first one contains all the media codes belonging to top_post under this tag, while the second one is the full list of all the media codes under this tag.

If anything doesn't work or you still have questions about that, feel free to contact me 😁, I'm very happy to solve the problem with you.

Best Regards,

Hongming

On Tue, Oct 11, 2016 at 12:14 AM, Thicool [email protected] wrote:

Hey there,
is there a way to get information about old tags? I am new to Python and would like to get the captions of posts that are posted under a certain hashtag. The script works fine but gives me back 21 posts, which is exactly the number of posts that one finds on the explore page. Is there a way to tell the script to find the older posts and extract their information as well?

thanks for your help,
Jan



@Thicool
Author

Thicool commented Oct 11, 2016

Thanks for the help! That works fine now. I modified the function that gets all the media codes, and it now gets me the media codes, captions and dates 👍

@Thicool
Author

Thicool commented Oct 12, 2016

Hey Hongming,
I tried to run my code to get all data for a tag, but it seems that it only works until around 4500 posts are scraped. Same issue with the original get_media_from_tag script.

error.pdf

Do you have a guess what is going wrong there?

Thanks so much for your help!

@panda0881
Owner

Hi Jan,

I looked into your problem. The error here is a KeyError, which indicates that there is something wrong with the data you get from the URL. There are several possible reasons for this problem: 1. the server may have found out that you are a bot; 2. the server may have deleted several pages, but the previous page still shows there are more. In my past experience, the second reason is the most likely one. Anyway, I added a check to the program so that if a similar problem occurs, it will let you know and keep running.
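For reference, the check is roughly along these lines (a sketch, not the exact commit; the 'media' key comes from the snippets quoted later in this thread):

def has_media(result):
    # Sketch of the added check: warn and signal the caller to stop
    # when the server sends a failure payload instead of the 'media' key.
    if 'media' not in result:
        print('Unexpected response from server:', result)
        return False
    return True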

Thank you for letting me know.

Best Regards,
Hongming

@Thicool
Author

Thicool commented Oct 12, 2016

Hey Hongming,
Thanks for your fast response.
Your changes to the code helped me a lot, insofar as when the problem occurs, the script now stops without an error and the collected data can be used. However, for what I am trying to do, I probably need more data. Is there a way to skip the missing pages and continue afterwards instead of stopping the whole program?

Thanks for your help!!

Regards,
Jan

edit: It is not a tag-specific problem; other tags also hit the error after around 4500 posts...

@panda0881
Owner

You can try to build a loop that keeps contacting the server until you get the right response. Here is an example I used before in another program.

import time

import requests

s = requests.Session()

def request_until_succeed(url):
    # Re-issue the request inside the loop: the original version sent a
    # single request and then checked the same stale response forever.
    # (requests exposes .status_code and .text, not ['status_code'] or .read())
    while True:
        try:
            response = s.get(url)
            if response.status_code == 200:
                break
        except requests.RequestException:
            pass
        print("Retrying...")
        time.sleep(5)
    return response.text

Maybe you can change the response.status_code == 200 check into some code that checks whether there is a 'media' key or not.

Best Regards,
Hongming

@Thicool
Author

Thicool commented Oct 12, 2016

I looked at the last post that causes the error:

except KeyError: print result

and it returned:

{u'status': u'fail', u'message': u'\u5f88\u62b1\u6b49\uff0c\u8bf7\u6c42\u6b21\u6570\u8fc7\u591a\u3002\u8bf7\u7a0d\u540e\u91cd\u8bd5\u3002'}

(The escaped message is Chinese for "Sorry, too many requests. Please try again later.")

There is no information in this post to get. My problem is that there is also no 'end_cursor', so I cannot skip it and just go to the next post with

self.collect_media_list(tag_name, result['media']['page_info']['end_cursor'])

As I need this data for my studies, I would be very thankful if you could look at that problem and maybe provide a solution....

Thanks so much
Jan

@panda0881
Owner

Hi Jan,

Sorry for the delayed response, I just got up. I understand your problem here, but the thing is, if there is no appropriate response from the server, you can't skip it. There is a possible way to tackle the problem: you may consider debugging the program to find out what response the server actually sends. The response status can tell you why you can't get the appropriate data, and then you can fix the underlying issue. For instance, if you got blocked by the server, it may mean that you need to add a delay to the program.

Btw, do you mind telling me which hashtag you are interested in? Maybe I can help find out where the problem is.

Best Regards,
Hongming

@Thicool
Author

Thicool commented Oct 13, 2016

Hey Hongming,
I am studying brand image perceptions, so any global brand can be interesting for this study. I tried #mcdonalds to get the data. But the problem occurs on other hashtags at around the same point, so I think it is unlikely that the hashtag data itself has a problem.
Thanks for your help, really appreciate it!!!! :)

@panda0881
Owner

Hi Jan,

I think I found a way to solve your problem. When I tested the program, the response from the server was 429, which means the bot has sent too many requests. So I added a 0.5-second delay before each request. So far, the program runs well and I can get more than 100000 media items.

But there is another problem: when you get too many loops (about 1000 levels of recursion), you may meet the error "maximum recursion depth exceeded". The exact limit may depend on your computer. To solve that problem, you may consider changing the recursion into a loop structure, roughly as sketched below.
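A rough sketch of that restructuring (fetch_tag_page is a hypothetical helper returning the parsed JSON for one page; the 'media', 'page_info' and 'end_cursor' keys appear earlier in this thread, while 'nodes', 'code' and 'has_next_page' are my assumptions about the payload layout):

import time

def collect_all_media(tag_name):
    # Iterative replacement for the recursive crawl: following end_cursor
    # in a while loop never hits Python's recursion limit.
    media_codes = []
    end_cursor = ''
    has_next_page = True
    while has_next_page:
        result = fetch_tag_page(tag_name, end_cursor)  # hypothetical helper
        if 'media' not in result:
            print('Unexpected response, stopping:', result)
            break
        page = result['media']
        media_codes.extend(node['code'] for node in page.get('nodes', []))
        has_next_page = page['page_info']['has_next_page']
        end_cursor = page['page_info']['end_cursor']
        time.sleep(0.5)  # throttle each request to avoid HTTP 429
    return media_codes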

Btw, I used to analyze brand image, and a data size over 10k may cause a problem for your computation power.

Best Regards,
Hongming

@Thicool
Author

Thicool commented Oct 13, 2016

Thanks for your patience and help.
However, the actual code does not seem to work for me.
After a small amount of data (sometimes a few hundred, sometimes around a thousand) the script stops.

(screenshot attached: unbenannt)

I have no idea what is going wrong now.

Best Regards,
Jan

@panda0881
Owner

Hi Jan,

In my humble view, the problem here may be that your bot is detected by the Instagram server. There are two ways to solve this: 1. you can change the time delay according to your own network situation and actual purpose; 2. change your IP address from time to time.

PS: the 0.5-second delay works fine for my situation, but you may need to adjust it for yours.
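A minimal sketch of the first approach, backing off with a growing delay whenever the server answers 429 (a generic pattern, not code from this repository):

import time

import requests

def get_with_backoff(session, url, max_retries=5):
    # Retry a GET with exponentially growing delays while the server answers 429.
    delay = 1
    for _ in range(max_retries):
        response = session.get(url)
        if response.status_code != 429:
            return response
        time.sleep(delay)
        delay *= 2  # wait twice as long after each rejection
    return response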

For the second solution, you may want to check the following website:
How to avoid HTTP error 429 (Too Many Requests) python

Best Regards,
Hongming

@Thicool
Author

Thicool commented Oct 18, 2016

Hi Hongming,

I created a looping version with sleep and it worked out more or less fine. At a high number (ran this two times now) I got a TypeError at around 30k. I have no idea how this can happen:
(screenshot attached: error)
If you have no idea either, can you maybe send me the 100k #mcdonalds file you created? I don't know how to get this information any other way, as there is no adequate script to be found on this page.
If it is bot detection again, I will now try increasing the sleep time to 3 seconds.

edit: I tried to save the end_cursor to continue when there is a problem at that point. However, if I start with the last end_cursor, the program immediately stops again.
edit: It seems that these end_cursor strings change over time, so this makes no sense.

Thanks so much
Jan

@panda0881
Owner

Hi Jan,

I was busy with my midterm exams, sorry about that. I tried to collect the 100k file for you, but the program finished collecting data at 53k. I have the data attached. I will try again later; if I have any success, I will let you know.

Btw, it is a list stored in JSON format.

Best Regards,
Hongming

@panda0881
Owner

Hi Jan,

I tried one more time and the program stopped at the same position, which is 53156. So my guess is that the total number of media items under this hashtag is 53156, and all the others may have been deleted or stored on some other archival storage machine that can't be accessed easily.

Btw, when the program stops, there is no error. It just shows that there is no next page.

Best Regards,
Hongming

@Thicool
Author

Thicool commented Oct 25, 2016

Hey Hongming,

My biggest success was 50k as well. I am trying VPN and IP reset stuff, but no success. But if I keep repeating this over the next month, I think the data will be okay for what I am doing. Thanks for all your help, and if you have a breakthrough, let me know :)

Good luck on your exams, and cheers from Germany


@Thicool
Author

Thicool commented Mar 20, 2017 via email

@panda0881
Owner

Hi Jan,

I just tried your problem. I can't get information from that page before logging in, but once I logged in, I could successfully get data from that page. Maybe you can try the login function first. If you still have any problems, let me know~

Hongming

@Thicool
Author

Thicool commented Mar 21, 2017 via email

@panda0881
Owner

Hi Jan,

I think I just fixed the problem: in the header file, the value of x-instagram-ajax should be '1' (a string) rather than 1 (an integer). Thanks for the bug report haha.
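For reference, the fixed header entry looks like this (the other headers are omitted here):

headers = {'x-instagram-ajax': '1'}  # must be the string '1'; the integer 1 made requests fail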

You can pull it again and try it; if you have any other questions, let me know~

Good luck on your project, I'm glad that this little project helps you.

Hongming

@Thicool
Author

Thicool commented Mar 21, 2017 via email

@Thicool
Author

Thicool commented Jun 29, 2017 via email

@Thicool
Author

Thicool commented Jun 29, 2017 via email

@panda0881
Owner

panda0881 commented Jun 30, 2017 via email
