
Problem pulling more than 10,000 clientIDs #283

Open

allen-black opened this issue Oct 15, 2019 · 37 comments

@allen-black commented Oct 15, 2019

What goes wrong

I am trying to pull the number of page views each clientID has over a period of time (the longer-term goal is to get ga:clientID and ga:pagePath together, but I run into this issue in that report, too). I get a result of only 10,000 rows, no matter the date range, even though GA tells me I have more users than that in the period. Other dimensions report more rows, but for some reason this one is limiting me.

I had max = -1 in before and just tried max = 999999 based on previous posts.

pageViews <- function (start, end) {
  data <- google_analytics(
    viewId = ga_id,
    date_range = c(start, end),
    metrics = "ga:pageviews",
    dimensions = c('ga:clientID'),
    max = 999999
  )
  return(data)
}

Steps to reproduce the problem

  • Find a View that has more than 10,000 unique clientIDs in a period.
  • Run the code above and check whether you get more than 10,000 rows.

Expected output

  • More than 10,000 rows of unique clientIDs

Actual output

  • Always limited to 10,000 no matter the date.

API output (with verbose logging)
2019-10-15 07:47:11> Multi-call to API
2019-10-15 07:47:12> Calling APIv4....
2019-10-15 07:47:12> Multiple v4 batch
2019-10-15 07:47:12> Fetching v4 data batch...
2019-10-15 07:47:12> Request: https://analyticsreporting.googleapis.com/v4/reports:batchGet/
2019-10-15 07:47:12> Body JSON parsed to: {"reportRequests":[{"viewId":"ga:1911XXXXX","dateRanges":[{"startDate":"2019-10-12","endDate":"2019-10-14"}],"samplingLevel":"DEFAULT","dimensions":[{"name":"ga:clientID"}],"metrics":[{"expression":"ga:pageviews","alias":"ga:pageviews","formattingType":"METRIC_TYPE_UNSPECIFIED"}],"pageToken":"0","pageSize":10000,"includeEmptyRows":true},{"viewId":"ga:1911XXXXX","dateRanges":[{"startDate":"2019-10-12","endDate":"2019-10-14"}],"samplingLevel":"DEFAULT","dimensions":[{"name":"ga:clientID"}],"metrics":[{"expression":"ga:pageviews","alias":"ga:pageviews","formattingType":"METRIC_TYPE_UNSPECIFIED"}],"pageToken":"10000","pageSize":10000,"includeEmptyRows":true},{"viewId":"ga:1911XXXXX","dateRanges":[{"startDate":"2019-10-12","endDate":"2019-10-14"}],"samplingLevel":"DEFAULT","dimensions":[{"name":"ga:clientID"}],"metrics":[{"expression":"ga:pageviews","alias":"ga:pageviews","formattingType":"METRIC_TYPE_UNSPECIFIED"}],"pageToken":"20000","pageSize":10000,"includeEmptyRows":true},{"viewId":"ga:1911XXXXX","dateRanges":[{"startDate":"2019-10-12","endDate":"2019-10-14"}],"samplingLevel":"DEFAULT","dimensions":[{"name":"ga:clientID"}],"metrics":[{"expression":"ga:pageviews","alias":"ga:pageviews","formattingType":"METRIC_TYPE_UNSPECI....[{"expression":"ga:pageviews","alias":"ga:pageviews","formattingType":"METRIC_TYPE_UNSPECIFIED"}],"pageToken":"980000","pageSize":10000,"includeEmptyRows":true},{"viewId":"ga:1911XXXXX","dateRanges":[{"startDate":"2019-10-12","endDate":"2019-10-14"}],"samplingLevel":"DEFAULT","dimensions":[{"name":"ga:clientID"}],"metrics":[{"expression":"ga:pageviews","alias":"ga:pageviews","formattingType":"METRIC_TYPE_UNSPECIFIED"}],"pageToken":"990000","pageSize":9999,"includeEmptyRows":true}]}
2019-10-15 07:49:22> Downloaded [10000] rows from a total of [10001].

Session Info

sessionInfo() output:
R version 3.4.4 (2018-03-15)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS 10.14.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] factoextra_1.0.5 cluster_2.0.7-1 skmeans_0.2-11
[4] forcats_0.4.0 stringr_1.4.0 dplyr_0.8.2
[7] purrr_0.3.2 readr_1.3.1 tidyr_0.8.3
[10] tibble_2.1.3 ggplot2_3.2.1 tidyverse_1.2.1
[13] googleAnalyticsR_0.6.0

loaded via a namespace (and not attached):
[1] Rcpp_1.0.1 cellranger_1.1.0 pillar_1.4.2 compiler_3.4.4 googleAuthR_0.8.0
[6] tools_3.4.4 digest_0.6.20 packrat_0.5.0 clue_0.3-57 lubridate_1.7.4
[11] jsonlite_1.6 memoise_1.1.0 nlme_3.1-137 gtable_0.3.0 lattice_0.20-38
[16] pkgconfig_2.0.2 rlang_0.4.0 cli_1.1.0 rstudioapi_0.10 curl_3.3
[21] ggrepel_0.8.1 haven_2.1.1 withr_2.1.2 xml2_1.2.2 httr_1.4.0
[26] askpass_1.1 generics_0.0.2 hms_0.4.2 grid_3.4.4 tidyselect_0.2.5
[31] glue_1.3.1 R6_2.4.0 readxl_1.3.1 modelr_0.1.4 magrittr_1.5
[36] backports_1.1.4 scales_1.0.0 rvest_0.3.4 assertthat_0.2.1 colorspace_1.4-1
[41] stringi_1.4.3 openssl_1.4 lazyeval_0.2.2 munsell_0.5.0 slam_0.1-45
[46] broom_0.5.2 crayon_1.3.4


@allen-black (Author)

One other note: I have another View that saves the clientID via a customTask to custom dimension 19. If I use that instead of clientID, I am able to get 51,436 results when I pull ga:dimension19 with ga:pagePath.
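A sketch of that custom-dimension pull, assuming ga_id holds an authenticated View ID, dimension index 19 from the note above, and an illustrative date range:

```r
library(googleAnalyticsR)

# Sketch only: ga_id, the dimension index (19) and the dates are
# assumptions taken from this thread, not a verified configuration.
cid_paths <- google_analytics(
  viewId = ga_id,
  date_range = c("2019-10-12", "2019-10-14"),
  metrics = "ga:pageviews",
  dimensions = c("ga:dimension19", "ga:pagePath"),
  max = -1  # -1 requests all available rows
)
```

Because the custom dimension is an ordinary dimension, it is not subject to the 10,000-row cap that clientID hits in this thread.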

@MarkEdmondson1234 (Collaborator)

I can confirm this, and it's weird. Can you see if it occurs via another library, such as the GA Query Explorer: https://ga-dev-tools.appspot.com/query-explorer/

If so, then it's an API bug to be reported to Google.

@MarkEdmondson1234 (Collaborator)

The Query Explorer doesn't have a clientID dimension yet either. But I also tried it with a property that captures clientId in dimension75, and it downloaded 1 million entries (the max total) with no issue, while the same viewId returned only 10,000 rows via the clientId dimension.

For now, stick to using the custom dimension if you have it, but I think this is a bug to report to Google.

@MarkEdmondson1234 (Collaborator)

I filed it on this issue; if you have more details, please add them there: https://issuetracker.google.com/issues/142795352

@allen-black (Author)

Thanks for looking into this! When I run that with a different viewId, I definitely get more data, and it ends with nextPageToken: "19999".

@Sozonov commented Dec 10, 2019

Hello. I have the same problem. Any changes here?

@MarkEdmondson1234 (Collaborator)

No updates yet; the issue occurs within the API itself, so we need to wait for any updates there.

@Sozonov commented Dec 10, 2019

@MarkEdmondson1234 thanks. Do you have a link to the issue in the API tracker? Has work on this issue started?

@MarkEdmondson1234 (Collaborator)

Not really; the private thread at the link just has an example. The dimension is officially in beta, so if it comes out of beta it should appear in the GA API news feed or become available in the Query Explorer online.

@MarkEdmondson1234 commented Mar 17, 2020

It looks like this may be a limit per API call, so if the API call is broken down into calls returning under 10k rows each, you can get all the data. If this is confirmed, I'll do this automatically in the function.

To do this, something like the below should work:

ga_call <- function(date_range, ...){
  # one call per day in the range
  per_day <- seq(date_range[[1]], date_range[[2]], by = 1)
  calls <- lapply(per_day, function(x){
      message("Fetching: ", x)
      google_analytics(..., date_range = c(x, x))
  })
  Reduce(rbind, calls)
}

my_date_range <- c(as.Date("2019-01-01"), as.Date("2020-01-01"))

ga_call(date_range = my_date_range, {put other google_analytics() arguments here})

It may be doable by using slow_fetch=TRUE:

google_analytics({etc}, slow_fetch=TRUE)

This won't help if you have more than 10k users a day. That will need to wait for the API to update.

@allen-black (Author)
Awesome. Thank you for continuing to look at this, @MarkEdmondson1234 ! I look forward to hearing if you get confirmation and are able to update the function as you note.

@MarkEdmondson1234 commented May 27, 2020

Maciej Franas has this workaround if you need more than 10k a day:

For anyone else looking for an answer in the future (hello, time-traveler), here's the answer:

# 'output' collects one data.frame per day; 'z', 'hours' and 'view' must
# already be defined (see the note below the code)
output <- as.list(rep(0, length(z)))
for(i in 1:length(z)){
  day <- as.list(rep(0, 24))
  for(j in 1:length(hours)){
    day[[j]] <- google_analytics(view,
                                 date_range = c(z[i], z[i]),
                                 dimensions = c('ga:deviceCategory',
                                                'ga:clientId',
                                                'ga:hour'),
                                 metrics = c('ga:users'),
                                 filtersExpression = paste0('ga:users>0;ga:hour==', hours[j]),
                                 anti_sample = TRUE
    )
  }
  output[[i]] <- do.call('rbind', day)
}

where 'z' is a vector of dates and 'hours' is a vector of two-digit hour strings (e.g. '01', '02').
This is obviously painfully slow and will most likely exhaust API limits, but nothing else works, at least for now.

@mgahan-he

Thanks for the workaround!

@mgahan commented Dec 11, 2020

Anyone try this lately? I have tried it a couple of times tonight and I have not been running into the limit. I wonder if this has been magically fixed!!

@RhysJackson

> Anyone try this lately? I have tried it a couple of times tonight and I have not been running into the limit. I wonder if this has been magically fixed!!

Agreed, I'm no longer hitting the 10,000 row limit when pulling clientId. No official update published in the API changelog though: https://developers.google.com/analytics/devguides/changelog

@mgahan-he
I notice that I still run into the limit at certain times - I don’t have a great feel for when it works and when it doesn’t.

@MarkEdmondson1234 commented Feb 9, 2021

@everleazy is still getting the limit:

> Hello. Apologies in advance, I'm using Google Translate. =)
>
> I've run into a problem: I can't get more than 10,000 rows when requesting clientId data.
> There are about 30,000 users per day in Analytics.
> When I query a single day, I get exactly 10,000 rows.
>
> I tried the "max" parameter (-1) and anti_sample.
> I still get only 10,000 rows.

@mgahan commented Feb 9, 2021

I am finding that I get limited out if I pull today or yesterday’s client IDs.

But if I pull 2 or more days in the past, it seems to work great. Not sure why this happens.

@DC-A commented Feb 24, 2021

Hi!

I've always downloaded client id and got unlimited results, but suddenly I'm getting just 10,000 rows.
Has anyone figured out why this happens?

@mgahan-he commented Aug 27, 2021

For the past 9 months or so, I have been able to pull clientId without much problem. However, starting yesterday, it appears like I am hitting the 10k limit once again.

@PedjaV commented Sep 6, 2021

> For the past 9 months or so, I have been able to pull clientId without much problem. However, starting yesterday, it appears like I am hitting the 10k limit once again.

I have the same problem, and one interesting thing: when I remove "campaign" from my code, it works just fine.
Has anyone found what the problem is?

@mgahan-he

@PedjaV Great find! I just tried without the campaign dimension and it appears to work well. A bummer that I can't get campaign data, but for now this is easier than pulling everything by the hour.

@wilfreddesert

I still can't get more than 10k rows when using just ga:date, ga:clientId and ga:hits, with the date range set to just 2 days. Do I understand correctly that this is still a bug in the Core Reporting API, and the only real workaround is hour-based filtering?

@MarkEdmondson1234 (Collaborator)

Yes. It's not really a bug, more a beta feature that may or may not be supported in the future. The best option is to put your clientId into a custom dimension and pull that instead.


@paullevchuk

The only thing that mostly helps me overcome this limit is to write my R code like this:

    # load_ga_data() is the author's own wrapper around google_analytics();
    # .id, metrics, .dims, .start_date and .end_date are defined elsewhere
    df <- tibble(h = 0:23) %>% 
      mutate(h2 = sprintf("ga:hour==%02d", h),
             df_h = map(h2, ~{
               print(glue::glue("GA View: {.id}: {.x}"));
               load_ga_data(.id, metrics, .dims, .x, .start_date, .end_date, max = -1)
               })) %>% 
      unnest(cols = df_h) %>% 
      select(-h, -h2)

@MarkEdmondson1234 (Collaborator)
So looping over hours works?

@wilfreddesert

@paullevchuk, I do not speak R, but do I understand correctly that you are using an hour-based filter, collecting data for each hour 0-23, and merging everything?

@paullevchuk commented Feb 19, 2022

> So looping over hours works?

From what I've learned working with the GA API, there is a limit of 10K clientIds per request call, so switching from day granularity to hours helps me get accurate data.

But even then, there is one hour during the day when my data has more than 10K clientIds. When that's the case, the googleAnalyticsR package unfortunately makes only 2 requests and I get something like 10,001 records.
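One possible refinement for that over-10K hour, not verified anywhere in this thread: split it further by minute using the standard ga:minute time dimension in filtersExpression. Here 'view', 'day' and 'busy_hour' are hypothetical placeholder names, and whether minute granularity actually avoids the cap is an assumption.

```r
library(googleAnalyticsR)

# Hypothetical sketch: break one busy hour into 60 per-minute requests.
# 'view', 'day' and 'busy_hour' (e.g. "14") are placeholders.
minute_filters <- sprintf("ga:hour==%s;ga:minute==%02d", busy_hour, 0:59)

hour_parts <- lapply(minute_filters, function(f){
  google_analytics(view,
                   date_range = c(day, day),
                   dimensions = "ga:clientId",
                   metrics = "ga:users",
                   filtersExpression = f,
                   max = -1)
})
busy_hour_df <- do.call(rbind, hour_parts)
```

This has the same caveat as the hourly loop above: it multiplies the number of API calls, so quota limits are reached sooner.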

@paullevchuk commented Feb 19, 2022

> @paullevchuk, I do not speak R, but do I understand correctly that you are using an hour-based filter, collecting data for each hour 0-23, and merging everything?

Yes, but this is not a filter; it's a dimension.

@MarkEdmondson1234 (Collaborator)

It's a similar approach to a comment further up in the thread. It's an API issue, so the same problem and workaround should apply to other GA SDKs.
