-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Artefact URLs #15
base: develop
Are you sure you want to change the base?
Artefact URLs #15
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks very much for this PR, @MeltedHyperion7 . You obviously put a lot of time and thought into it.
Apologies that it has taken me so long to complete the review. I actually started the review ages ago, and was about to add the review, but something about the code was niggling at me. Then I realised that there was the possibility to get the URLs from the provider JSON, rather than having to pass in the project URLs. Having finally figured that out (and convinced myself it was possible in all cases) I could complete the review.
There are quite a few comments to look at I'm afraid. In some cases, the changes you've made helpfully suggest further improvements. I've noted those, but have suggested where things should be fixed for this PR, and where things can be booted into the future as an issue and/or a future PR (by you or someone else).
Thanks for any further work you are able to do on this. If I've been too slow and your life has moved on, let me know and I can pick things up from here (or maybe someone else would like to).
repo = "https://github.com/robota-suite/robota-core" | ||
project = "robota-suite/robota-core" | ||
|
||
def test_branch_url(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great to see that you've added a test case for that. That's really helpful.
Would it be possible to have a test for the diff behaviour too? Or is there some complication I've not foreseen about that.
Feel free to do this in another PR, or just create an issue and leave it for someone else if you prefer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sadly, both Github and Gitlab require an API token for anything more complicated than reading off branch names.
I don't see any easy solutions. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks so much for getting back to me on this, after such a long time, @MeltedHyperion7 .
Is a token needed to access the compare diff through the REST API on a public project? The test projects I set up are public. I tried to access the diff from the test GitLab repo (mentioned later in the review) in an incognito browser window and got a non-empty JSON result back. Does the API call not work for you?
https://gitlab.com/api/v4/projects/60747754/repository/compare?from=INITIAL_COMMIT&to=main
I tried the same thing with GitHub and was able to see the JSON result:
https://api.github.com/repos/robota-suite/test-public-repo/compare/INITIAL_COMMIT...main
But maybe you're telling me that the Python libraries we're using to access these APIs need a token?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for checking this. I think it fails because the code is currently set up to attempt authentication every time and fails if no token is provided.
Should I open an issue to rework the code to work without a token?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh - right! So it's a robota-core thing! Thanks for pointing this out.
Yes - we do want robota-core to be useful for analysing public repositories, so making an issue that requests that feature would be great.
We can also add an issue for creating a test for the diff functionality, and I can set this thread to be resolved once that's done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I should have said that we can still write the test for the diff, as we can add a config file with the tokens needed to authenticate, with the secrets being set locally to run the tests in the development environment and through a CI secret for running the tests through GH Actions.)
robota_core/repository.py
Outdated
""" | ||
def __init__(self, branch, source: str): | ||
def __init__(self, branch, source: str, project_url: str = None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see why you've added this optional parameter - as a way to create full URLs from the branch name alone.
I'm wondering whether we actually need this, though? We create the RoboTA branch object from the JSON returned from the hosted project (whether from GitHub or GitLab) and this should include a full link to the artefact requested, which I think you can just take and use in the creation of the RoboTA branch object.
For example, see the JSON returned from the GitLab Branches API:
https://docs.gitlab.com/ee/api/branches.html#get-single-repository-branch
Notice in the example response the web_url
parameter.
The GitHub response has the same:
https://docs.gitlab.com/ee/api/branches.html#get-single-repository-branch
but under "_links"/"html" in this case.
Do you think we can eliminate the need for the project_url
parameter given this? Obviously, we won't have it for the local repo branches, but then we don't try to provide a URL for those anyway. But it would be helpful if we could get rid of the need for the parameter, not least because I'm not sure it would always be provided with the code as it currently is.
(I suspect this same fix might be able to be applied to other artefacts to. But let's not try to do that in this PR. You've waited long enough for this review. Let's get this done first! :-) )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
project_url
here is used to pass down the web_url
parameter received from the server! I made it optional so that if it isn't provided, which would be the case for a local repo, url
is simply set to None
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but the project URL is already embedded in the web_url attribute of the specific objects we are retrieving, and is already in the right format. So the work you are currently doing on line 51 of your version of repository.py
is not needed, because the full URL (including the project URL) is already made for you. E.g. see the JSON from:
https://gitlab.com/api/v4/projects/60747754/repository/branches/main
which is:
{
"name": "main",
"commit": {
"id": "61ba593d2ce0884dfe7516dd4253b3d51a0543dd",
"short_id": "61ba593d",
"created_at": "2024-08-10T23:23:13.000+01:00",
"parent_ids": [
"273d16fd7bd370af2cb74609b3f899d50e3235b7"
],
"title": "Provide more informative README",
"message": "Provide more informative README\n",
"author_name": "XXX",
"author_email": "XXX",
"authored_date": "2024-08-10T23:23:13.000+01:00",
"committer_name": "XXX",
"committer_email": "XXX",
"committed_date": "2024-08-10T23:23:13.000+01:00",
"trailers": {
},
"extended_trailers": {
},
"web_url": "https://gitlab.com/robota-suite/test-public-repo/-/commit/61ba593d2ce0884dfe7516dd4253b3d51a0543dd"
},
"merged": false,
"protected": true,
"developers_can_push": false,
"developers_can_merge": false,
"can_push": true,
"default": true,
"web_url": "https://gitlab.com/robota-suite/test-public-repo/-/tree/main" <-- HERE
}
This last component of the JSON (the web_url
attribute) contains the URL ready made, in the same form as the link your code is constructing in line 51 (shown just after my comment).
I'm suggesting we use this URL, and then we don't need anyone to pass the project's URL to this bit of the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies, I understand now. This works fine for Gitlab, but Github does not have a web_url
equivalent on the 'list all branches' endpoint. _links.html
is only returned when querying a single branch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we using the "list all branches" endpoint here? If so, we can just issue another call to get the information for the specific named branch. That should have the information needed, I think?
self.url = f"{project_url}/-/tree/{branch.attributes['name']}" | ||
else: | ||
self.url = None | ||
|
||
def _branch_from_dict(self, branch: dict): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Creation of a branch object from a caller provided dictionary is a feature we use to facilitate testing of the various RoboTA Core code. It allows us to quickly create an artefact (as though it was returned from some hosted project but without the bother of making that actually happen) to test the later processing with it.
I guess we should allow the dictionary to optionally include a URL for the branch? But if we do this, we should probably make the same change for all the _X_from_dict
methods, rather than just doing it for one? So probably better to leave that for another pull request? One or the other of us should make an issue about that, as a reminder. Happy to do it, unless you prefer to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I'll remove this and make and raise an issue.
robota_core/repository.py
Outdated
@@ -107,35 +127,41 @@ def _event_from_dict(self, event_data: dict): | |||
class Diff: | |||
"""A representation of a git diff between two points in time for a single | |||
file in a git repository.""" | |||
def __init__(self, diff_info, diff_source: str): | |||
def __init__(self, diff_info, diff_source: str, diff_url: str = None, page: int = 1): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As with branches, if we can get the URL for the diff from the artefact returned from the host provider, that would be great. I think RoboTA Core at the moment only returns diffs from comparisons between two commits? If so, both GitHub and GitLab provide a diff URL in their JSON responses.
If we sometimes use this Diff class to talk about individual commit diffs, I think we can still get a URL from the provider, but it may be a little trickier, as sometimes the URL needs the context from which the Diff was extracted.
I think for this artefact it may be better to assume that a url
key might exist within the diff_info object. If it is present, then we use its value to populate the url
attribute of the Diff
instance and if it is not we leave it blank? This is just the same as what you currently have, except that the optional URL information is passed in the diff_info dict instead of as a parameter?
I can then look at the other RoboTA tools (still on our local GitLab instance for now) and see if we can get this url
key/value pair added in where the information is available?
What do you think?
This should not be a breaking change because we don't insist that the diff_info
dict has such a key/value
robota_core/repository.py
Outdated
|
||
if diff_source == "gitlab": | ||
self._diff_from_gitlab(diff_info) | ||
self._diff_from_gitlab(diff_info, diff_url=diff_url, page=page) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you let me know why the page info is needed here but not for other artefacts? I'm sure you had a good reason for adding it. But since it is different from the other artefacts, it would be useful to know what that was. Thanks!
(No problem if you wrote this PR so long ago that you've now forgotten why you did it. That is my fault for how long it's taken me to get to this.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is because Gitlab paginates long diffs but Github doesn't.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh - thanks. That's useful to know.
Since the point of robota-core is to give a common interface to hosted Git projects, hiding differences like this between the providers, I think it would be better if we handled the paging entirely within this method and didn't expose the paging to callers.
You can ask GitLab to provide the full result, without paging. I think it would be better to use that form of the API call, so that the caller gets the full result regardless of whether they are looking at commit diffs hosted in GitLab or GitHub, even at the risk of the result being too big for the memory in some cases. Does that sound do-able?
robota_core/repository.py
Outdated
@@ -147,6 +173,9 @@ def _diff_from_github(self, diff_info: github.File.File): | |||
self.new_file = False | |||
self.diff = diff_info.patch | |||
|
|||
if diff_url is not None: | |||
# no way to link to the exact file in github UI |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the helpful comment here - do you mean that the GitHub page for a diff doesn't include anchors for the individual files on the page describing the diff of a commit?
I think it does allow linking to diffs on specific files. Here's an example branch comparison on the RoboTA-Core repo:
We can link to the pyproject.toml changes with this link:
master...develop#diff-50c86b7ed8ac2cf95bd48334961bf0530cdc77b5a56f852c5c61b89d735fd711
The same applies to diffs for individual commits. For example, see this RoboTA Core commit:
The commit changes two files and the links to the specific changes in each file are:
e5d9eec#diff-1ae2d250ab6fedfba4e417836ce1d6b32bc32a8da57f37f8dee5899719851fdb
e5d9eec#diff-da145de666e658b90b4b5568fbc5d8022fdc84dc6e1cb6cbb13b9337c5fb74ec
In each case, the diff of each file seems to have its own hash. I'm not sure where these hashes come from, though. We'd need to investigate that a bit further. Hopefully, they are buried somewhere in the JSON returned about the diff?
For now, I suggest adjusting the comment to reflect this possibility (if you agree it is one) and then making an issue describing the possible future change to put these links in place? That way, this PR isn't blocked until we figure this one out. To be honest, we don't make much use of diffs in the other RoboTA Suite tools at the moment, so I think these links are a lower priority.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The hashes are SHA256 of the file names. I remember there being some edge case where the anchors did not work, however, and I had to drop the entire thing. I'll try to figure out what the issue was, probably something to do with very large diffs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh great - so we can get the hashes. That's super useful to know, thanks.
If you can manage to get back to those edge cases, that would be good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does clicking on the pyproject.toml
link direct you to the file? When I click it, it directs me to the commits tab rather than the file. Same with other files I've tried. The two commit links work fine.
robota_core/repository.py
Outdated
@@ -393,7 +422,8 @@ def get_file_contents(self, file_path: str, branch: str = "master") -> Union[byt | |||
|
|||
def compare(self, point_1: str, point_2: str) -> List[Diff]: | |||
comparison = self.repo.compare(point_1, point_2) | |||
return [Diff(diff, "github") for diff in comparison.files] | |||
diff_url = f'{self.project_url}/compare/{point_1}...{point_2}#files_bucket' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Following on from the comments on getting the whole URL from the JSON provided by the hosting site, we have the same possibility here, I think. The GitHub comparison API contains a couple of links that can be used to get to the diff created by comparing two commits. See the example response at:
https://docs.github.com/en/rest/commits/commits?apiVersion=2022-11-28#compare-two-commits
robota_core/repository.py
Outdated
robota_diffs = [Diff(diff, "gitlab") for diff in gitlab_diffs["diffs"]] | ||
|
||
# base url for the diff | ||
diff_url = f'{self.project_url}/-/compare/{point_1}...{point_2}' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The GitLab API returns a web_url
key/value pair in the JSON from the compare API call. See the example response at:
https://docs.gitlab.com/ee/api/repositories.html#compare-branches-tags-or-commits
Using this will insulate the code against changes GitLab may make in its API JSON formats (which does happen - there was a big change last summer when we moved to a version of GitLab that added an extra component into almost all URLs, which caused me a lot of headaches where we did not use the provided URL and was easy as pie where we did).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue with web_url
is that it links to a page with the raw command output instead of the UI. Example.
It's all good if this is acceptable. It would save you from a lot of head-scratching in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah ... the keys in the GitLab JSON and the GitHub JSON vary quite a bit. For the GitHub compare commits JSON, you need to use the "html_url" key value pair.
robota_core/repository.py
Outdated
diff_url = f'{self.project_url}/-/compare/{point_1}...{point_2}' | ||
|
||
# we calculate the page in the diff that this particular file will be on | ||
robota_diffs = [Diff(diff, "gitlab", diff_url=diff_url, page=(i // GITLAB_DIFF_FILES_PER_PAGE + 1)) for i, diff in enumerate(gitlab_diffs["diffs"])] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will probably come back to bite us, but we have tended to just return everything for RoboTA Core without worrying about paging. So one option would be for you to do the same here, and then we can look in a more systematic way across the code base to bring paging in properly? Happy to discuss.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Paging is only used on Gitlab diffs. Do you propose we simply link to the diff page and not worry about the anchor?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally, we would ask GitLab to return the full result without paging, and then we are guaranteed to find the anchor in what is returned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There isn't a way around the pagination on the web UI. I found an issue tracking this.
from robota_core.repository import GitlabRepository, GithubRepository | ||
|
||
class TestGithubUrls: | ||
repo = "https://github.com/robota-suite/robota-core" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was clever to use the robota-core repository itself for this as we can guarantee that it exists and has the branch named. Your approach made me think about how we should approach this kind of testing - testing again real repo data rather than just whipping up the dicts in the test. You're right that we should do some acceptance testing against real repos.
So I made a couple of test repos for this, one in GitHub and one in GitLab:
https://github.com/robota-suite/test-public-repo
https://gitlab.com/robota-suite/test-public-repo
I suggest we leave this PR as it is, but then create an issue (or issues) for moving tests like this onto the
more stable test repos. I think this should give good test coverage alongside the current dict-based tests
that we use now.
For info, @MeltedHyperion7 , there have been quite a few changes to the code base and particularly the CI since this PR was started. So the CI results may not be accurate until after the merge is made. |
Retrieve the Github and Gitlab branch URLs from the API responses instead of constructing them manually.
@suzanneEmbury I've added some changes based on our discussion. Can you have a look? |
Fixes #2
Adds a
url
field to the following artefacts:Branch
Diff