Scrape ads.txt, app-ads.txt and sellers.json #21
Comments
Some quick feedback, without knowing too much about this system, or even industry terminology like "inventory" or "ad exchange".
This issue is a collection of ideas. To better define the work, I would break it down into issues with specific tasks as checklists. You could also publish this as a concept document, and refer to sections of it from the issues. Otherwise the description is clear and readable. With some well-defined tasks, I could see this type of project being easily distributed among multiple developers or even crowdsourced in its execution.
The idea makes it seem like the scraping would need to be done simultaneously. However, more pragmatically, you could aggregate the data from a set of publishers, then a set of advertisers, then cross-reference them after the fact. In general, scraping should avoid unnecessary repeat visits by caching the data collected, and refreshing as often as needed (which in this case is probably not very often).
It's interesting that you use topology to describe an idea. The web is a graph of nodes and relationships, and I could see how you might use graph-based databases here, like neo4j (medium post, moma example) or tigergraph (tor example). So, contrary to what I wrote above, it might be more effective not to distinguish too much between publishers and advertisers. Just create nodes and identify what files they serve to classify their data appropriately. You can disaggregate after the fact.
I wouldn't say that ads.txt is a mechanism: it's at best a standard. It would be good to think about not just the need to scrape the data, but also to validate it and help improve it as a service to the community. At least, this is what a good standards body like the W3C would do. Speaking of which, I believe this is a topic of discussion in the Credible Web Community.
Personally I like the initiative here of using open web data to raise transparency and accountability. Just try to do it in a well-organized way and keep a wider perspective to avoid shooting yourself in the foot.
@streitlua something like http://corrupt.marketing/?
Yes but this is not open data.
@streitlua check the statistics on the top 400 French-speaking websites compiled by the CNIL: https://linc.cnil.fr/webpub-adstxt-sellersjson/ads_study.html (source: https://github.com/LINCnil/Ads.txt-et-Sellers.json), published under Open Licence 2.0.
See also https://twitter.com/braedon/status/1468746918851272704?s=20
Another interesting tool: https://sellers.guide/
@ffsinger is there a way to properly save all the info from https://sellers.guide/domain/wikistrike.com?
I haven't been following the discussions on these issues. Would you like to save the analysis result for a specific domain or a list of domains? Which results specifically? In a computer-readable or human-readable format? For what purpose? Depending on the answers to these questions, we could build a (more or less complex) scraper.
sellers.guide just added an ads.txt cleaner to their page, and we just saw an interesting webinar by them with @mvidonne. She summarized it here. I don't know how to answer @ffsinger's question. The idea is to use ads.txt and sellers.json to add live knowledge to what AdRadar harvests about ads. For example, display (and explain) the links between the intermediary identified as the winning bidder for an ad and other actors of the adtech ecosystem. Can ads.txt and sellers.json (and sellers.guide) be used to do so? Edit: sellers.guide's slides
Looks like it was an interesting seminar! I am thinking that this needs a strategy session involving @mvidonne, @foucault-dumas and myself, before involving developers more? Along the lines of "AdRadar needs to evolve towards helping find out during regular web browsing interesting situations for which it is worth investigating further the data angle (possibly through SARs)"
Would you send us an invitation? You have the fullest agenda of all.
done
@mvidonne also found this very interesting tool, which is like the Markup tool but more complex (and uglier).
Isn't that something we should dig into? https://github.com/InteractiveAdvertisingBureau/openrtb/blob/master/supplychainobject.md
Background
ads.txt
ads.txt is a mechanism allowing publishers (a.k.a. website or content owners) to specify who are the parties authorized to sell their inventory (ad spaces/impressions).
Each publisher subdomain can have an ads.txt file, which is just a text file having a list of entries with 3 to 4 fields.
For instance, here's a sneak peek at https://www.lemonde.fr/ads.txt:
Here's a description of the different fields:
a. DIRECT: there are no intermediaries, and there is likely a contract between the publisher and the advertisement system
b. RESELLER: the publisher has authorized some third party to control its account (field 2) at the system of field 1 and to resell its ad space
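As a minimal sketch of how such entries could be read, the parser below assumes the field order of the IAB Tech Lab ads.txt specification: (1) domain of the advertisement system, (2) the publisher's account ID at that system, (3) the relationship (DIRECT or RESELLER), and (4) an optional certification authority ID. The names `AdsTxtEntry` and `parse_ads_txt` are hypothetical, the sample lines are made up and are not taken from lemonde.fr, and variable lines such as `contact=...` are simply skipped.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class AdsTxtEntry:
    ad_system_domain: str                    # field 1
    account_id: str                          # field 2
    relationship: str                        # field 3: DIRECT or RESELLER
    cert_authority_id: Optional[str] = None  # field 4 (optional)


def parse_ads_txt(text: str) -> List[AdsTxtEntry]:
    """Parse ads.txt content into entries, ignoring comments and bad lines."""
    entries = []
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0]   # drop trailing comments
        line = line.split(";", 1)[0]       # drop extension fields, if any
        line = line.strip()
        if not line:
            continue
        fields = [field.strip() for field in line.split(",")]
        if len(fields) not in (3, 4):
            continue  # e.g. "contact=..." variables or malformed records
        domain, account, relationship = fields[:3]
        relationship = relationship.upper()
        if relationship not in ("DIRECT", "RESELLER"):
            continue
        cert = fields[3] if len(fields) == 4 and fields[3] else None
        entries.append(AdsTxtEntry(domain.lower(), account, relationship, cert))
    return entries


# Hypothetical sample, for illustration only.
SAMPLE_ADS_TXT = """\
# hypothetical ads.txt
example-exchange.test, 12345, DIRECT, abc123
reseller.test, pub-9, reseller
contact=ads@example.test
"""

for entry in parse_ads_txt(SAMPLE_ADS_TXT):
    print(entry)
```

Keeping the parser tolerant (skipping what it does not understand instead of failing) seems preferable for scraping, since real files are often hand-edited.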
app-ads.txt
There is a similar mechanism called app-ads.txt that allows an app owner to specify the parties authorized to sell their inventory.
Basically, the app page on the store that distributes it (e.g. Google Play) points to a domain that contains an app-ads.txt file with the same format as ads.txt.
So for instance, on Google Play the application Twitter specifies that its domain is twitter.com and that there is an app-ads.txt there (so at https://twitter.com/app-ads.txt).
Sellers.json
Another similar mechanism is called Sellers.json, but it is specified by an advertisement system and not by a publisher.
Each advertisement system domain can have a Sellers.json file that lists the publishers and intermediate exchanges that are authorized to sell their inventory through this system.
Note that this is a JSON file, and the interesting entries are those at the key sellers.
For instance, the ad exchange system Xandr has this file at https://www.xandr.com/sellers.json, and it looks like this:
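The Xandr excerpt itself is not reproduced here. As a hedged illustration of the general shape, the sketch below loads a made-up document and indexes the entries under the `sellers` key by `seller_id`. The field names (`seller_id`, `seller_type`, `name`, `domain`, `is_confidential`) follow the IAB Tech Lab sellers.json specification; the sample values and the helper name `index_sellers` are hypothetical.

```python
import json

# Hypothetical sellers.json content, NOT the actual Xandr file.
SAMPLE_SELLERS_JSON = """
{
  "version": "1.0",
  "sellers": [
    {"seller_id": "12345", "name": "Example Publisher",
     "domain": "publisher.test", "seller_type": "PUBLISHER"},
    {"seller_id": "pub-9", "name": "Example Reseller",
     "domain": "reseller.test", "seller_type": "INTERMEDIARY"},
    {"seller_id": "777", "is_confidential": 1, "seller_type": "PUBLISHER"}
  ]
}
"""


def index_sellers(raw: str) -> dict:
    """Return {seller_id: seller entry} from sellers.json text."""
    document = json.loads(raw)
    return {
        str(seller["seller_id"]): seller
        for seller in document.get("sellers", [])
        if "seller_id" in seller
    }


sellers = index_sellers(SAMPLE_SELLERS_JSON)
for seller_id, seller in sellers.items():
    # Confidential sellers may omit their name and domain.
    print(seller_id, seller.get("seller_type"), seller.get("domain", "(confidential)"))
```

Indexing by `seller_id` makes it cheap to look up the account ID found in field 2 of an ads.txt entry.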
So in summary, Sellers.json improves transparency for the advertisement systems and helps prevent fraud, while ads.txt protects the inventory of the publishers.
Idea
The extension could benefit a lot from retrieving ads.txt information from the websites visited by the user, and Sellers.json information from the advertisement systems that bid on the observed ads.
First, doing this could allow us to verify (audit) whether the constraints of these specifications are respected (i.e. nobody is selling ad spaces that they are not allowed to).
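One possible shape for such an audit, as a sketch under stated assumptions: for each ads.txt entry, check that the account ID appears in the advertisement system's Sellers.json and that the declared relationship is plausible for the listed `seller_type`. The mapping used here (DIRECT expects PUBLISHER or BOTH, RESELLER expects INTERMEDIARY or BOTH) is a heuristic assumed for illustration, not a rule confirmed by this issue, and `audit_entries` and all sample data are hypothetical.

```python
# Assumed heuristic: which seller_type values are plausible per relationship.
EXPECTED_TYPES = {
    "DIRECT": {"PUBLISHER", "BOTH"},
    "RESELLER": {"INTERMEDIARY", "BOTH"},
}


def audit_entries(ads_txt_entries, sellers_by_system):
    """Cross-check ads.txt entries against sellers.json data.

    ads_txt_entries: iterable of (ad_system_domain, account_id, relationship)
    sellers_by_system: {ad_system_domain: {seller_id: seller_type}}
    Returns a list of (ad_system_domain, account_id, problem) tuples.
    """
    findings = []
    for system, account, relationship in ads_txt_entries:
        sellers = sellers_by_system.get(system)
        if sellers is None:
            findings.append((system, account, "no sellers.json available"))
        elif account not in sellers:
            findings.append((system, account, "account missing from sellers.json"))
        elif sellers[account] not in EXPECTED_TYPES.get(relationship, set()):
            findings.append((system, account, "relationship/seller_type mismatch"))
    return findings


# Hypothetical data, for illustration only.
entries = [
    ("exchange.test", "1", "DIRECT"),
    ("exchange.test", "2", "RESELLER"),
    ("exchange.test", "3", "DIRECT"),
    ("other.test", "9", "DIRECT"),
]
sellers_by_system = {"exchange.test": {"1": "PUBLISHER", "2": "PUBLISHER"}}

for finding in audit_entries(entries, sellers_by_system):
    print(finding)
```

A finding is only a lead for further investigation: a missing or confidential Sellers.json entry does not by itself prove unauthorized selling.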
Also, it could be interesting to keep track of the modifications to these files, as they could reveal interesting insights, e.g. some publisher stops selling ad spaces through a specific advertisement system after it realizes that some of the served ads are from the far-right.
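Tracking modifications could be as simple as comparing two stored snapshots of the same file. The sketch below assumes each snapshot has already been reduced to a collection of `(ad_system_domain, account_id, relationship)` tuples; the function name `diff_snapshots` and the sample data are hypothetical.

```python
def diff_snapshots(old_entries, new_entries):
    """Return the entries added and removed between two snapshots."""
    old_set, new_set = set(old_entries), set(new_entries)
    return {
        "added": sorted(new_set - old_set),
        "removed": sorted(old_set - new_set),
    }


# Hypothetical snapshots of one publisher's ads.txt at two dates.
snapshot_before = [
    ("exchange-a.test", "100", "DIRECT"),
    ("exchange-b.test", "200", "RESELLER"),
]
snapshot_after = [
    ("exchange-a.test", "100", "DIRECT"),
    ("exchange-c.test", "300", "DIRECT"),
]

changes = diff_snapshots(snapshot_before, snapshot_after)
print(changes["added"])    # entries that appeared
print(changes["removed"])  # entries that disappeared
```

Storing snapshots with a retrieval date would also support the caching suggested in the comments, since a file only needs to be fetched again once its snapshot is considered stale.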
Another idea is to use the "topology" information from ads.txt to better understand the relationship between ad price and user targeting.
More ideas will be added later.