Scrape ads.txt, app-ads.txt and sellers.json #21

Open
streitl opened this issue Nov 8, 2021 · 14 comments
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments

@streitl
Contributor

streitl commented Nov 8, 2021

Background

ads.txt

ads.txt is a mechanism that lets publishers (a.k.a. website or content owners) specify which parties are authorized to sell their inventory (ad spaces/impressions).
Each publisher subdomain can have an ads.txt file, which is a plain-text file listing entries of 3 to 4 comma-separated fields.
For instance, here's a sneak peek at https://www.lemonde.fr/ads.txt:

#07.12.2020
# Le Monde
appnexus.com, 8253, DIRECT, f5ab79cb980f11d1
appnexus.com, 1608, RESELLER, f5ab79cb980f11d1
appnexus.com, 3500, RESELLER, f5ab79cb980f11d1
appnexus.com, 8494, RESELLER, f5ab79cb980f11d1
appnexus.com, 8499, DIRECT, f5ab79cb980f11d1
appnexus.com, 1314, RESELLER #EBTL
google.com, pub-2366164365855963, RESELLER, f08c47fec0942fa0
google.com, pub-3391936129161967, RESELLER, f08c47fec0942fa0
...

Here's a description of the different fields:

  1. (mandatory) the domain name of an advertisement system
  2. (mandatory) the identifier of the account used on the advertisement system of field 1 (not always a number, e.g. pub-2366164365855963)
  3. (mandatory) the relationship between the publisher and that account:
    a. DIRECT: the publisher directly controls the account, with no intermediaries, and there is likely a contract between the publisher and the advertisement system
    b. RESELLER: the publisher has authorized some third party to control the account (field 2) at the system of field 1 and to resell its ad space
  4. (optional) an ID identifying the advertisement system within a certification authority (e.g. the TAG-ID issued by the Trustworthy Accountability Group)
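
To make the field layout concrete, here is a minimal Python sketch of a parser for these entries. The names `AdsTxtEntry` and `parse_ads_txt` are hypothetical, and the parser is deliberately simplified: it drops comments and silently skips any line that is not a 3- or 4-field record.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class AdsTxtEntry:
    ad_system: str                            # field 1: advertisement system domain
    account_id: str                           # field 2: account on that system
    relationship: str                         # field 3: "DIRECT" or "RESELLER"
    cert_authority_id: Optional[str] = None   # field 4 (optional)


def parse_ads_txt(text: str) -> List[AdsTxtEntry]:
    """Parse the body of an ads.txt file, skipping anything that is not a record."""
    entries = []
    for raw_line in text.splitlines():
        # Everything after '#' is a comment, whether full-line or trailing.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = [field.strip() for field in line.split(",")]
        if len(fields) not in (3, 4):
            continue  # not a 3- or 4-field record
        relationship = fields[2].upper()
        if relationship not in ("DIRECT", "RESELLER"):
            continue
        cert = fields[3] if len(fields) == 4 and fields[3] else None
        entries.append(AdsTxtEntry(fields[0].lower(), fields[1], relationship, cert))
    return entries
```

Applied to the Le Monde excerpt above, the two comment lines are ignored and the `appnexus.com, 1314, RESELLER #EBTL` line yields an entry without a certification authority ID.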

app-ads.txt

There is a similar mechanism called app-ads.txt that allows an app owner to specify the parties authorized to sell their inventory.
The app's page on the store that distributes it (e.g. Google Play) points to a domain that hosts an app-ads.txt file with the same format as ads.txt.
For instance, on Google Play the Twitter application specifies that its domain is twitter.com, so its app-ads.txt is at https://twitter.com/app-ads.txt.

Sellers.json

Another similar mechanism is called Sellers.json, but it is specified by an advertisement system and not by a publisher.
Each advertisement system domain can have a Sellers.json file that lists the publishers and intermediate exchanges that are authorized to sell their inventory through this system.
Note that this is a JSON file, and the interesting entries are those under the "sellers" key.
For instance, the ad exchange system Xandr has this file at https://www.xandr.com/sellers.json, and it looks like this:

{
  "contact_email": "[email protected]",
  "version": "1.0",
  "identifiers": [
    {
      "name": "TAG-ID",
      "value": "f5ab79cb980f11d1"
    }
  ],
  "sellers": [
    {
      "seller_id": "74",
      "seller_type": "INTERMEDIARY",
      "domain": "pubmatic.com",
      "name": "PubMatic"
    },
    {
      "seller_id": "181",
      "seller_type": "INTERMEDIARY",
      "domain": "google.com",
      "name": "Google AdExchange"
    },
    {
      "seller_id": "226",
      "seller_type": "INTERMEDIARY",
      "domain": "microsoft.com",
      "name": "Microsoft Media Network"
    },
    ...
  ]
}

So in summary, Sellers.json improves transparency for the advertisement systems and helps prevent fraud, while ads.txt protects the inventory of the publishers.
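
As a rough illustration of how the "sellers" entries could be consumed, the sketch below indexes them by seller_id using only the standard library. The function name `index_sellers` is hypothetical, and the code assumes the file has already been downloaded as text.

```python
import json
from typing import Dict


def index_sellers(sellers_json_text: str) -> Dict[str, dict]:
    """Map seller_id -> seller record from the text of a Sellers.json file."""
    document = json.loads(sellers_json_text)
    index = {}
    for seller in document.get("sellers", []):
        seller_id = seller.get("seller_id")
        if seller_id is not None:
            # Normalize to str so lookups work whether the ID was quoted or not.
            index[str(seller_id)] = seller
    return index
```

With the Xandr excerpt above, `index_sellers(text)["74"]["domain"]` would give `pubmatic.com`.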

Idea

The extension could benefit a lot from retrieving ads.txt information from the websites visited by the user, and Sellers.json information from the advertisement systems that bid on the observed ads.

First, doing this could allow us to audit whether the constraints of these specifications are respected (i.e. nobody is selling ad spaces that they are not allowed to).
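
One possible form of such an audit, sketched under assumptions: every account a publisher lists in ads.txt should also appear in the Sellers.json of the corresponding advertisement system, with a compatible seller type. The function `audit_entries` and the mapping `EXPECTED_TYPES` (DIRECT with PUBLISHER, RESELLER with INTERMEDIARY, BOTH accepted for either) are illustrative choices, not something defined in this issue.

```python
from typing import Dict, Iterable, List, Tuple

# (ad system domain, account id, relationship) as found in an ads.txt entry
Entry = Tuple[str, str, str]

# Assumed mapping between ads.txt relationships and Sellers.json seller types.
EXPECTED_TYPES = {
    "DIRECT": {"PUBLISHER", "BOTH"},
    "RESELLER": {"INTERMEDIARY", "BOTH"},
}


def audit_entries(
    entries: Iterable[Entry],
    sellers_by_system: Dict[str, Dict[str, dict]],
) -> List[Tuple[Entry, str]]:
    """Return (entry, reason) pairs for entries that could not be confirmed."""
    findings = []
    for entry in entries:
        ad_system, account_id, relationship = entry
        sellers = sellers_by_system.get(ad_system)
        if sellers is None:
            findings.append((entry, "no Sellers.json data for this system"))
            continue
        seller = sellers.get(account_id)
        if seller is None:
            findings.append((entry, "account missing from Sellers.json"))
            continue
        seller_type = str(seller.get("seller_type", "")).upper()
        if seller_type not in EXPECTED_TYPES.get(relationship, set()):
            findings.append(
                (entry, f"seller_type {seller_type!r} does not match {relationship}")
            )
    return findings
```

Here `sellers_by_system` maps an advertisement system domain to its sellers indexed by seller_id. A finding is only a lead to investigate, not proof of fraud: a missing Sellers.json or a confidential seller entry would also trigger one.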

Also, it could be interesting to keep track of the modifications to these files, as they could reveal interesting insights, e.g. some publisher stops selling ad spaces through a specific advertisement system after it realizes that some of the served ads are from the far-right.
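
Tracking modifications could be as simple as comparing two snapshots of the same file. The sketch below assumes each snapshot has already been parsed into (ad system, account id, relationship) tuples; `diff_snapshots` and `dropped_ad_systems` are hypothetical names.

```python
from typing import Collection, Dict, Set, Tuple

Entry = Tuple[str, str, str]  # (ad system domain, account id, relationship)


def diff_snapshots(old: Collection[Entry], new: Collection[Entry]) -> Dict[str, Set[Entry]]:
    """Entries that appeared or disappeared between two snapshots of one file."""
    old_set, new_set = set(old), set(new)
    return {"added": new_set - old_set, "removed": old_set - new_set}


def dropped_ad_systems(old: Collection[Entry], new: Collection[Entry]) -> Set[str]:
    """Ad systems present in the old snapshot that no longer appear at all."""
    return {entry[0] for entry in old} - {entry[0] for entry in new}
```

`dropped_ad_systems` corresponds to the case described above, where a publisher stops selling through a specific advertisement system altogether.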

Another idea is to use the "topology" information from ads.txt to better understand the relationship between ad price and user targeting.

More ideas will be added later.

streitl added the enhancement (New feature or request) and help wanted (Extra attention is needed) labels on Nov 8, 2021
@loleg

loleg commented Nov 10, 2021

Some quick feedback, without knowing too much about this system, or even industry terminology like "inventory" or "ad exchange". This issue is a collection of ideas. To better define the work, I would break it down into issues with specific tasks as checklists. You could also publish this as a concept document, and refer to sections in it inside of the issues. Otherwise the description is clear and readable. With some well-defined tasks, I could see this type of project being easily distributed among multiple developers or even crowdsourced in its execution.

The idea makes it seem like the scraping would need to be done simultaneously. However, more pragmatically you could aggregate the data from a set of publishers, then a set of advertisers, then cross-reference them after the fact. In general, scraping should avoid unnecessary repeat visits by caching the data collected, and refreshing as often as needed (which in this case is probably not very often).
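
The caching idea could look roughly like the sketch below. `CachedFetcher` is a hypothetical helper: the actual download function is passed in (for example a thin wrapper around `urllib.request.urlopen`), and the refresh interval is a parameter to tune rather than a recommendation.

```python
import time
from typing import Callable, Dict, Tuple


class CachedFetcher:
    """Wrap a fetch function so that each URL is re-downloaded at most once per interval."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        max_age_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._fetch = fetch
        self._max_age = max_age_seconds
        self._clock = clock
        self._cache: Dict[str, Tuple[float, str]] = {}  # url -> (fetch time, body)

    def get(self, url: str) -> str:
        now = self._clock()
        cached = self._cache.get(url)
        if cached is not None and now - cached[0] < self._max_age:
            return cached[1]  # still fresh: no network access
        body = self._fetch(url)
        self._cache[url] = (now, body)
        return body
```

Injecting the fetch function and the clock keeps the cache testable without network access. A persistent store would be needed to keep the cache across runs.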

It's interesting that you use topology to describe an idea. The web is a graph of nodes and relationships, so I could see how you might use graph-based databases here, like neo4j (medium post, moma example) or tigergraph (tor example). So, contrary to what I wrote above, it might be more effective not to distinguish too much between publishers and advertisers. Just create nodes and identify what files they serve to classify their data appropriately. You can disaggregate after the fact.
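
Before reaching for a graph database, the node-and-relationship idea can be tried with plain dictionaries. In this sketch every domain is a node regardless of whether it is a publisher or an advertisement system, and each edge carries the relationship label taken from the file that declared it. `build_graph`, `reachable` and the `.example` domains are placeholders.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str, str]  # (source domain, relation label, target domain)
Graph = Dict[str, List[Tuple[str, str]]]


def build_graph(edges: Iterable[Edge]) -> Graph:
    """Adjacency list: domain -> [(relation label, neighbour domain), ...]."""
    graph = defaultdict(list)
    for source, label, target in edges:
        graph[source].append((label, target))
    return dict(graph)


def reachable(graph: Graph, start: str) -> Set[str]:
    """All domains that can be reached from `start` by following edges."""
    seen: Set[str] = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for _label, target in graph.get(node, []):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


edges = [
    ("publisher.example", "DIRECT", "exchange.example"),       # from an ads.txt
    ("exchange.example", "INTERMEDIARY", "reseller.example"),  # from a Sellers.json
]
graph = build_graph(edges)
```

`reachable(graph, "publisher.example")` then lists every actor connected to that publisher, which can be disaggregated by edge label afterwards.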

I wouldn't say that ads.txt is a mechanism: it's at best a standard. It would be good to think about not just the need to scrape the data, but also to validate it and help improve it as a service to the community. At least, this is what a good standards body like the W3C would do. Speaking of whom, I believe this is a topic of discussion in the Credible Web Community.

Personally I like the initiative here of using open web data to raise transparency and accountability. Just try to do it in a well-organized way and keep a wider perspective to avoid "shooting yourself in the foot".

@mvidonne
Contributor

@streitlua something like http://corrupt.marketing/?

@pdehaye

pdehaye commented Nov 12, 2021 via email

@mvidonne
Contributor

mvidonne commented Nov 18, 2021

@streitlua check the statistics for the top 400 French speaking websites done by the CNIL https://linc.cnil.fr/webpub-adstxt-sellersjson/ads_study.html
@fquellec for the dataviz

https://github.com/LINCnil/Ads.txt-et-Sellers.json

under open licence 2.0

@foucault-dumas
Contributor

@streitlua check the statistics for the top 400 French speaking websites done by the CNIL https://linc.cnil.fr/webpub-adstxt-sellersjson/ads_study.html @fquellec for the dataviz

https://github.com/LINCnil/Ads.txt-et-Sellers.json

under open licence 2.0

See also https://twitter.com/braedon/status/1468746918851272704?s=20

@foucault-dumas
Contributor

Another interesting tool: https://sellers.guide/

@mvidonne
Contributor

@ffsinger is there a way to properly save all the info at https://sellers.guide/domain/wikistrike.com?

@ffsinger

ffsinger commented Dec 23, 2021

@ffsinger is there a way to properly save all the info at https://sellers.guide/domain/wikistrike.com?

I haven't been following the discussions on these issues. Would you like to save the analysis result for a specific domain or for a list of domains? Which results specifically? In a computer-readable or human-readable format? For what purpose?

Depending on the answer to these questions, we could build a (more or less complex) scraper.

@foucault-dumas
Contributor

foucault-dumas commented Jan 12, 2022

sellers.guide just added an ads.txt cleaner to their page. We just saw an interesting webinar by them with @mvidonne. She summarized it here.

I don't know how to answer @ffsinger's question. The idea is to use ads.txt and sellers.json to add live knowledge to what AdRadar harvests about ads. For example, display (and explain) the links between the intermediary identified as the winning bidder for an ad and other actors of the adtech ecosystem. Can ads.txt and sellers.json (and sellers.guide) be used to do so?

Edit: sellers.guide's slides

@pdehaye

pdehaye commented Jan 12, 2022

Looks like it was an interesting seminar!

I think this needs a strategy session involving @mvidonne, @foucault-dumas and myself before involving developers further, along the lines of "AdRadar needs to evolve towards helping users spot, during regular web browsing, interesting situations where the data angle is worth investigating further (possibly through SARs)".

@foucault-dumas
Contributor

Looks like it was an interesting seminar!

I think this needs a strategy session involving @mvidonne, @foucault-dumas and myself before involving developers further, along the lines of "AdRadar needs to evolve towards helping users spot, during regular web browsing, interesting situations where the data angle is worth investigating further (possibly through SARs)".

Would you send us an invitation? You have the fullest agenda of all.

@pdehaye

pdehaye commented Jan 17, 2022 via email

@foucault-dumas
Contributor

sellers.guide just added an ads.txt cleaner to their page. We just saw an interesting webinar by them with @mvidonne. She summarized it here.

I don't know how to answer @ffsinger's question. The idea is to use ads.txt and sellers.json to add live knowledge to what AdRadar harvests about ads. For example, display (and explain) the links between the intermediary identified as the winning bidder for an ad and other actors of the adtech ecosystem. Can ads.txt and sellers.json (and sellers.guide) be used to do so?

Edit: sellers.guide's slides

@mvidonne also found this very interesting tool, which is like the Markup tool but more complex (and uglier)

@foucault-dumas
Contributor

6 participants