Scrape ads.txt, app-ads.txt and sellers.json #21

streitl · 2021-11-08T08:53:15Z

Background

ads.txt

ads.txt is a mechanism that lets publishers (i.e. website or content owners) specify which parties are authorized to sell their inventory (ad spaces/impressions).
Each publisher domain or subdomain can serve its own ads.txt file, which is a plain text file listing entries with 3 to 4 comma-separated fields.
For instance, here's a sneak peek at https://www.lemonde.fr/ads.txt:

#07.12.2020
# Le Monde
appnexus.com, 8253, DIRECT, f5ab79cb980f11d1
appnexus.com, 1608, RESELLER, f5ab79cb980f11d1
appnexus.com, 3500, RESELLER, f5ab79cb980f11d1
appnexus.com, 8494, RESELLER, f5ab79cb980f11d1
appnexus.com, 8499, DIRECT, f5ab79cb980f11d1
appnexus.com, 1314, RESELLER #EBTL
google.com, pub-2366164365855963, RESELLER, f08c47fec0942fa0
google.com, pub-3391936129161967, RESELLER, f08c47fec0942fa0
...

Here's a description of the different fields:

1. (mandatory) the domain name of an advertisement system
2. (mandatory) the ID of the account that the publisher uses on the advertisement system of field 1
3. (mandatory) the relationship between the publisher and the advertisement system:
   a. DIRECT: there are no intermediaries, and there is likely a contract between the publisher and the advertisement system
   b. RESELLER: the publisher has authorized some third party to control its account (field 2) at the system of field 1 and to resell its ad space
4. (optional) an ID that identifies the advertisement system within a certification authority (e.g. a TAG ID)
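As a rough illustration of the format described above, here is a minimal Python sketch of a parser for these entries. The names AdsTxtEntry and parse_ads_txt are hypothetical (they are not part of the extension), and the sketch assumes that everything after a # is a comment and that lines with fewer than three comma-separated fields can be ignored.

```python
from dataclasses import dataclass
from typing import List, Optional

VALID_RELATIONSHIPS = {"DIRECT", "RESELLER"}


@dataclass(frozen=True)
class AdsTxtEntry:
    """One ads.txt record (hypothetical helper type)."""
    ad_system: str                            # field 1
    account_id: str                           # field 2
    relationship: str                         # field 3
    cert_authority_id: Optional[str] = None   # field 4 (optional)


def parse_ads_txt(text: str) -> List[AdsTxtEntry]:
    """Parse the body of an ads.txt file into a list of entries."""
    entries: List[AdsTxtEntry] = []
    for raw_line in text.splitlines():
        # Assumption: anything after '#' is a comment.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = [field.strip() for field in line.split(",")]
        if len(fields) < 3:
            # Not a data record (e.g. a line without commas).
            continue
        relationship = fields[2].upper()
        if relationship not in VALID_RELATIONSHIPS:
            continue
        cert_id = fields[3] if len(fields) > 3 and fields[3] else None
        entries.append(
            AdsTxtEntry(fields[0].lower(), fields[1], relationship, cert_id)
        )
    return entries
```

Applied to the Le Monde excerpt above, the two comment lines are skipped, and the line ending in #EBTL yields an entry without a fourth field.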

app-ads.txt

There is a similar mechanism called app-ads.txt that allows an app owner to specify the parties authorized to sell their inventory.
Basically, the app's page on the store that distributes it (e.g. Google Play) points to a domain that hosts an app-ads.txt file with the same format as ads.txt.
For instance, on Google Play the Twitter application declares twitter.com as its domain, so its app-ads.txt is expected at https://twitter.com/app-ads.txt.

Sellers.json

Another similar mechanism is called Sellers.json, but it is published by an advertisement system and not by a publisher.
Each advertisement system domain can have a Sellers.json file that lists the publishers and intermediate exchanges that are authorized to sell their inventory through this system.
Note that this is a JSON file, and the interesting entries are those under the key sellers.
For instance, the ad exchange Xandr has this file at https://www.xandr.com/sellers.json, and it looks like this:

{
  "contact_email": "[email protected]",
  "version": "1.0",
  "identifiers": [
    {
      "name": "TAG-ID",
      "value": "f5ab79cb980f11d1"
    }
  ],
  "sellers": [
    {
      "seller_id": "74",
      "seller_type": "INTERMEDIARY",
      "domain": "pubmatic.com",
      "name": "PubMatic"
    },
    {
      "seller_id": "181",
      "seller_type": "INTERMEDIARY",
      "domain": "google.com",
      "name": "Google AdExchange"
    },
    {
      "seller_id": "226",
      "seller_type": "INTERMEDIARY",
      "domain": "microsoft.com",
      "name": "Microsoft Media Network"
    },
    ...
  ]
}
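To make the structure above concrete, here is a small Python sketch that loads a Sellers.json document and indexes the entries under the sellers key by their seller_id. The function name index_sellers is hypothetical, and the sketch assumes the input is complete, valid JSON (the excerpt above is truncated with "..." and would not parse as-is).

```python
import json
from typing import Dict


def index_sellers(sellers_json_text: str) -> Dict[str, dict]:
    """Return the entries under the 'sellers' key, keyed by seller_id."""
    document = json.loads(sellers_json_text)
    index: Dict[str, dict] = {}
    for seller in document.get("sellers", []):
        seller_id = seller.get("seller_id")
        if seller_id is None:
            # Skip malformed entries rather than failing the whole file.
            continue
        index[str(seller_id)] = seller
    return index
```

With such an index, looking up the account ID found in an ads.txt entry (field 2) becomes a plain dictionary access.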

So in summary, Sellers.json improves transparency for the advertisement systems and helps prevent fraud, while ads.txt protects the inventory of the publishers.

Idea

The extension could benefit a lot from retrieving ads.txt information from the websites visited by the user, and Sellers.json information from the advertisement systems that bid on the observed ads.

First, doing this could allow us to verify (audit) whether the constraints of these specifications are respected (i.e. nobody is selling ad spaces that they are not allowed to).
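One possible shape for such an audit, as a hedged Python sketch: it assumes the ads.txt entries of a publisher are already available as (ad system, account ID, relationship) tuples and the Sellers.json of one advertisement system as a dict keyed by seller_id. The function name audit_entries is hypothetical, and the pairing of DIRECT with seller type PUBLISHER and RESELLER with INTERMEDIARY (BOTH being accepted for either) is an assumption about what counts as consistent, not a rule stated in this issue.

```python
from typing import Dict, Iterable, List, Tuple

# Assumed consistency rule between ads.txt field 3 and sellers.json seller_type.
EXPECTED_SELLER_TYPES = {
    "DIRECT": {"PUBLISHER", "BOTH"},
    "RESELLER": {"INTERMEDIARY", "BOTH"},
}


def audit_entries(
    ad_system: str,
    ads_txt_entries: Iterable[Tuple[str, str, str]],
    sellers_by_id: Dict[str, dict],
) -> List[str]:
    """Return human-readable findings for entries that point at ad_system."""
    findings: List[str] = []
    for entry_system, account_id, relationship in ads_txt_entries:
        if entry_system.lower() != ad_system.lower():
            continue  # entry concerns another advertisement system
        seller = sellers_by_id.get(account_id)
        if seller is None:
            findings.append(
                f"{account_id}: listed in ads.txt but missing from sellers.json"
            )
            continue
        seller_type = str(seller.get("seller_type", "")).upper()
        expected = EXPECTED_SELLER_TYPES.get(relationship.upper(), set())
        if seller_type not in expected:
            findings.append(
                f"{account_id}: ads.txt says {relationship}, "
                f"sellers.json says {seller_type}"
            )
    return findings
```

An empty result means no inconsistency was found under these assumptions; it does not prove that the supply chain is legitimate.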

Also, it could be interesting to keep track of the modifications to these files, as they could reveal interesting insights, e.g. a publisher that stops selling ad spaces through a specific advertisement system after it realizes that some of the served ads are from the far-right.
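Tracking modifications could be as simple as comparing two snapshots of the same file. The following Python sketch assumes each snapshot is a collection of (ad system, account ID, relationship) tuples; the function name diff_snapshots is hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

Entry = Tuple[str, str, str]  # (ad system, account ID, relationship)


def diff_snapshots(old: Iterable[Entry], new: Iterable[Entry]) -> Dict[str, Set[Entry]]:
    """Report which entries appeared or disappeared between two snapshots."""
    old_set, new_set = set(old), set(new)
    return {
        "added": new_set - old_set,
        "removed": old_set - new_set,
    }
```

A change of relationship for the same account (e.g. RESELLER becoming DIRECT) shows up as one removed and one added entry.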

Another idea is to use the "topology" information from ads.txt to better understand the relationship between ad price and user targeting.

More ideas will be added later.


loleg · 2021-11-10T22:07:08Z

Some quick feedback, without knowing too much about this system, or even industry terminology like "inventory" or "ad exchange". This issue is a collection of ideas; to better define the work, I would break it down into issues with specific tasks as checklists. You could also publish this as a concept document, and refer to its sections from within the issues. Otherwise the description is clear and readable. With some well-defined tasks, I could see this type of project being easily distributed among multiple developers or even crowdsourced in its execution.

The idea makes it seem like the scraping would need to be done simultaneously. However, more pragmatically you could aggregate the data from a set of publishers, then a set of advertisers, then cross-reference them after the fact. In general, scraping should avoid unnecessary repeat visits by caching the data collected, and refreshing as often as needed (which in this case is probably not very often).
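The caching idea can be sketched in Python as a small wrapper that only re-fetches a URL once its cached copy is older than a chosen refresh interval. The class name CachedFetcher, the injected fetch callable and the max_age_seconds parameter are all hypothetical; the actual download function is left to the caller, so the sketch makes no assumption about the HTTP library used.

```python
import time
from typing import Callable, Dict, Tuple


class CachedFetcher:
    """Wrap a fetch function so each URL is refreshed at most every max_age_seconds."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        max_age_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._max_age = max_age_seconds
        self._clock = clock
        self._cache: Dict[str, Tuple[float, str]] = {}  # url -> (fetch time, body)

    def get(self, url: str) -> str:
        now = self._clock()
        cached = self._cache.get(url)
        if cached is not None and now - cached[0] < self._max_age:
            return cached[1]  # still fresh: no repeat visit
        body = self._fetch(url)
        self._cache[url] = (now, body)
        return body
```

Since these files probably change rarely, a refresh interval of a day or more would likely be enough; the right value is a judgment call.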

It's interesting that you use topology to describe an idea. The web is a graph of nodes and relationships, so I could see how you might use graph-based databases here, like neo4j (medium post, moma example) or tigergraph (tor example). So, contrary to what I wrote above, it might be more effective not to distinguish too much between publishers and advertisers. Just create nodes and identify what files they serve to classify their data appropriately. You can disaggregate after the fact.
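Before committing to a graph database, the node-and-relationship view can be tried out with plain Python data structures. This sketch treats every domain as a node regardless of its role and derives labeled edges from whichever files it serves. The function name build_edges and the LISTS_SELLER label are hypothetical, and the inputs are assumed to be already parsed: ads.txt entries as (ad system, account ID, relationship) tuples and Sellers.json sellers as dicts.

```python
from typing import Dict, Iterable, Set, Tuple

Edge = Tuple[str, str, str]  # (source domain, label, target domain)


def build_edges(
    ads_txt_by_domain: Dict[str, Iterable[Tuple[str, str, str]]],
    sellers_by_domain: Dict[str, Iterable[dict]],
) -> Set[Edge]:
    """Turn parsed ads.txt and sellers.json data into a set of labeled edges."""
    edges: Set[Edge] = set()
    # A domain serving ads.txt points at the ad systems it authorizes.
    for domain, entries in ads_txt_by_domain.items():
        for ad_system, _account_id, relationship in entries:
            edges.add((domain, relationship.upper(), ad_system.lower()))
    # A domain serving sellers.json points at the sellers it lists.
    for domain, sellers in sellers_by_domain.items():
        for seller in sellers:
            seller_domain = seller.get("domain")
            if seller_domain:
                edges.add((domain, "LISTS_SELLER", seller_domain.lower()))
    return edges
```

The resulting edge set could later be loaded into a graph database or a graph library if the exploration proves useful.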

I wouldn't say that ads.txt is a mechanism: it's at best a standard. It would be good to think about not just the need to scrape the data, but also to validate it and help improve it as a service to the community. At least, this is what a good standards body like the W3C would do. Speaking of which, I believe this is a topic of discussion in the Credible Web Community.

Personally, I like the initiative here of using open web data to raise transparency and accountability. Just try to do it in a well-organized way and keep a wider perspective to avoid shooting yourself in the foot.

mvidonne · 2021-11-12T10:08:00Z

@streitlua something like http://corrupt.marketing/?

pdehaye · 2021-11-12T19:28:53Z

Yes but this is not open data.


mvidonne · 2021-11-18T08:40:07Z

@streitlua check the statistics for the top 400 French speaking websites done by the CNIL https://linc.cnil.fr/webpub-adstxt-sellersjson/ads_study.html
@fquellec for the dataviz

https://github.com/LINCnil/Ads.txt-et-Sellers.json

under open licence 2.0

foucault-dumas · 2021-12-09T13:01:01Z

See also https://twitter.com/braedon/status/1468746918851272704?s=20

foucault-dumas · 2021-12-23T09:04:17Z

Another interesting tool: https://sellers.guide/

mvidonne · 2021-12-23T09:15:55Z

@ffsinger is there a way to save properly all the info https://sellers.guide/domain/wikistrike.com

ffsinger · 2021-12-23T10:24:43Z

I haven't been following the discussions on these issues. Would you like to save the analysis result for a specific domain or a list of domains? Which results specifically? In a computer-readable or human-readable format? For what purpose?

Depending on the answer to these questions, we could build a (more or less complex) scraper.

foucault-dumas · 2022-01-12T18:08:18Z

sellers.guide just added an ads.txt cleaner on their page. I just attended an interesting webinar by them with @mvidonne. She summarized it here.

I don't know how to answer @ffsinger's question. The idea is to use ads.txt and sellers.json to add live knowledge to the knowledge harvested on ads by AdRadar. For example, display (and explain) the links between the intermediary identified as the winning bidder for an ad and other actors of the adtech ecosystem. Can ads.txt and sellers.json (and sellers.guide) be used to do so?

Edit: sellers.guide's slides

pdehaye · 2022-01-12T21:47:21Z

Looks like it was an interesting seminar!

I am thinking that this needs a strategy session involving @mvidonne, @foucault-dumas and myself, before involving developers more? Along the lines of "AdRadar needs to evolve towards helping find out during regular web browsing interesting situations for which it is worth investigating further the data angle (possibly through SARs)"

foucault-dumas · 2022-01-13T14:45:22Z

Would you send us an invitation? You have the fullest agenda of all

pdehaye · 2022-01-17T08:14:53Z

done


foucault-dumas · 2022-01-17T16:26:59Z

@mvidonne also found this very interesting tool, which is like the Markup tool but more complex (and uglier)

foucault-dumas · 2022-06-22T21:07:16Z

Isn't that something we should dig into? https://github.com/InteractiveAdvertisingBureau/openrtb/blob/master/supplychainobject.md

streitl added the "enhancement" and "help wanted" labels on Nov 8, 2021
