Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Link queue wrapper to record occupancy information in a file #129

Conversation

constraintAutomaton
Copy link
Member

@constraintAutomaton constraintAutomaton commented Mar 26, 2024

Objective

The objective of this actor is to collect information about the link queue and to serialize it into a file. The information collected are the following:

  • The link pushed in the queue
  • The link pop in the queue
  • The timestamp of the links pushed or popped
  • The reachability criteria from which the link was acquire

It is also possible to add custom link queue properties via the IOptionalLinkQueueParameters, but for their parsing the code has to be modified.

A label for each current reachability criteria also has been added. They can optionally label each link coming out of the extract-link actors. This feature doesn't only have the purpose to document the link queue occupancy but can be use in future work for engine optimization such has link queue ordering.

The @comunica/actor-rdf-resolve-hypermedia-links-queue-wrapper-info-occupance:query-identifier property can also be used to identify the query.

Example output

Config of the wrapper actor

{
  "@context": [
    "https://linkedsoftwaredependencies.org/bundles/npm/@comunica/runner/^3.0.0/components/context.jsonld",

    "https://linkedsoftwaredependencies.org/bundles/npm/@comunica/actor-rdf-resolve-hypermedia-links-queue-wrapper-info-occupancy/^0.0.0/components/context.jsonld"
  ],
  "@id": "urn:comunica:default:Runner",
  "@type": "Runner",
  "actors": [
    {
      "@id": "urn:comunica:default:rdf-resolve-hypermedia-links-queue/actors#wrapper-info-occupancy",
      "@type": "ActorRdfResolveHypermediaLinksQueueWrapperInfoOccupancy",
      "beforeActors": { "@id": "urn:comunica:default:rdf-resolve-hypermedia-links-queue/actors#fifo" },
      "mediatorRdfResolveHypermediaLinksQueue": { "@id": "urn:comunica:default:rdf-resolve-hypermedia-links-queue/mediators#main" },
      "filePath": "./link_queue_info.json"
    }
  ]
}

Config of the engine

{
    "@context": [
        "https://linkedsoftwaredependencies.org/bundles/npm/@comunica/config-query-sparql/^2.0.0/components/context.jsonld",
        "https://linkedsoftwaredependencies.org/bundles/npm/@comunica/config-query-sparql-link-traversal/^0.0.0/components/context.jsonld",
        "https://linkedsoftwaredependencies.org/bundles/npm/@comunica/actor-extract-links-predicates/^0.0.0/components/context.jsonld",
        "https://linkedsoftwaredependencies.org/bundles/npm/@comunica/actor-extract-links-solid-type-index/^0.0.0/components/context.jsonld",
        "https://linkedsoftwaredependencies.org/bundles/npm/@comunica/actor-extract-links-quad-pattern-query/^0.0.0/components/context.jsonld",
        "https://linkedsoftwaredependencies.org/bundles/npm/@comunica/runner/^3.0.0/components/context.jsonld"
    ],
    "import": [
        "ccqslt:config/config-solid-base.json",
        "ccqslt:config/rdf-resolve-hypermedia-links-queue/actors/wrapper-info-occupancy.json"
    ],
    "@graph": [
        {
            "@id": "urn:comunica:default:Runner",
            "@type": "Runner",
            "actors": [
                {
                    "@id": "urn:comunica:default:extract-links/actors#predicates-common",
                    "@type": "ActorExtractLinksPredicates",
                    "checkSubject": false,
                    "predicateRegexes": [
                        "http://www.w3.org/2000/01/rdf-schema#seeAlso",
                        "http://www.w3.org/2002/07/owl##sameAs",
                        "http://xmlns.com/foaf/0.1/isPrimaryTopicOf"
                    ],
                    "labelLinksWithReachability":true
                },
                {
                    "@id": "urn:comunica:default:extract-links/actors#predicates-ldp",
                    "@type": "ActorExtractLinksPredicates",
                    "checkSubject": true,
                    "predicateRegexes": [
                        "http://www.w3.org/ns/ldp#contains"
                    ],
                    "labelLinksWithReachability":true
                },
                {
                    "@id": "urn:comunica:default:extract-links/actors#predicates-solid",
                    "@type": "ActorExtractLinksPredicates",
                    "checkSubject": true,
                    "predicateRegexes": [
                        "http://www.w3.org/ns/pim/space#storage"
                    ],
                    "labelLinksWithReachability":true
                },
                {
                    "@id": "urn:comunica:default:extract-links/actors#quad-pattern-query",
                    "@type": "ActorExtractLinksQuadPatternQuery",
                    "labelLinksWithReachability":true
                },
                {
                    "@id": "urn:comunica:default:extract-links/actors#solid-type-index",
                    "@type": "ActorExtractLinksSolidTypeIndex",
                    "inference": false,
                    "mediatorDereferenceRdf": {
                        "@id": "urn:comunica:default:dereference-rdf/mediators#main"
                    },
                    "labelLinksWithReachability":true
                }
            ]
        }
    ]
}

output

{
    "iris_popped": [
        {
            "url": "http://localhost:3000/pods/00000000000000000933/",
            "reachability_criteria": "cSolidStorage",
            "timestamp": 1711439007006
        },
        {
            "url": "http://localhost:3000/www.ldbc.eu/ldbc_socialnet/1.0/data/forum00000001099511627784",
            "reachability_criteria": "cCommon",
            "timestamp": 1711439008917
        }
    ],
    "iris_pushed": [
        {
            "url": "http://localhost:3000/pods/00000000000000000933/",
            "reachability_criteria": "cSolidStorage",
            "timestamp": 1711439007006
        },
        {
            "url": "http://localhost:3000/dbpedia.org/resource/Kelaniya",
            "reachability_criteria": "cMatch",
            "timestamp": 1711439007006
        },
        {
            "url": "http://localhost:3000/www.ldbc.eu/ldbc_socialnet/1.0/data/forum00000001099511627784",
            "reachability_criteria": "cCommon",
            "timestamp": 1711439008916
        }
    ],
    "started_empty": true,
    "query": {
        "type": "project",
        "input": {
            ...
         }
    }
}

If an identifier was given for the query than the query field will be this identifier.

Future works

  • Adding the contribution of each data source to the result bag.
  • Piping the result somewhere else than a file at least giving the option.

@constraintAutomaton constraintAutomaton marked this pull request as draft March 26, 2024 08:12
@constraintAutomaton constraintAutomaton changed the title Link queue wrapper to record occupancy information to a file Link queue wrapper to record occupancy information in a file Mar 26, 2024
@@ -19,7 +27,7 @@ export class ActorExtractLinksAll extends ActorExtractLinks {
return {
links: await ActorExtractLinks.collectStream(action.metadata, (quad, links) => {
for (const link of getNamedNodes(getTerms(quad))) {
links.push({ url: link.value });
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current approach with annotateLinkWithTheReachabilityCriteria and reachabilityLabel seems a bit complicated.

Could we just do something like links.push({ url: link.value, metadata: { actorExtractLinks: this.name } })?

Copy link
Member Author

@constraintAutomaton constraintAutomaton Mar 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was not using the name because some actors such has predicate have multiple common variation (common, ldp, solidStorage) and I felt this was it was capture better. Another example would be the type index, maybe we would want to indicate that it is with inference or not or with other actor with notable properties.

Also I wanted to make the annotation optional hence why it is inside a method to not add to much if statement and since it will be same everywhere I placed it inside the abstract class.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some actors such has predicate have multiple common variation (common, ldp, solidStorage)

Perhaps those can be passed as additional metadata fields?

The main concern I have with this approach is that all actors have a new (optional) param now, and there is semantic redundancy with the name of the actor.

Copy link
Member Author

@constraintAutomaton constraintAutomaton Mar 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could create a reachability criteria object documenting the specificity of the criteria (with custom equal function for when you care about the "major" reachability type or the full reachability).

For the optional new parameters, the reason it is there is to not modify the previous behavior where the metadata was empty. I defined this new parameter in the bus/abstract class to not disturbed too much the current and future implementation. I could make it so it is always in the links.

/**
* @param args - @defaultNested {<default_bus> a <cc:components/Bus.jsonld#Bus>} bus
*/
public constructor(args: IActorExtractLinksArgs) {
super(args);
this.labelLinksWithReachability = args.labelLinksWithReachability ?? false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure we need this one. I guess we could just use the actor name this.name?

@rubensworks
Copy link
Member

Before I review in detail, could you @RubenEschauzier first have a look at this as well?
This feels somewhat related to your prioritization work.

@constraintAutomaton
Copy link
Member Author

I don't know why the CI doesn't pass on my end it does. Even the solidbench the one which is the one failing.

@rubensworks
Copy link
Member

I don't know why the CI doesn't pass on my end it does. Even the solidbench the one which is the one failing.

The failure looks unrelated to your PR.

@rubensworks
Copy link
Member

Another thought I have that would simplify this PR: Instead of writing to file, we could just write to the logger (e.g. at TRACE level). An external script could then be used to filter the log messages related to the link queue. This would make this PR work in the browser as well.

@RubenEschauzier
Copy link
Contributor

Looks pretty good, it is a much more refined version of what I used for the link queue analysis paper. I'm unsure if it has much to do with prioritization, as (as far as I can tell) it is mainly for analyzing query execution in hindsight.

@constraintAutomaton
Copy link
Member Author

Looks pretty good, it is a much more refined version of what I used for the link queue analysis paper. I'm unsure if it has much to do with prioritization, as (as far as I can tell) it is mainly for analyzing query execution in hindsight.

I think it is the part about the labeling of the links with the reachability criteria. At least for my work I plan to use those label to prioritize some links.

@constraintAutomaton
Copy link
Member Author

Another thought I have that would simplify this PR: Instead of writing to file, we could just write to the logger (e.g. at TRACE level). An external script could then be used to filter the log messages related to the link queue. This would make this PR work in the browser as well.

Yes, I think it is much better! I will be also able to provide the information in a streaming manner (instead of dumping it all at every step, but I think I will provide the option to do so). I will make another repo to parse this information has a file because it make it easy for my own investigation and debugging.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants