Implement a persistently ordered list of info hashes to be consumed by the index/API (for DashMap torrent repository) #788
-
Relates to: #565

The new tracker repository implementation does not keep order for torrents, so the API would return torrents in a random order. We need to keep the order in a different (new) data structure.
-
I think we need to revisit whether we actually need the tracker API to have this pagination feature at all.

In the last meeting we decided that we didn't want a breaking change by merging this PR that implements the new DashMap torrent repository, as it would break the ordering of torrents, which was supposedly needed for the Torrust Index when it calls this API endpoint with pagination and such.

### Why do we need it

After looking into the Index, I could not actually find any place where the tracker API endpoint that lists torrents is used. Am I missing something? Do you know when the Index would have to go through all the tracker torrents @josecelano ? If the feature is not used by the tracker nor the Index, shouldn't we just drop it?

### How it currently works

Currently we keep all torrents in a `BTreeMap`, which keeps its entries ordered by key. However, even in the current implementation the torrent order is not always the same between calls to the endpoint, because torrents can be added or removed between requests and shift the pages.

### How it should work

If we somehow still need this feature for anything (it could actually still be useful for testing), I think we should just return the entire detailed torrents map in one go. Yes, it will be a big response if the torrent map is huge, but I don't see any other way to safely return all torrents, meaning without any missing ones. The torrent map is just too dynamic, even without sharding.

Thoughts @josecelano @da2ce7 ?
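For reference, a minimal sketch of how a paginated read over an ordered map behaves, assuming an `RwLock<BTreeMap>` similar in spirit to the current repository (illustrative stand-in types, not the actual tracker code):

```rust
use std::collections::BTreeMap;
use std::sync::RwLock;

type InfoHash = [u8; 20];

struct Repository {
    // BTreeMap keeps keys sorted, so iteration order is deterministic.
    torrents: RwLock<BTreeMap<InfoHash, u32>>, // u32 stands in for the real Entry
}

impl Repository {
    // Deterministic order, but pages can still shift between two calls
    // if torrents are inserted or removed in the meantime.
    fn page(&self, offset: usize, limit: usize) -> Vec<InfoHash> {
        self.torrents
            .read()
            .unwrap()
            .keys()
            .skip(offset)
            .take(limit)
            .copied()
            .collect()
    }
}

fn main() {
    let repo = Repository { torrents: RwLock::new(BTreeMap::new()) };
    repo.torrents.write().unwrap().insert([2; 20], 0);
    repo.torrents.write().unwrap().insert([1; 20], 0);
    println!("{:?}", repo.page(0, 1)); // the lowest info hash comes first
}
```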
-
Hi @WarmBeer, I think you are right. That endpoint is not used in the Index for the time being. However, I think it's a pretty useful endpoint for testing; I use it a lot. Besides, the API is supposed to be generic (for purposes we don't even know), not only consumed by the Index, so it should be designed to fulfill most of the requirements other users might have. And I think that's an important endpoint.

I don't see any problem in returning a partial view of the data. I guess that's what always happens when you do pagination. If you want, we could also add an option to return the whole list in one go.
-
During our last meeting we discussed a few potential solutions to solve this problem:

### Return the full repository of torrents without pagination

Would just return the entire torrent repository on request.

Pros:

Cons:

### Paginated and limited (current implementation)

Allows users to page through the torrent list with `offset` and `limit` parameters.

Pros:

Cons:

### Paginated and limited, but with a torrent repository persistency check

Works the same as above, but also returns a unique identifier on request that represents the current state of the torrent repository. The identifier would be compared between paginated requests to detect whether the repository changed in the meantime.

Pros:

Cons:

### Paginated and limited, persisting the insertion order of all torrent info hashes in a separate lock-free set (the solution we settled on)

Add an additional struct that will keep track of the insertion order of all torrents, even removed ones. Use a lock-free set like BoxCar for this struct so that we do not force all threads to contend for the same write lock after announcing new torrents. For the rest this works the same as the current paginated and limited implementation, except that the order of torrents will come from this new insertion-order struct instead of the torrent repository map itself. Removed torrents will be included in the JSON response, but with their value set to `null` (see the sketch after this list).

Pros:

Cons:
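A minimal sketch of that last option, assuming the `boxcar` crate for the lock-free, append-only list (the types here are illustrative stand-ins, not the actual tracker code):

```rust
use dashmap::DashMap;

type InfoHash = [u8; 20];

#[derive(Clone, Debug)]
struct Entry {
    completed: u32, // the real Entry also holds the peer list
}

struct Repository {
    torrents: DashMap<InfoHash, Entry>,
    // Lock-free and append-only: records every info hash ever announced,
    // in insertion order, and never shrinks (even after removal).
    insertion_order: boxcar::Vec<InfoHash>,
}

impl Repository {
    fn new() -> Self {
        Self {
            torrents: DashMap::new(),
            insertion_order: boxcar::Vec::new(),
        }
    }

    fn announce(&self, info_hash: InfoHash, entry: Entry) {
        if self.torrents.insert(info_hash, entry).is_none() {
            // First time this torrent is seen: remember its position.
            self.insertion_order.push(info_hash);
        }
    }

    // Pagination iterates the insertion-order list, not the map, so the
    // order is stable; removed torrents come back as None (JSON null).
    fn page(&self, offset: usize, limit: usize) -> Vec<(InfoHash, Option<Entry>)> {
        self.insertion_order
            .iter()
            .skip(offset)
            .take(limit)
            .map(|(_, hash)| (*hash, self.torrents.get(hash).map(|e| e.value().clone())))
            .collect()
    }
}

fn main() {
    let repo = Repository::new();
    repo.announce([1; 20], Entry { completed: 0 });
    repo.announce([2; 20], Entry { completed: 3 });
    repo.torrents.remove(&[1; 20]); // removal does not disturb the order
    println!("{:?}", repo.page(0, 10)); // first entry is None (removed)
}
```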
-
Hi @WarmBeer, thank you for the summary. I have a question: doesn't the new insertion-order data structure keep growing forever, since we never remove info hashes from it?
-
Hi @josecelano,

The memory will indeed always increase over time with every newly announced torrent, but we cannot remove any info hash because that would change the torrent pagination.
-
:-/ am I missing something? @WarmBeer I guess we should "clean" the data structure from time to time, or at least when we reach a size limit. I think it does not make sense to introduce a new data structure that can grow indefinitely.
-
Hi @josecelano,

Cleaning this data structure would defeat its whole purpose. We need it to persist the order of all torrents so that we can offer torrent pagination for this API endpoint.

But yeah, it is indeed a bit awkward that we now introduce a new data structure that can keep growing in memory. We could write it to disk/database instead of keeping it in memory, but that would slow down the tracker, as such an operation is very slow.

I'm back to thinking that we should just drop the pagination and return the entire torrent map at once 😅. Even though this can make the response very big and slow if you have millions of torrents, I see it as the only way to reliably send all torrents back to the client without sacrificing performance in the tracker.

Perhaps we should think about why people would need this endpoint, so that we can think of other solutions that would fit their needs, and whether returning a (in the worst case) big and slow response is a problem for them. We already know that it is useful for testing.

### Use cases

#### Testing

It is an easy way to test whether a tracker contains all the torrents you would expect it to have.

#### Scraping

External services could easily scrape all torrents from the tracker. Doing it like this is not standard protocol, however.
-
Hi @WarmBeer. I think, in the end, the problem is that the API reads the same in-memory data structures the tracker writes.

Maybe the core tracker should have its own optimized in-memory data structure, and that should not be directly connected to anything, to avoid performance problems. The API could be connected only to the persisted data. We could update the database directly from services: when we receive an announce request, we update the core tracker but also the DB.

If we use an append-friendly data structure, or even event sourcing, writes should be fast. There would only be performance problems if there were a lot of requests to the API, but you have the option to disable the API.

In the future the API could be a decoupled app that listens to events sent by the tracker. cc @da2ce7
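A rough sketch of that decoupled direction, assuming tokio channels and a hypothetical `AnnounceEvent` type (none of this is existing tracker code):

```rust
use std::collections::HashMap;
use tokio::sync::mpsc;

// Hypothetical event emitted by the core tracker on every announce.
#[derive(Debug)]
struct AnnounceEvent {
    info_hash: [u8; 20],
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::unbounded_channel::<AnnounceEvent>();

    // The API side keeps its own view, built from events, so it never
    // touches the tracker's hot in-memory structures.
    let api_view = tokio::spawn(async move {
        let mut announces_per_torrent: HashMap<[u8; 20], u64> = HashMap::new();
        while let Some(event) = rx.recv().await {
            *announces_per_torrent.entry(event.info_hash).or_insert(0) += 1;
        }
        announces_per_torrent
    });

    // The core tracker just fires events; sending on an unbounded
    // channel is cheap and never blocks the announce path.
    for i in 0..3u8 {
        tx.send(AnnounceEvent { info_hash: [i; 20] }).unwrap();
    }
    drop(tx); // close the channel so the API task finishes

    let view = api_view.await.unwrap();
    println!("API view knows about {} torrents", view.len());
}
```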
-
@josecelano @WarmBeer This is memory efficient. Yes, it is a forever-growing set; however, only a small amount of data is kept per entry.
-
Hi @WarmBeer, @da2ce7 and I were discussing it today. Supposing each entry in the insertion-order set takes 8 bytes:

100_000_000 torrents × 8 bytes = 800_000_000 bytes = 800 MB

Our current live demo server has 1 GB of RAM, but 800 MB should not be a problem for a "normal" server. Besides, if you restart the service, it starts from zero again.
-
Hi @josecelano, @da2ce7, I just realized there is another, perhaps cleaner solution. This is our current structure:

```rust
pub struct RepositoryDashmap {
    pub torrents: DashMap<InfoHash, Entry>,
    pub shard_priority_list: Vec<Mutex<VecDeque<InfoHash>>>,
}

pub struct Entry {
    pub peers: PeersList,
    pub completed: u32,
}

pub type PersistedTorrents = BoxCar<Weak<InfoHash>>;
```

What if we renamed the current `Entry` struct to `TorrentDetails` and wrapped it in a new `Entry` that also records the insertion time?

```rust
...

pub struct Entry {
    // Contains the old `Entry` struct.
    pub torrent_details: Option<TorrentDetails>,
    pub time_inserted_in_nanos: u64,
}

pub struct TorrentDetails {
    pub peers: PeersList,
    pub completed: u32,
}
```

Instead of culling the entire DashMap entry of a torrent whenever it has the lowest priority and the TorrentRepository needs extra space because it reached the memory limit, we would just drop its `TorrentDetails` (set `torrent_details` to `None`), keeping only the info hash and the insertion time.
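A minimal sketch of that eviction idea, with illustrative stand-in types (not the actual tracker code):

```rust
use std::time::{SystemTime, UNIX_EPOCH};
use dashmap::DashMap;

type InfoHash = [u8; 20];

#[derive(Clone)]
struct TorrentDetails {
    completed: u32, // the real struct also holds the peer list
}

struct Entry {
    torrent_details: Option<TorrentDetails>,
    time_inserted_in_nanos: u64,
}

fn now_nanos() -> u64 {
    SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_nanos() as u64
}

// Announcing keeps the original insertion time if the key already exists.
fn announce(map: &DashMap<InfoHash, Entry>, info_hash: InfoHash, details: TorrentDetails) {
    let mut entry = map.entry(info_hash).or_insert_with(|| Entry {
        torrent_details: None,
        time_inserted_in_nanos: now_nanos(),
    });
    entry.torrent_details = Some(details);
}

// Hitting the memory limit only drops the heavy details; the key and the
// insertion time stay behind, so the pagination order never changes.
fn evict(map: &DashMap<InfoHash, Entry>, info_hash: &InfoHash) {
    if let Some(mut entry) = map.get_mut(info_hash) {
        entry.torrent_details = None;
    }
}

fn main() {
    let map = DashMap::new();
    announce(&map, [1; 20], TorrentDetails { completed: 5 });
    evict(&map, &[1; 20]);
    assert!(map.get(&[1; 20]).unwrap().torrent_details.is_none());
}
```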
-
Hi @WarmBeer. It's not clear to me what you want to achieve. I guess you want to get rid of the secondary data structure (the `PersistedTorrents` BoxCar) so that nothing grows forever on its own.

Assuming you want to remove the BoxCar, I have some concerns: whether the API needs write access to the repository, how we would get the most recently added torrents, how the persistent order would be maintained, and what the purpose of keeping removed entries is.
-
Hi @josecelano,

This is just a more efficient implementation of the solution we settled on: it persists the insertion order without the secondary data structure.

Pros:

Cons:

### Side note

Regarding your concerns about the original proposal:

On API write access: currently no API endpoint needs write access to the torrent repository, as far as I'm aware. But even if we need it in the future, I do not see a problem. The DashMap implementation is a very scalable solution for concurrent write access to the torrent repository. The API would just be another interface that reads and modifies the torrent repository, like the current UDP and HTTP tracker client interfaces.

On getting the most recently added torrents: yes, this is true. It would be slower than grabbing the latest inserted info hash from a stack, for example.

On entries staying in the map forever: yes, but that would be no different from the original BoxCar proposal. If the memory limit is never reached, this won't be a problem at all.

On ordering: the persistent order of torrents will be decided by the original insertion time in nanoseconds that this proposal adds to the torrent entries of the torrent repository.

On why we keep removed entries: solely to persist the torrent order for the torrent API endpoint. The entry would only contain the info hash and the insertion time.
-
Hi @WarmBeer, thank you! I think this is a better solution because it drops the secondary, ever-growing data structure.

I think we could add announce and scrape endpoints to the API, but even without those endpoints the API needs to get the torrents. In the BoxCar solution, the API does not need to use the main repository for listing, because the info hashes are duplicated in the BoxCar; with your proposal, all API requests go through the main repository.

Finally, I don't see how you are going to build the pagination for the torrents endpoint, given that the DashMap itself is not ordered.
-
Hi @josecelano,

Torrents would be sorted on the fly per API request. In spaghetti Rust code, something like the sketch below.
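A minimal sketch of that on-the-fly sort, assuming the `Entry` with `time_inserted_in_nanos` proposed above (illustrative stand-in types):

```rust
use dashmap::DashMap;

type InfoHash = [u8; 20];

struct Entry {
    torrent_details: Option<u32>, // stand-in for the real TorrentDetails
    time_inserted_in_nanos: u64,
}

// Snapshot (info_hash, insertion_time) pairs, sort by insertion time,
// then apply offset/limit for the requested page.
fn page(map: &DashMap<InfoHash, Entry>, offset: usize, limit: usize) -> Vec<InfoHash> {
    let mut keys: Vec<(InfoHash, u64)> = map
        .iter()
        .map(|entry| (*entry.key(), entry.value().time_inserted_in_nanos))
        .collect();

    keys.sort_by_key(|&(_, inserted)| inserted);

    keys.into_iter()
        .skip(offset)
        .take(limit)
        .map(|(info_hash, _)| info_hash)
        .collect()
}

fn main() {
    let map = DashMap::new();
    map.insert([1; 20], Entry { torrent_details: Some(1), time_inserted_in_nanos: 20 });
    map.insert([2; 20], Entry { torrent_details: None, time_inserted_in_nanos: 10 });
    println!("{:?}", page(&map, 0, 10)); // oldest insertion first
}
```

Note the trade-off: this collects and sorts all keys on every request (O(n log n) per call), whereas the BoxCar already stores the order.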
-
Hi @WarmBeer,
That's true. I didn't remember that.
If ChatGPT and I are not wrong, that means you need to collect and sort the items on every request. In Rust, a `DashMap` does not keep its items in any particular order. However, if you need to sort the items stored in a `DashMap`, you can extract them into a `Vec` and sort that, or insert them into a `BTreeMap`, which keeps its keys sorted.
Here's a basic example of how to do this:

```rust
use dashmap::DashMap;
use std::collections::BTreeMap;

fn main() {
    // Assume the DashMap is already populated
    let dashmap = DashMap::new();
    dashmap.insert("c", 3);
    dashmap.insert("a", 1);
    dashmap.insert("b", 2);

    // Extract and sort the items
    let mut items: Vec<_> = dashmap
        .iter()
        .map(|entry| (*entry.key(), *entry.value()))
        .collect();
    items.sort_by(|a, b| a.0.cmp(&b.0)); // sorting by keys; you can sort by values or custom logic

    // Alternatively, insert directly into a BTreeMap to keep the items sorted
    let mut sorted_map = BTreeMap::new();
    for (key, value) in items {
        sorted_map.insert(key, value);
    }

    // Iterate over the sorted items
    for (key, value) in sorted_map {
        println!("{}: {}", key, value);
    }
}
```

In this example, the items are extracted from the `DashMap` into a `Vec`, sorted by key, and then (alternatively) inserted into a `BTreeMap`, which keeps them sorted automatically.

Remember, this process of sorting is not concurrent and does not benefit from the concurrent nature of `DashMap`.
-
UPDATE by @josecelano
Relates to: #784
The DashMap torrent repository implementation can only be used in production if it keeps torrents ordered.
For now, it's only used for benchmarking; this ordering needs to be implemented before it can be used in production.