RealBay: A searchable torrent index
Posted: Fri Jan 09, 2015 5:54 am
Hi everyone!
I've recently been working on a piece of software called "RealBay" that indexes torrent data into publishable indexes. The idea is that you trust a list of Namecoin identities, so that you only search torrents from publishers you trust. Each identity's 500-byte record holds a DHT hash used to look up the torrent that contains the index. The index is organized so that you only have to download a small portion of it to search it (the first piece of the torrent contains a bloom filter lookup table).
Most of my effort so far has gone into testing the bloom filter indexing, the size constraints, and chaining the indexes together so that they can be searched efficiently without downloading much data, while maintaining plausible deniability of search terms. For example, if a torrent contains the word "Ubuntu", that word never actually occurs in the index at all (only a bloom filter representation of multiple words does).
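To make the "Ubuntu" example concrete, here is a minimal sketch of a word-level bloom filter in Node. It is not RealBay's actual index format: the filter size, the number of hash functions, the use of SHA-1 with double hashing, and the names `FILTER_BITS`, `HASH_COUNT`, `bitPositions`, `buildFilter` and `mightContain` are all assumptions made for illustration.

```javascript
const crypto = require('crypto');

// Hypothetical parameters; the post does not state the real sizes.
const FILTER_BITS = 1024; // filter size in bits (128 bytes)
const HASH_COUNT = 4;     // bit positions set per word

// Derive HASH_COUNT bit positions for a word using double hashing
// over two 32-bit values taken from a SHA-1 digest of the lowercased word.
function bitPositions(word) {
  const digest = crypto.createHash('sha1')
    .update(word.toLowerCase(), 'utf8')
    .digest();
  const h1 = digest.readUInt32BE(0);
  const h2 = (digest.readUInt32BE(4) | 1) >>> 0; // odd, unsigned step
  const positions = [];
  for (let i = 0; i < HASH_COUNT; i++) {
    positions.push((h1 + i * h2) % FILTER_BITS);
  }
  return positions;
}

// Build a filter from a list of words. Only bits are stored, never the words.
function buildFilter(words) {
  const filter = Buffer.alloc(FILTER_BITS / 8);
  for (const word of words) {
    for (const pos of bitPositions(word)) {
      filter[pos >> 3] |= 1 << (pos & 7);
    }
  }
  return filter;
}

// True if the word may be present (false positives are possible),
// false if it is definitely absent.
function mightContain(filter, word) {
  return bitPositions(word).every(
    (pos) => (filter[pos >> 3] & (1 << (pos & 7))) !== 0
  );
}

const filter = buildFilter(['Ubuntu', '14.04', 'desktop']);
console.log(mightContain(filter, 'ubuntu')); // true
console.log(filter.includes('Ubuntu'));      // false: the word itself is not stored
```

Because several words share one filter and lookups can return false positives, a match does not prove that any particular word was indexed, which is where the deniability would come from.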
Now I've turned my eye to the Namecoin part of the equation. There are a lot of questions:
* I see that names are registered under a prefix, such as d/ and u/, and I think it would be a good idea to discuss this with others before I do much here. I would like to use the existing d/ or u/ prefixes, but I don't think my data maps well to those usages, and I don't want people wasting their time trying to resolve addresses that will never exist. I was planning on using the 500 bytes to store a 20-byte hash used to find the index on the DHT, and the remaining data to improve search lookup.
* For end users who aren't that concerned about running a "full node", could I use a DNS resolver somewhere? My idea is to let users decide on startup whether they want to download the entire Namecoin chain and software or use a remote Namecoin resolver of some kind. Obviously this may be a problem with my custom format of mapping to 20-byte hashes.
* Is this an invalid usage of Namecoin? It seems like a good usage to me. I considered using BitMessage, but it didn't seem to solve my problems as well. I don't like that I have to download the entire Namecoin chain to get a single result from it, but I think that will be solved over time and isn't fundamental to the design of Namecoin. Namecoin seems like a good replacement for DNS if the goal is to avoid censorship.
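For discussion, the record described in the first question (a 20-byte hash followed by extra search data, within 500 bytes) could be packed roughly like this. The byte layout and the names `RECORD_MAX`, `HASH_LEN`, `packRecord` and `unpackRecord` are hypothetical; nothing here is an agreed Namecoin value format.

```javascript
// Hypothetical layout: bytes 0..19 are the DHT hash, the rest is search data.
const RECORD_MAX = 500; // record size mentioned in the post
const HASH_LEN = 20;    // length of the DHT hash

// Pack a 20-byte hash plus optional extra data into one record.
function packRecord(infoHash, extra = Buffer.alloc(0)) {
  if (!Buffer.isBuffer(infoHash) || infoHash.length !== HASH_LEN) {
    throw new Error('infoHash must be a 20-byte Buffer');
  }
  if (HASH_LEN + extra.length > RECORD_MAX) {
    throw new Error('record exceeds ' + RECORD_MAX + ' bytes');
  }
  return Buffer.concat([infoHash, extra]);
}

// Split a record back into its hash and extra data.
function unpackRecord(record) {
  if (record.length < HASH_LEN || record.length > RECORD_MAX) {
    throw new Error('record has an invalid length');
  }
  return {
    infoHash: record.subarray(0, HASH_LEN),
    extra: record.subarray(HASH_LEN),
  };
}

// Usage: a dummy hash with the maximum amount of extra data.
const record = packRecord(Buffer.alloc(HASH_LEN, 0xab), Buffer.alloc(480));
console.log(record.length);                                // 500
console.log(unpackRecord(record).infoHash.toString('hex'));
```

With this layout, up to 480 bytes would remain for search data after the hash. A resolver that expects d/ or u/ style values would not understand such a record, which is part of why the prefix question matters.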
------
For anyone interested in helping, or who just wants to read the code, it is currently still private at this GitLab address:
https://gitlab.com/krisives/realbay/
The reason it is private right now is that I want to avoid people looking at an unfinished project and writing it off too early. It's currently coded in JavaScript using Node-Webkit (or just Node for the command-line publishing tools). My "last" big problem to solve is building very large indexes of millions of torrents. The indexes work fine once they are built, but the time to build them is very high.
Thanks for reading and let me know if you wish to get involved in the project before the code is released.