December 2009

A distributed search engine, IM and social network, all in one

I think it would be very useful to build a distributed search engine, or an equally distributed social network, as an alternative to Google.

First, what would I like to build? I see these goals:

search engine (many areas: file search, search with a hierarchy of relevance);
communication environment: IM (including audio and video) and social networking, including communication between people behind NAT or a proxy; the spam problem;
interest groups (similar to forums, communities or newsgroups);
blogs, hosting, wiki;
security, authentication, encryption;
web interface, so that the user does not need to install special software.

There are several options for distributed systems.

like STP, or the DR/BDR mechanism in OSPF: in effect, centralization with a self-elected center.
like Usenet or BGP: each node keeps its own copy, which under normal circumstances should be identical and kept synchronized.
like DNS: a hierarchy with several fixed servers at the top, but not all requests pass through the top.
like Jabber or email: the user is tied to a specific server, which is named in the user's address.
like Fido: the address encodes routing down to a certain level of detail (the network), and the rest is resolved locally.

I think it would be good to exclude, first, centralization and hierarchy: they severely limit scalability (when the number of nodes grows 100 times, the load on the central node also grows at least 100 times), and they make the system more vulnerable. Second, links should be built automatically; computers handle such tasks better than people do (although other options are possible). Third, the user should not be tied to a single server. The network must be link-based, like Usenet or BGP, and not like Jabber or email. Links are needed for searching and for building associative connections. Once the desired user or piece of information has been found, the nodes or users concerned exchange data directly, regardless of whether a link for the inter-node exchange protocol has been built between them.

Under these conditions, two questions immediately arise:
1. How to ensure the connectivity of the system? That is, what should happen if the links between two parts of the system break and these parts are no longer connected to each other? I think that in this case each part must remain viable without the whole, but the parts must "merge" again when possible. I think it is wise to use anycast, and also to build "distant" links, i.e. links to the most remote nodes of the graph (remote topologically, not geographically), so that the graph is balanced rather than flat.
2. How to protect against compromise of information, i.e. when someone sets up a million virtual servers that give him whatever rating he wants?

On the one hand, the user name must be unique; on the other, names like Serg78243 look ugly, so purely numeric logins, as in ICQ, are better. A domain hierarchy (as in email) binds users to servers and is not really justified: it still degenerates into a flat .com zone, or some other flat zone, where the problem of name uniqueness remains unsolved. Nothing simpler than the ICQ scheme comes to mind: a non-unique nickname plus an automatically issued unique numeric UIN. The UIN just has to be sufficiently large and random to avoid collisions, because we want two parts of such a network to be able to exist separately from each other and later merge.
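
To get a feel for what "sufficiently large" might mean, here is a minimal Python sketch. The 64-bit width, the function names and the use of the birthday approximation are my assumptions for illustration; the text above only asks for a large random number.

```python
import math
import secrets

UIN_BITS = 64  # assumed width; the post only says "sufficiently large"

def new_uin(bits: int = UIN_BITS) -> int:
    """Issue a random numeric UIN without asking any central registry."""
    return secrets.randbits(bits)

def collision_probability(users: int, bits: int = UIN_BITS) -> float:
    """Birthday approximation: p = 1 - exp(-n(n-1) / 2^(bits+1))."""
    return -math.expm1(-users * (users - 1) / 2.0 ** (bits + 1))
```

Under this approximation, a billion users with 64-bit UINs would already have a chance of a few percent that some two of them collide, while with 128 bits the chance is negligible. The price is a longer number to type, so the width is a trade-off rather than a fixed answer.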
Where and how to store information about the user (password, contact list, last-read markers, blog, etc.)? Obviously on some nodes, and which nodes should be determined by the user's login: either from his UIN (some hash of it) or from cookies. That is, if the cookies do not hold this information (for example, the user has logged in from a new location), then crc16(uin) identifies a group of nodes that know about all users with that hash, in the sense that they know which nodes are responsible for which users. Each node can join any group, and all nodes know the top level of the hierarchy (65536 groups) and receive updates about changes at this level. Of course, the number 65536 is plucked from thin air; the number of hierarchy levels and the branching factor can vary, and should be adjusted automatically based on the total number of nodes in the network, the resources required and the reliability needed. The user does not know which node stores information about him; the information wanders among the nodes and exists "abstractly in the network". When a message is sent to a user who is offline, it waits on the sender's node. Thus the user knows that as long as his favorite server is working properly, his message will not be silently lost.
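
The crc16(uin) lookup could be sketched roughly like this in Python. The `Directory` class and its method names are hypothetical stand-ins for the top-level group table, and I am assuming the CRC is taken over the decimal digits of the UIN using the CRC-CCITT variant that the standard `binascii.crc_hqx` provides; any 16-bit hash would do.

```python
import binascii

GROUP_COUNT = 65536  # the top-level group count mentioned above

def group_of(uin: int) -> int:
    """Map a UIN to one of the directory groups via a 16-bit CRC."""
    data = str(uin).encode("ascii")  # assumption: hash the decimal digits
    return binascii.crc_hqx(data, 0) % GROUP_COUNT

class Directory:
    """Toy in-memory stand-in for the top-level group table (hypothetical)."""

    def __init__(self) -> None:
        self.members: dict[int, set[str]] = {}  # group id -> node names

    def join(self, node: str, group: int) -> None:
        """A node volunteers to serve the given group."""
        self.members.setdefault(group, set()).add(node)

    def nodes_for(self, uin: int) -> list[str]:
        """Nodes that know which servers hold this user's data."""
        return sorted(self.members.get(group_of(uin), set()))
```

A client arriving without cookies would call something like `directory.nodes_for(uin)` and then ask any of the returned nodes where the user's data currently lives.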

Searching for users by various criteria (first name, last name, city, age, etc.) closely overlaps with the task of implementing a distributed search engine (an alternative to Google). In general, I imagine an algorithm like this: the user types a query such as "short poems about love". The node that receives it performs an initial semantic analysis of the query, determines the groups of nodes responsible for the relevant categories, collects answers from them, sorts them and returns the result. To do this, it must compute some kind of CRC of the queries "short", "poems", "love", "short poems" and "poems about love", determine for each of these categories the group of nodes specializing in it, and send queries to those groups. They can in turn refine the queries down the hierarchy, passing them on to other servers and collecting the answers. Here, of course, one needs to build on existing work in the field of neural networks. In addition, each node can gradually crawl nearby (or random, or "its own") websites, index them and report the results to the servers that specialize in the information found.
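
The fan-out step for "short poems about love" might look like the sketch below. This is only the hashing and routing part, with no semantic analysis at all. The stop-word list, the rule for choosing phrases (proper sub-phrases that start and end on a content word) and the function names are my assumptions, picked so that the example yields the five keys listed above.

```python
import binascii

STOP_WORDS = {"about", "the", "a", "of", "and"}  # assumed, illustrative only

def query_keys(query: str) -> list[str]:
    """Split a query into single terms and multi-word sub-phrases."""
    words = query.lower().split()
    keys = [w for w in words if w not in STOP_WORDS]
    # Contiguous phrases shorter than the whole query that neither
    # start nor end on a stop word.
    for n in range(2, len(words)):
        for i in range(len(words) - n + 1):
            phrase = words[i:i + n]
            if phrase[0] not in STOP_WORDS and phrase[-1] not in STOP_WORDS:
                keys.append(" ".join(phrase))
    return keys

def route(query: str, groups: int = 65536) -> dict[str, int]:
    """Map each key to the id of the node group assumed to specialize in it."""
    return {key: binascii.crc_hqx(key.encode("utf-8"), 0) % groups
            for key in query_keys(query)}
```

The receiving node would then send each key to its group, merge the answers and sort them; ranking and deeper drill-down are left out of the sketch.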

Spam. Here, it seems to me, everything is simple. I want to receive messages from those who are somehow connected to me, who found me in some way: through mutual friends, through common interests, or through some other connection. Then it is not spam. If the person who sent me a message is not connected to me in any way, it is spam. Of course, spam is not a binary concept: a message can be spam to a greater or lesser degree. The more interests a person has, the less weight each of them carries. The longer the chain of acquaintances, the less trust the message deserves. It is even better if the first message (or link request) explicitly specifies the path by which the sender found me. For example, if I wrote a utility and give my UIN as the contact for it, I create a community (interest) for the utility. A person who wants to contact me must first declare an interest in this utility and then connect to me, pointing to it as our common interest. Then it will be clear that this is not spam, and how the person got my contact. Information about a person's interests should not be publicly available. How can something be stored in an open distributed system and yet not be publicly accessible? Very simply: it should be encrypted with a user-defined password and disclosed only when it is requested.
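
One possible way to turn "spam to a greater or lesser degree" into a number is sketched below in Python. The scoring rule is my assumption, not something fixed above: trust starts at 1 with the recipient, is divided by the number of links at every hop (so a node with many links lends less weight to each), and the best chain of at most `max_hops` links wins.

```python
def trust(graph: dict[str, set[str]], recipient: str, sender: str,
          max_hops: int = 4) -> float:
    """Best trust in [0, 1] over chains of at most max_hops links."""
    frontier = {recipient: 1.0}
    best = 1.0 if sender == recipient else 0.0
    for _ in range(max_hops):
        reached: dict[str, float] = {}
        for node, score in frontier.items():
            links = graph.get(node, set())
            for peer in links:
                # Each hop splits the trust evenly among the node's links.
                share = score / len(links)
                if share > reached.get(peer, 0.0):
                    reached[peer] = share
        best = max(best, reached.get(sender, 0.0))
        frontier = reached
    return best

def spam_score(graph: dict[str, set[str]], recipient: str, sender: str) -> float:
    """0.0 means clearly connected, 1.0 means no connection at all."""
    return 1.0 - trust(graph, recipient, sender)
```

Here `graph` maps each user (or interest community) to the set of users and communities it is linked to. A sender with no chain to the recipient scores 1.0, and a shared interest can be modelled as one more node in the graph.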

How to deal with malicious participants, that is, with nodes that join the network but run a slightly different algorithm, falsifying search results (inflating ratings) or using confidential information about users (including passwords) for their own purposes? Given the random distribution of users among nodes, an attacker cannot know in advance which users his node will serve. He cannot request too much, nor serve everyone all the time: the more congested his node is, the fewer requests it will be given. If he does not return the same information as the parallel nodes (those specializing in the same users and the same topics), his credibility falls. And the passwords of random users are unlikely to be of interest. Besides, if we do not restrict ourselves to HTTP, we can use DSA authentication (or another key pair), so that even a node that authenticates the user never gets his password and cannot act on his behalf. In general, if server reputation is evaluated, inflating a rating without losing reputation will be no easier than creating a wave of LiveJournal posts from pseudo-users that promote or discredit certain sites or companies. (I'm not saying it's impossible.)
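
The idea of comparing a node's answer with those of its parallel nodes could be sketched as follows. The function name, the starting reputation of 0.5 and the reward and penalty values are all assumptions chosen for illustration.

```python
from collections import Counter

DEFAULT_REPUTATION = 0.5  # assumed starting value for unknown nodes

def cross_check(answers: dict[str, str], reputation: dict[str, float],
                reward: float = 0.05, penalty: float = 0.5) -> str:
    """Pick the reputation-weighted majority answer and update reputations."""
    if not answers:
        raise ValueError("no answers to compare")
    weight: Counter = Counter()
    for node, answer in answers.items():
        weight[answer] += reputation.get(node, DEFAULT_REPUTATION)
    winner = weight.most_common(1)[0][0]
    for node, answer in answers.items():
        rep = reputation.get(node, DEFAULT_REPUTATION)
        if answer == winner:
            reputation[node] = min(1.0, rep + reward)  # slow to earn
        else:
            reputation[node] = rep * penalty           # quick to lose
    return winner
```

Weighting by reputation does not by itself stop someone who controls most of the parallel nodes for a topic; it only makes disagreement with honest replicas costly.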

Of course, the client interface may differ from node to node. Somewhere it is a web page, with or without AJAX; somewhere a plugin; somewhere a dedicated client... The user can choose whichever he prefers for reasons of convenience and reliability. Hence, healthy competition.

What are your thoughts on this? I clearly lack knowledge of neural network theory. I would be grateful for comments and advice.
