
I don't work at Google so I'm probably way off base, but if I were designing it I wouldn't bother distinguishing between the two types of queries.

I'd break up the indices into digestible chunks, perhaps chronologically by the year/month they were crawled, then run every query in parallel against all of those index chunks and combine the results at the end. That's infinitely scalable, and it can be tuned to hit specific response times.
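To make that concrete, here's a rough sketch of the fan-out-and-merge idea. Everything in it is made up for illustration: the `CHUNKS` layout, the `search_chunk` and `search` helpers, and the scores are all hypothetical, and threads stand in for what would really be separate machines. It's only meant to show the shape of "query every chunk, then combine", not how any real search engine does it.

```python
from concurrent.futures import ThreadPoolExecutor
import heapq

# Hypothetical index chunks keyed by crawl month.
# Each chunk maps a term to a list of (score, doc_id) hits.
CHUNKS = {
    "2023-01": {"python": [(0.9, "doc-a"), (0.4, "doc-b")]},
    "2023-02": {"python": [(0.7, "doc-c")], "golang": [(0.8, "doc-d")]},
    "2023-03": {"python": [(0.95, "doc-e")]},
}


def search_chunk(chunk, term):
    """Look up one term in a single chunk; empty list if the term is absent."""
    return chunk.get(term, [])


def search(term, k=3):
    """Query every chunk in parallel, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(CHUNKS)) as pool:
        # Scatter: one lookup per chunk.
        partials = pool.map(lambda chunk: search_chunk(chunk, term), CHUNKS.values())
        # Gather: flatten the per-chunk hit lists into one list.
        hits = [hit for partial in partials for hit in partial]
    # Combine: keep the k highest-scoring hits across all chunks.
    return heapq.nlargest(k, hits)


print(search("python"))
```

The merge step has to wait for the slowest chunk, so in this sketch the response time is bounded by whichever lookup takes longest.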

And there'd definitely be no need to set some arbitrary date cut-off; just add a few more virtual machines. I'd bet that's what Google was doing before it scaled back those machines to save money and boost profits.



That's kind of how Google works, with multiple index tiers. Look up patents by Anna Paterson to get a few clues, assuming your lawyers won't bark at you.

Still, you can't keep partial results around forever unless you want to make searches a lot more expensive by adding a lot of capacity just to deal with the buffer bloat. Each query touches at least a thousand machines. Adding "a few more virtual machines" isn't going to cut it, especially if you have to handle tens of thousands of requests per second.
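A back-of-the-envelope version of that, taking the low end of the figures above (1,000 machines per query, 10,000 queries per second) as assumptions rather than measured numbers. The helper name `shard_requests_per_second` is just for illustration.

```python
def shard_requests_per_second(queries_per_second, machines_per_query):
    """Total per-machine lookups generated each second by query fan-out."""
    return queries_per_second * machines_per_query


# Assumed low-end figures: 10,000 queries/s, each touching 1,000 machines.
total = shard_requests_per_second(10_000, 1_000)
print(f"{total:,} machine-level requests per second")
```

Under those assumptions the fan-out alone comes to ten million machine-level requests every second, before any extra index chunks are added.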



