This author is in his own little bubble and doesn't understand the vast amount of blog-repost spam that Google has to deal with. The way their algorithm most likely deals with this is a mixture of domain rank + tenure... how long has this copy of this article existed on this domain, and can we be sure this is the original copy?
The author says the article was removed in 2006 ("[...] posts, were not accessible anymore") and then he re-posted the article at a new domain in 2013. That means any copy/crawl/repost of the article from 2006-2012 is now the oldest living, and thus "original", version of the article. His 2013 repost was seen as just another blog-spam copy.
Google is not forgetting the old web unless we see evidence of content disappearing from the index that has been consistently hosted at the same domain & URL since it was originally posted. Unless you properly 301 your URLs to new locations and consistently host your content, it's a guessing game for the crawler to determine where the original content has moved to.
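To make "properly 301" concrete, here is a minimal sketch of a permanent redirect, assuming a Flask app; the paths and domain are hypothetical, not from the article:

    # Minimal sketch of a permanent redirect, assuming Flask; the paths
    # and domain below are hypothetical. The point is the 301 status
    # code, which tells crawlers the content has moved for good, so the
    # new URL can inherit the old one's history.
    from flask import Flask, redirect

    app = Flask(__name__)

    @app.route("/blog/2006/my-old-post")
    def old_post():
        # 301 (not Flask's default 302) signals a *permanent* move.
        return redirect("https://newdomain.example/posts/my-old-post", code=301)

Without that signal, a crawler seeing the same article on a new domain has no machine-readable evidence that it's a move rather than a copy.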
I checked the Wayback Machine, and the content has been continuously available at that URL for over 10 years.
This is the first example of an old forum page I tried after reading the article. So I tend to think it's true. Google is discarding the "classic" web.
Anecdotally, and perhaps unrelated: has anyone else noticed a decrease in the accuracy and general quality of Google search over the past 2-4 months? They must have been using ML to 'improve' searches for some time now, but the quality of the results has decreased suddenly and inexplicably (for me).
It all started going downhill with Google's "Hummingbird" switch, to be honest. While interviewing at Google, I actually brought this up with an engineer on the search team during lunch.
He said they haven't noticed any regressions. I said I figured that would be the case but I can definitely feel the difference as a daily user.
This is indicative of a larger issue: thorough testing is probably as difficult as the halting problem (i.e. a truly complete test suite would specify the program so precisely that the code could be generated from it), yet teams tend to trust their tests completely. I see high-profile websites having severe usability issues, or being outright broken, in ways that would be immediately caught by "interns randomly click here and there" usability tests. But these versions got deployed, probably because testing did not show any regressions.
I tend to believe that if user complaints about new problems or regressions rise above statistical noise, there is a problem.
I noticed the same. I've wondered for years why it happened and sometimes when I'm frustrated I try to think about it. But I am not entirely sure that the degradation in search results for me happened only in the past 5 years. Maybe, but I'm not sure.
In 2009, Google was amazing for diagnosing Linux issues. I would just copy the error from the console, and I'd have links to the issue tracker, a workaround, and the version in which the bug was fixed. Today I get a link to some GitHub project that has nothing to do with what I'm working on and was closed as being an upstream issue.
I don't have the time, money, or energy to build a dedicated crawler, but a Linux search engine that indexed all the major distros, packages, mailing lists, forums, and issue trackers would be amazing.
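In the meantime, you can fake a small slice of this with site: operators. A sketch, with a hand-picked (purely illustrative) list of trackers and list archives:

    # Rough stand-in for that Linux search engine: OR together site:
    # filters over an illustrative list of trackers and mailing list
    # archives, then hand the query to an ordinary engine.
    from urllib.parse import quote_plus

    SITES = [
        "bugzilla.kernel.org",
        "bugs.debian.org",
        "bugzilla.redhat.com",
        "lists.debian.org",
    ]

    def linux_query(error_message: str) -> str:
        sites = " OR ".join(f"site:{s}" for s in SITES)
        return "https://duckduckgo.com/?q=" + quote_plus(f'"{error_message}" ({sites})')

    print(linux_query("ext4_mb_generate_buddy: block bitmap error"))

It's no substitute for a real index, but it cuts the blogspam domains out entirely.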
I had assumed that Google search had gone downhill because it started trying to "personalize" my search results. That wasn't a great explanation though, as I don't use a Google account.
I do a full clear on my web browser (cookies / offline storage / history, everything) and then open YouTube in a private browsing window and it asks me which of my two Gmail accounts I want to log in with. I'd guess it's just a combo of external IP and browser fingerprint, but it's creepy.
I know they do, and I consider this a real problem. I was just saying that personalization isn't a completely satisfactory explanation for the decline in Google search result quality. It is likely to be a factor in that, though.
I love the concept of DDG and have it as my default, but I still use Google (via !g) for about 30-50% of my queries. Simple queries work well in DDG (which is basically Bing), but more complicated queries only really work in Google.
Sadly I've been finding the same result. Exact searches on Google are often frustrating, but lately they've been all but impossible on DDG. It seems that all search engines (including those backing DDG) are getting on the ML train and assuming they know what I'm looking for better than I do.
I understand this being default behavior, but there really needs to be a way to disable it.
I found the opposite. I started using DDG when I moved to Brave, but after a month I found I would go to DDG, search page after page, get frustrated, then open Google and have my result on page one or two.
Personally, my impression is that for at least the past 1-4 years, Google searches have returned fewer exact matches, especially when I search for an exact error message, or for multiple exact fragments of the same error message (when I start to get desperate, I usually split the error into two parts...).
On the other hand the non-exact hits that it returns push me from time to time in the right direction.
Having said this, I don't know, of course, whether A) I'm too old (40) and the mindset of younger searchers has changed, and/or B) Google just doesn't index tech forums as much as it used to, and/or C) there are just fewer forum posts, and/or D) my problems became more complex (I don't think so), and/or etc.
I tried (and still try from time to time) to use DDG and Bing but without success.
Hello, fellow 40-year-old. I run DDG as my daily driver, but similarly have a hard time finding exact matches for error messages. I am suspicious, however, that I've simply learned Google's search controls (quotes, "+", etc.) and I'm not sure they work the same on DDG. I also can't find a reference for how to control DDG's advanced searches.
So ... although I feel like we might be having the same issue, I'm not sure I'm using DDG correctly enough to say it's a problem.
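For what it's worth, the one operator both engines document is quoting a phrase, and it travels the same way in a query URL on both. A trivial sketch (the error string is made up):

    # Build equivalent exact-phrase query URLs for Google and DDG; the
    # quoting syntax is identical, and the example error string is made up.
    from urllib.parse import quote_plus

    phrase = "segfault at 0 ip 0000000000400535"
    q = quote_plus(f'"{phrase}"')

    print("https://www.google.com/search?q=" + q)
    print("https://duckduckgo.com/?q=" + q)

Whether the engine actually honors the quotes is, of course, exactly the complaint in this thread.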
Google search worked much better over 10 years ago than it does for me today, i.e. before it abandoned the what-you-search-is-what-you-get model. My once-masterful Google-fu seems to be borderline useless today. I'm not sure what happened over the years, but Google search has morphed into a completely different, less useful product, at least for me.
Any search term not wrapped in quotes can be randomly ignored today. It can inject keywords it thinks you want (but really don't). Google is great for searching modern sites like Stack Overflow, but it seems to have lost interest in servicing power users.
2-4 months? Try at least a year. Google search results are now that creamy scum that floats between an industrial wasteland and the tidal flats it was built upon during the changing of the tides.
Yup, that's when I started to switch to DDG. I remember Google announcing that you now needed to put words that must be included "in quotes" instead of +prefixing +them, which was annoying. But even using their new operators, I couldn't get answers like I used to be able to. I already didn't like their data-gathering practices by then, so gimping search made the transition a breeze.
Interestingly, even up to 9-12 months ago, I remember people consistently saying that DDG was so much worse than Google, which I always figured was a result of user error, or of not caring about tracking and the leverage of a Google profile. I'd been off the Google grid for a while so I couldn't really argue, but I knew that I got significantly better information from DuckDuckGo, having grown accustomed to the level of detail needed. These days I probably only use Google a handful of times a month. The idea that they are purposefully soiling search results to add value to ads and sponsored results sounds about right, honestly. Advertisements used to be much less relevant than the results I'd get if I entered a string of 5+ words, but nowadays I have to be careful not to accidentally click on an ad, as the results tend to be terrible, and I'd rather type a URL into the browser than click on an ad I'm actually interested in.
Yes, I'm having to go several pages deep, and even then I'm not finding anything relevant. I've started using other search engines and Reddit to actually find useful info.
Google poured billions into their search engine for two decades to make it better. Now that they have a ridiculous amount of money and power, the search results get... objectively worse. Which brings us to the elephant in the room: what are Google's motives behind this (clearly intentional) change?
It could be something as innocent as training a new neural net or testing a buggy version of the algorithm on subsets of users. But it could also be as sinister as driving traffic to those in bed with Google, silencing opposition, or effectively whitewashing the entire internet...
I was looking forward to seeing someone else share this opinion. So Google's behavior is driving some users away; I'm wondering why the others are sticking with it.
I propose that in the course of professional contact we should strive to avoid using "google" as a verb.
Yes, I know it's not slick to say "perform a search using the search engine" instead of "Google it";
but it starves a mentality; I think it would disconnect the G-word from being the perceived face of the internet. The whole point is that a monopoly eventually gets out of hand and starts screwing its users to its own benefit, thanks to the largesse of those users. If Google is to improve itself, we the users have to force it to, by ignoring it and going elsewhere. This, I think, starts by re-realizing, as a herd, that there is a choice other than the Alaughabet search engine [aka Google].
Most definitely. But the folk at Google are very smart, very rich, and already run the most lucrative ad platform in the world. Wouldn't hamstringing their flagship product for the sake of a few extra $B/yr harm them in the long run as more and more users switch to other search engines? They had to have considered that and made the change anyway. What's the endgame? I don't feel it's more ad clicks.
What makes you think they wouldn't? Everything else Google does seems to be in the interest of short-term profits. Look at all the great products they've shut down simply because they weren't all that profitable.
It's my opinion that a large portion of the websites on the front page of any search (Quora and Pinboard, anyone?) are completely bought and paid for.
I think the endgame ends up very close to the same every time this sort of thing happens.
A corp gets good, people like it; then it gets rich and takes on investors. When stocks and investors get involved, there is an expectation of an ever-increasing >RATE< of profit. If that rate decreases, the stock gets dumped, and if this goes on long enough, the corp becomes so interested in maximum profits over a shrinking timeslice that it basically takes everything and gives nothing in return. That is the point when it is no longer a service, and the exodus begins. [MySpace]
Maybe you had this problem before, but your expectations grew faster than the technology? Can you think of something from your search history and find anything that other search engines found but Google failed to?
Their keyboard predictions have gone from "OK" to "Amazing, we live in the future", and over the past couple years to "of course I didn't mean 'aaAAaAAnd', wtf were you thinking".
I frequently suspect they're starting to optimize more for $ than they were before, and ML just gives them more ways to make that number go up another % or so... but it often comes with impossible-to-predict and wildly inhuman edge cases. It's a pretty common trend when companies start focusing on small number increases - each A/B test shows improvement, but the product as a whole worsens and it drives people away in time.
About 2-3 months ago they basically nuked YouTube's search and recommendations. This was associated with some bad press about those features surfacing "harmful content" like unapproved radical politics and conspiracy theories. Now you mostly see curated front-page stuff plus some user content that had probably never come up in search before (e.g. a fairly common search term will turn up videos that are a decade old and only have 5k views). Maybe the changes in Google search are related?
IMO, YouTube changed for the better. It used to focus on the controversial and current; now it focuses on curated and evergreen content. Exactly the kind of thing people in this thread are missing from Google Search.
Yep, I've noticed a lot more commercial results than before. To find something relevant I often have to dig deep, especially if what I'm looking for is a little bit obscure. I'm glad you mentioned it.
Google Images is a partial example, though this happened a while ago. It appears what they do is use ML to classify what is in the image, and then show images that fit those categories. It is useless now for checking things like whether the logo designer you hired off Upwork/Fiverr/etc. just stole someone else's design.
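That particular check is easy to do yourself with perceptual hashing. A sketch, assuming the third-party Pillow and imagehash packages (the filenames are made up):

    # Near-duplicate check via perceptual hashing; assumes the Pillow and
    # imagehash packages, and the filenames are hypothetical. Perceptual
    # hashes stay close under resizing/recompression, unlike ML category
    # labels ("a bird logo"), which match thousands of unrelated images.
    from PIL import Image
    import imagehash

    h1 = imagehash.phash(Image.open("delivered_logo.png"))
    h2 = imagehash.phash(Image.open("suspected_original.png"))

    # Hamming distance between the hashes: 0 means identical,
    # small values (roughly <= 8) usually mean a near-copy.
    print(h1 - h2)

Of course, that only works when you already have a suspect image; the thing Google Images lost is the reverse lookup that found the suspect for you.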
Aspiring science fiction authors, or Neal Stephenson, should write a novel about a world where ML-tuned models optimize everything to be just good enough not to churn customers while maximizing margins. (Also applicable to non-profit entities like politicians and universities.)
I've had this problem recently. I can craft a search for something just slightly obscure and specific that should, nonetheless, have had plenty of hits on the "old web", let alone now on the many-times-larger web. But "no pages found". Loosen up the search and it's nothing but Google-friendly blogspam that isn't remotely related to what I'm trying to find. I call bullshit.
Heh, oh yeah, tons of that, usually the ones most relevant to narrowing the search beyond "everything on the Web". Thanks, Google.
So then I do the quotes thing, especially quoting phrases that 100% for sure must exist on some web pages, along with all my other keywords and pretty soon I'm at "no pages found". Pull back just a little, and it's page after page of entirely unrelated-to-what-I-want blogspam.
Search Console isn't really helpful in many cases. Unless there's an error, it'll probably say "crawled but not indexed", which gives you no idea why they didn't include it.
If I'm searching for something that I know exists and I can't find it, there is no excuse. The search provider failed to do its job.
There is no excuse, but people will say the webmaster should have done this and that. Maybe he was hit by a bus 10 years ago, and we should be happy the content is still available at all.
A good search provider would link a vanished website to archive.org if that content is exactly what the customer wanted.
Long, long ago, when posting interesting links in comments didn't trigger commercial hysteria, people would cite bits of text and link to the full text. Later this became simply citing a chunk of text. I used to drop a few lines from the citation into the search engine and find the original work.
From what I can tell, there's no links anywhere on the site to that particular page, you have to know the exact term and search it: http://www.gnoosic.com/discussion/
That's a good point. Googlebot probably wouldn't try out combinations in the search box, so unless the site owner provides a sitemap, Google wouldn't know about the entries.
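For a site like that, where pages are reachable only through a search box, a sitemap is roughly the only fix. A minimal sketch of generating one (the second URL is hypothetical):

    # Minimal sitemap.xml generator; the structure is the standard
    # sitemaps.org format, and the second URL below is hypothetical.
    from xml.sax.saxutils import escape

    def sitemap(urls):
        entries = "\n".join(f"  <url><loc>{escape(u)}</loc></url>" for u in urls)
        return (
            '<?xml version="1.0" encoding="UTF-8"?>\n'
            '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
            f"{entries}\n"
            "</urlset>"
        )

    print(sitemap([
        "http://www.gnoosic.com/discussion/",  # the page from the comment above
        "http://www.gnoosic.com/faq/",         # hypothetical second entry
    ]))

Crawlers don't type into forms, so anything not linked or listed simply doesn't exist from their point of view.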
I don't think the article's premise is that Google axed all content older than 5 years or so, but that it gradually discards old unique content.
Which goes against the original mission of Google to "organize the world's information and make it universally accessible".
A "bug" could be an option, but I don't expect that to be the reason. It's too easy to find examples of forgotten content. And I don't think a bug of that magnitude in Googles core business would go unnoticed.
Search. They still form a major part of their business. Through direct ad revenues but also to redirect traffic to other Google products (e.g. Maps, Youtube).
Part of the problem is that their algorithm has become weighted against blogs and personal websites.
> Rumors spread that large link pages (for surfing) might be considered “link farms” (and yes on SEO sites they were but these things eventually trickle down to little personal site webmasters too) so these started to be phased out. Then the worry was Blogrolls might be considered link farms so they slowly started to be phased out. Then the biggie: when Google deliberately filtered out all the free hosted sites from the SERP’s (they were not removed completely just sent back to page 10 or so of the Google SERP’s) and traffic to Tripod and Geocities plummeted. Why? Because they were taking up space in the first 20 organic returns knocking out corporate and commercial sites and the sites likely to become paying customers were complaining.
You're jumping from describing observable results to a state of mind or motive which you can't observe.
> Then the worry was Blogrolls might be considered link farms so they slowly started to be phased out. Then the biggie: when Google deliberately filtered out all the free hosted sites from the SERP’s...
That's all observable fact.
> Why? Because they were taking up space in the first 20 organic returns knocking out corporate and commercial sites and the sites likely to become paying customers were complaining.
I think the more reasonable, less diabolical motive was that the blogs and free hosted sites were largely link farms that no one wanted to visit.
It sucks for the few legitimate pages on those platforms, but the legitimate page was the rare gem in a minefield of automated copies of other blogs, just with SEO links and ads inserted.
It's like a comments section: without moderation or captchas or both, a "thriving local community" on, say, a small town news site can be overwhelmed by automated pharmaceuticals spam. Then the newspaper kills the comment section, not out of any malice towards the original community but because they don't want to deal with the spam.
And yeah, dealing with spam and black hat SEO does take resources. If you (or worse, your chosen blog host) don't keep the weeds down, soon your pasture will be overrun and burned off.
I absolutely agree with you that whether Google is intentionally diabolical or not is up in the air. My reason for quoting Brad there is to succinctly recount a history where Google has been a menace (deliberate or not) to individual blogs and websites. Blog rolls were absolutely a great way to discover new blogs and were hardly “link farms” but were an incredibly valuable resource. (An equivalent to modern friend lists.)
Where I don't agree with you is in the portrayal of the Web as largely comprised of link farms and "few legitimate pages". I spend a lot of my time cataloging the hidden corners of the Web, and it is mostly individuals working on their personal Web projects. Spam is simple to identify (much more so than 'clickbait'), and the reason people don't read personal websites any more isn't that interesting and mind-blowing projects on the Web are too rare. (I don't have statistics to back this up, but I feel like they are more common on the Web than on social media.)
The problem is that blogspam is now a (legitimate) industry much bigger than Google can manage.
Google Search became a playground for marketing firms to dump content made by low-paid freelancers with algorithmically chosen keywords, links, and headers. It's SEO at industrial scale. Everything is monitored via analytics and automatically posted to WordPress. Every time Google tweaks its algorithm to catch it, they A/B test and then change thousands of texts all at once.
Personal blogs can't even dream about competing with that.
In fact, those companies actively compete with personal blogs: via tools like SEMrush and social media monitoring, they know which blogs are trending, and they use their tools to produce copycat content re-written by freelancers and powered by their SEO machine.
I know a startup that is churning out ten thousand blog posts per day on clients' blogs, each costing 2 to 5 dollars for a freelancer to write according to algorithmically defined parameters.
Just wait until they get posts written via OpenAI-style machine learning: the quality will be even lower.
Not only that: there's no need for black hat SEO anymore. Blog posts from random clients have links to other clients' blogs, generated algorithmically to maximize views and satisfy Google's algorithm. They have a gigantic pool of seemingly unconnected blogs to link to, so why not use it?
The irony is that companies buy this kind of blogspam to skip paying for AdSense. Why pay when you can get organic search results? So not only are they damaging the usefulness of the SERP, they're directly eating into Google's bottom line. These blogs also have ZERO paid advertising inside them, since they are themselves the advertising.
That's the reason Bing, DuckDuckGo and Yandex still have "old web" results.
That puts Google in a very difficult position and IMO they're not wrong to fight it.
Well, I disagree. (Though I think your account of things is correct!) Certainly, if you look at this as a bot war, then Google's actions make sense: we need our bots to outsmart the 'bots' (human bots, even!) that are writing blogs.
But look at it another way: you have lots of humans writing - and it's all of varying quality. Why not let the humans decide what's good? The early Web was curated by humans, who kept directories, Smart.com 'expert' pages, websites and blogrolls that tried to show where quality could be found. Google's bot war (and the idea that Google is the sole authority on quality) eliminated these valuable resources as collateral damage.
Maybe the problem is that PageRank (or whatever they call it these days) has run its course. I mean, it's supposed to gauge "what humans think is good", but it's failing miserably. It's indeed time for a more curated, artisanal web.
PageRank is predicated on an assumption that most pages (and thus, most links) are created/curated by humans. This was true when it was invented, but appears to be less likely now.
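Here's PageRank in miniature, as plain power iteration on a toy graph (the page names are hypothetical), to make that assumption concrete: every link counts as a human vote, so a small ring of bot pages voting for each other manufactures rank for free.

    # Toy PageRank via power iteration. Every inbound link is treated as
    # a vote; the hypothetical three-page "farm" ring out-ranks the one
    # human page simply by linking to itself in a cycle.
    def pagerank(links, d=0.85, iters=50):
        n = len(links)
        rank = {p: 1.0 / n for p in links}
        for _ in range(iters):
            new = {p: (1 - d) / n for p in links}
            for page, outs in links.items():
                for target in outs:
                    new[target] += d * rank[page] / len(outs)
            rank = new
        return rank

    links = {
        "human": ["farm1"],
        "farm1": ["farm2"],
        "farm2": ["farm3"],
        "farm3": ["farm1"],
    }
    print(pagerank(links))  # the farm pages end up with most of the rank

(The real ranking system has many defenses layered on top, of course; the sketch only shows why the base assumption matters.)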
What gives me pause here is all the anecdotes in this thread about other engines getting results right. If the real answer is "PageRank has been successfully flooded by bots", then everyone would have bad results.
What I suspect, off nearly no evidence, is that Google is using ad tracking to inform a notion of search relevancy. My nearly unjustified belief is that that system is the one being flooded by bots.
You can see some evidence suggesting this when you search for a specific piece of software or an ebook to download.
Piracy is gone from the results, but you will find hundreds of automatically generated credit-card-phishing sites full of Google Ads, sometimes promising pirated versions but serving a trojan, sometimes showing a credit card form. Some of them are on the first page, sometimes ahead of legitimate websites.
But if their efforts at fighting it are a large part of the reason that Google search results are getting downright bad, then they're wrong in how they're fighting it.
What I mean is: I don't think their fight is misguided or evil this time; they're trying to keep the result pages usable for end users. They're just doing a terrible job of it. (Or: they're doing a worse job than the spammers.)
> SEO seems to have become a huge obstacle course that smaller websites can't play.
Absolutely right. I recently started a blog and was disheartened to learn that I have to sign up for accounts with several search engines, conform to their standards and rules, give them a bunch of data... and still sometimes have mysterious indexing issues with no real recourse. How much time and effort do I really want to spend playing the SEO game? I have a job, projects, and hobbies; I don't have the time or patience to play their game of "let's fuck with things randomly until you get indexed and ranked higher". That was fun for a few hours, but I'm done with it.
If you decide to start a blog again, please contact me - I will list you in my monthly "href hunt" - a raw dump of newly discovered sites. And I can point you to directories like personalsit.es that list blogs.
And, of course, consider having a blogroll of the sites you follow, which is our own little way of contributing to the effort of finding each other. :)
That's exactly what it is, and Google is also incentivizing many low-quality sites to engage disproportionately in SEO to boost their Google AdSense earnings.
A friend spoke to an SEO analyst just yesterday and it seems the counterplay is to add "recency" to your posts.
If you have an older post that's great but unchanged, it'll become less prominent. So go in, edit in some changes, and now it's fresh and ready to be indexed prominently again.
If this is how it goes, I guess it helps in a way. The articles we care about get attention and don't drop off. But there's so much of the old web we might lose in the haystack.
The first paragraph of the article mentions the story of Tim Bray [0], which is exactly about this: Google forgetting an article that did not change location.
We've come to rely on Google so much that if you are not on Google, you don't exist. That's a problem for researchers who are looking for articles to cite.
>The way their algorithm most likely deals with this is a mixture of domain rank + tenure... how long has this copy of this article existed on this domain, and can we be sure this is the original copy?
This rationalization doesn't change the fact that it's increasingly hard or impossible to find certain things on Google, that they are effectively biased against certain types of websites and certain types of pages (even when the content is perfectly good), and that other search engines seem to deal with these issues much better.
"Google is not forgetting the old web unless we see evidence of content disappearing from the index that have been consistently hosted at the same domain & URL since their original posts."
I can, very loosely and anecdotally, confirm.
My personal website has been online for about 20 years, and I just picked some deep strings of text and searched for them; Google has the whole thing indexed just fine...
Google has weird rules and inconsistent indexing. I recently published a two-part article on using GraphQL/Apollo with React and Rails; it never indexed the first part (the GraphQL/Rails bit) but did index the second part (React and Apollo). And in fact, searching for "graphql rails react apollo" still doesn't show this page on Google despite it ostensibly being indexed, but it shows up on DuckDuckGo. And looking over the results on Google, only ~4 are actually relevant to the topic, so it's not like good content is being shown instead.
It seems like the opposite actually -- spam is destroying Google.
They're so big that it's worth blackhats spending significant resources to game their algorithm. That induced them to implement a spam filter which is now discarding the ham along with the spam.
Which means that smaller search engines that aren't being targeted by spammers are now giving better results. That is a major long-term problem for Google if they can't avoid throwing the baby out with the bathwater like this.
People only use Google because it has historically had the best results. They'll get some way on inertia now, but that doesn't last forever. They need to fix this or they're ultimately in trouble, and we could be heading for a landscape where being a search engine above a threshold size is a liability.
IME, it's not blackhats anymore causing the problem. It's (legitimate, but shady) marketing agencies and startups handling thousands of customers and with deep pockets to do SEO research.
Blackhat is as blackhat does; it makes no difference how or why you screw users by failing to be forthright and candid. If you do it as a matter of course, that's blackhat.
The web is fine, and search is fine. It's specifically Google search that's being destroyed by spam.
It's odd to put forward the hypothesis that DuckDuckGo is now better at search (aggregation) than Google is at search. But that seems to be where we have landed.
I think it may be a simple consequence of the fact that Google Search is increasingly less of a searching engine and more of an answering engine.
I think Google has been explicit about this (I may be wrong, but I seem to remember thinking about this because Google themselves said it). Essentially, I believe, they are no longer concerned about being a way to navigate all the material found on the internet. Instead, they are concerned with answering the question posed by each search attempt.
A few years ago they made a push to answer questions to the point it was in their product description on their "how Google search works" page. To quote it exactly, it used to say their objective is to "return timely, high-quality, on-topic, answers to people's questions."
And that's kind of the whole problem and why there is space for a search that actually returns results from the web in a clear and logical way.
> I think it may be a simple consequence of the fact that Google Search is increasingly less of a searching engine and more of an answering engine.
I've been thinking about this, and it seems very plausible to me. Which means that Google Search isn't really "search" anymore -- which explains why it's become so bad at that!
Too bad. I remember when Google had the best search engine going. It was a real game-changer. Those days are long gone.
In my experience, Google works better if you ask it questions like that. Not good enough, especially if you're looking for something specific and technical, but better.
They even have a tool just for this: canonical URLs. It lets a website specify which version of a page is the source/canonical one, and keeps old copies from being indexed as the original.
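For reference, the tag is a one-liner in the page head, and you can check what any page declares. A sketch from the crawler's side, assuming the third-party requests and beautifulsoup4 packages (the URL is a placeholder):

    # Read a page's declared canonical URL; assumes the requests and
    # beautifulsoup4 packages, and the target URL is a placeholder.
    import requests
    from bs4 import BeautifulSoup

    html = requests.get("https://example.com/some-article", timeout=10).text
    tag = BeautifulSoup(html, "html.parser").find("link", rel="canonical")
    print(tag["href"] if tag else "no canonical URL declared")

The catch, as the parent comments note, is that a canonical tag on your 2013 repost does nothing about the 2006-2012 copies that never pointed at it.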