Hacker News

This author is in his own little bubble and doesn't understand the vast amount of blog-repost spam that Google has to deal with. The way their algorithm most likely deals with this is a mixture of domain rank + tenure: how long has this copy of this article existed on this domain, and can we be sure this is the original copy?

The author says the article was removed in 2006 ("[...] posts, were not accessible anymore") and then he re-posted the article at a new domain in 2013. That means any copy/crawl/repost of the article from 2006-2012 is now the oldest living, and thus "original", version of the article. His 2013 repost was seen as just another blog-spam copy.

Google is not forgetting the old web unless we see evidence of content disappearing from the index that has been consistently hosted at the same domain and URL since it was originally posted. Unless you properly 301 your URLs to new locations and consistently host your content, it's a guessing game for the crawler to determine where the original content has moved.
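The tenure-plus-domain-rank guess above can be sketched as a toy scoring function. Everything here (the field names, the sample URLs, the idea that first-seen date dominates domain rank) is an assumption for illustration; Google's actual duplicate-detection signals are not public:

```python
from dataclasses import dataclass

@dataclass
class Copy:
    url: str
    first_seen: int      # year the crawler first saw this copy
    domain_rank: float   # 0.0 (unknown) .. 1.0 (highly trusted)

def pick_canonical(copies):
    """Guess which copy of duplicated content is the original.

    Earlier first-seen dates dominate; domain rank only breaks ties.
    This mirrors the scenario in the comment: a 2013 repost loses to
    any scrape that survived from 2006-2012.
    """
    return min(copies, key=lambda c: (c.first_seen, -c.domain_rank))

copies = [
    Copy("https://scraper.example/article", first_seen=2007, domain_rank=0.1),
    Copy("https://newdomain.example/article", first_seen=2013, domain_rank=0.6),
]
# Under this heuristic the 2007 scrape wins on tenure despite its
# low domain rank, and the author's repost looks like just another copy.
```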



Here is an example:

http://www.gnoosic.com/discussion/metallica__5.html

No matter how you search for the content on Google, nothing comes up:

https://www.google.com/search?q="Metallica+only+played+2+son...

DuckDuckGo has it:

https://duckduckgo.com/?q="Metallica+only+played+2+songs+fro...

I checked the Wayback Machine and the content has consistently been at that URL for over 10 years.

This is the first example of an old forum page I tried after reading the article. So I tend to think it's true. Google is discarding the "classic" web.


Anecdotally and perhaps unrelated - has anyone else noticed a decrease in the accuracy and general quality of Google search over the past 2-4 months? They must have been utilizing ML to 'improve' searches for some time now, but the quality of the results has decreased suddenly and inexplicably (for me).


> Anecdotally and perhaps unrelated - has anyone else noticed a decrease in the accuracy and general quality of Google search over the past 2-4 months

Yes. Not just over the past 2-4 months, but over the past five years or so.

It's become so bad that Google is no longer the most useful search engine for me.


It all started going downhill with Google's "Hummingbird" switch, to be honest. While interviewing at Google, I actually brought this up with an engineer on the search team during lunch.

He said they hadn't noticed any regressions. I said I figured that would be the case, but that I can definitely feel the difference as a daily user.


This is indicative of a larger issue: testing is probably as hard as the halting problem (i.e., truly complete tests could generate the code themselves), yet teams tend to trust their tests completely. I see high-profile websites having severe usability issues, or being outright broken, in ways that would be immediately caught by "interns randomly click here and there" usability tests. But these versions got deployed, probably because testing did not show any regressions.
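The "interns randomly click here and there" test is essentially monkey testing, and a toy version is easy to sketch. The site graph and the broken page below are invented; a real harness would drive a headless browser rather than a dict:

```python
import random

# Toy site graph: page -> list of links it contains.  A None entry
# marks a page that renders broken (the kind of regression a random
# click-through would stumble onto).
SITE = {
    "/": ["/search", "/about"],
    "/search": ["/results"],
    "/results": ["/", "/broken"],
    "/about": ["/"],
    "/broken": None,
}

def monkey_test(site, start="/", clicks=2000, seed=0):
    """Randomly click through the site; return the broken pages reached."""
    rng = random.Random(seed)
    page, broken = start, set()
    for _ in range(clicks):
        links = site.get(page)
        if links is None:       # landed on a broken page
            broken.add(page)
            page = start        # "reload" and keep clicking
            continue
        if not links:           # dead end, start over
            page = start
            continue
        page = rng.choice(links)
    return broken
```

With enough random clicks the walk reaches every page that is actually linked from somewhere, which is exactly why this crude approach catches breakage that unit tests miss.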

I tend to believe that if user complaints about new problems or regressions rise above statistical noise, there is a problem.
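One way to make "above statistical noise" concrete: treat the historical complaint count as roughly Poisson-distributed and flag anything more than a few standard deviations above the mean. The three-sigma threshold and the sample numbers below are illustrative only:

```python
import math

def complaints_above_noise(baseline_counts, current, sigmas=3.0):
    """Flag a complaint count that exceeds the historical mean by more
    than `sigmas` standard deviations (Poisson approximation: the
    standard deviation is about sqrt(mean))."""
    mean = sum(baseline_counts) / len(baseline_counts)
    return current > mean + sigmas * math.sqrt(mean)

# Twelve ordinary weeks of ~10 complaints each, then a spike to 25.
history = [9, 11, 10, 12, 8, 10, 11, 9, 10, 12, 9, 10]
```

For this history the mean is about 10 and the three-sigma threshold about 20, so a week with 25 complaints trips the alarm while a week with 12 does not.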


> yet teams tend to trust their tests completely.

Well said. This is a big problem. We see a similar problem with the use of telemetry data as well.


I noticed the same. I've wondered for years why it happened and sometimes when I'm frustrated I try to think about it. But I am not entirely sure that the degradation in search results for me happened only in the past 5 years. Maybe, but I'm not sure.

I had no idea about Hummingbird though.


In 2009 Google was amazing for diagnosing Linux issues. I would just copy the error from the console and I'd have links to the issue tracker, a work around and the version in which the bug was fixed. Today I get a link to some github project that has nothing to do with what I'm working on and was closed as being an upstream issue.


I don't have the time, money or energy to build a specific crawler, but a Linux search engine that indexed all the major distros, packages, mailing lists, forums and issue trackers would be amazing.


> Google's "Hummingbird"

I had assumed that Google search had gone downhill because it started trying to "personalize" my search results. That wasn't a great explanation though, as I don't use a Google account.

Hummingbird seems a much more likely explanation.


Oh, they still "personalize" your search results.

I do a full clear on my web browser (cookies / offline storage / history, everything) and then open YouTube in a private browsing window and it asks me which of my two Gmail accounts I want to log in with. I'd guess it's just a combo of external IP and browser fingerprint, but it's creepy.
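A toy illustration of why clearing cookies doesn't help: a handful of stable, cookie-independent attributes already hash to a fairly unique identifier. The attribute list below is invented and much shorter than what real fingerprinting scripts collect (canvas rendering, installed fonts, audio stack, etc.):

```python
import hashlib

def fingerprint(ip, user_agent, screen, timezone, language):
    """Hash stable, cookie-independent attributes into one short ID."""
    raw = "|".join([ip, user_agent, screen, timezone, language])
    return hashlib.sha256(raw.encode()).hexdigest()[:16]

before = fingerprint("203.0.113.7", "Mozilla/5.0 ...", "2560x1440", "UTC-5", "en-US")
after_clear = fingerprint("203.0.113.7", "Mozilla/5.0 ...", "2560x1440", "UTC-5", "en-US")
# Same attributes -> same ID, no cookies or local storage required.
```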


I know they do, and I consider this a real problem. I was just saying that personalization isn't a completely satisfactory explanation for the decline in Google search result quality. It is likely to be a factor in that, though.


I noticed the same but only on an absolute level. Compared to Bing, DDG & Co, Google is still by far the best search engine.


What would you recommend instead?


DuckDuckGo.com is my daily driver.


I love the concept of DDG and have it as my standard but still use Google (via !g) for about 30-50% of my queries. Simple queries work well in DDG (which is basically Bing) but more complicated queries only really work in Google.


Sadly I've been finding the same result. Exact searches on Google are often frustrating, but lately they've been all but impossible on DDG. It seems that all search engines (including those backing DDG) are getting on the ML train and assuming they know what I'm looking for better than I do.

I understand this being default behavior, but there really needs to be a way to disable it.


DDG has become my first stop. It gets me what I need 90% of the time.


I found the opposite. I started using DDG when I moved to Brave, but after a month I found I would go to DDG, search page after page, get frustrated, and then open Google and have my result on page one or two.


I've heard others say similar things. That's simply not my experience. I wonder if it depends on the sorts of searches that we each tend to perform?


For me English has been working decently on DDG but last time I tried I had a really hard time getting decent results in other languages.


Qwant might be a good option, as it's European it should be better for searching in (some) other languages.

https://lite.qwant.com/


Personally, my impression is that for at least the past ~1-4 years, Google has returned fewer exact matches, especially when I search for an exact error message, or for multiple exact phrases from the same error message (when I start to become desperate I usually tend to split the error into 2 parts...).

On the other hand the non-exact hits that it returns push me from time to time in the right direction.

Having said this, I don't know of course if A) I'm too old (40) and the mindset of the younger search-people has now changed and/or B) Google just doesn't index tech forums as much as it used to and/or C) there are just fewer forum-posts and/or D) my problems became more complex (don't think so) and/or etc... .
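The "split the error into 2 parts" fallback described above can be written down as a small query ladder, from strictest to loosest. This is just query-string construction, not any documented search-engine API:

```python
def query_ladder(error_message):
    """Yield progressively looser queries for an error message:
    1. the whole message as one exact phrase,
    2. each half of the message as its own exact phrase,
    3. the bare words with no quotes at all."""
    words = error_message.split()
    yield f'"{error_message}"'
    if len(words) >= 4:
        mid = len(words) // 2
        yield f'"{" ".join(words[:mid])}" "{" ".join(words[mid:])}"'
    yield " ".join(words)
```

For `query_ladder("segfault in libfoo.so at offset 0x42")` this produces the full quoted phrase, then the two quoted halves, then the unquoted words, which is roughly the manual ritual described in the comment.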

I tried (and still try from time to time) to use DDG and Bing but without success.

Does anybody else have the same impression?


Hello fellow 40 year old. I run DDG as my daily, but similarly have a hard time finding exact matches for error messages. I am suspicious, however, that I've learned how to use Google's search controls (", +, etc.) and I'm not sure they work the same on DDG. I also can't find a reference on DDG for how to control advanced searches.

So ... although I feel like we might be having the same issue, I'm not sure I'm using DDG correctly enough to say it's a problem.


Google search worked much better over 10 years ago than it does for me today, i.e. before it abandoned the what-you-search-is-what-you-get model. My once-masterful Google-fu seems to be borderline useless today. I'm not sure what happened over the years, but Google search has morphed into a completely different, less useful product, at least for me.

Any search term not wrapped in quotes can be randomly ignored today. It can inject keywords it thinks you want (but really don't). Google is great for searching modern sites like Stack Overflow, but it seems to have lost interest in servicing power users.


2-4 months? Try at least a year. Google search results are now that creamy scum that floats between an industrial wasteland and the tidal flats it was built upon during the changing of the tides.


Upwards of five years, actually. It was already declining when they decided to fuck the +WORD operator for their Facebook ripoff.


So you're also saying ever since Hummingbird [0], Google search hasn't been the same.

I agree.

[0] https://en.wikipedia.org/wiki/Google_Hummingbird


Yup, that's when I started to switch to DDG. I remember Google saying that you now needed to "put words in quotes" instead of adding '+' +before +words that must be included, which was annoying. But even using their new operators, I couldn't get answers like I used to be able to. I already didn't like their data-gathering practices by then, so gimping search for me made the transition a breeze.

Interestingly, even up to 9-12 months ago I remember people consistently saying that DDG was so much worse than Google, which I always figured was a result of user error, or of not caring about tracking and leveraging the Google profile. I'd been off the Google grid for a while so I couldn't really argue, but I knew that I got significantly better information from DuckDuckGo, having grown accustomed to the level of detail needed. These days I probably only use Google a handful of times a month.

The idea that they are purposefully soiling search results to add value to ads and sponsored results sounds about right, honestly. Advertisements used to be much less relevant than the results I'd get if I entered a string of 5+ words, but now I have to be careful not to accidentally click on an ad, as those results tend to be terrible; I'd rather type a URL into the browser than click on an ad I'm actually interested in.


I asked Jeeves about this; he picked up a Magic 8 Ball and it said, "not so good".


Yes, I'm having to go several pages deep and even then not finding anything relevant. I've started to use other search engines and Reddit to actually find useful info.


Google poured billions into their search engine for two decades to make it better. Now that they have a ridiculous amount of money and power, the search results get... objectively worse. Which brings us to the elephant in the room: what are Google's motives behind this (clearly intentional) change?

It could be something as innocent as training a new neural net or testing a buggy version of the algorithm on subsets of users. But it could also be as sinister as driving traffic to those in bed with Google, silencing opposition, or effectively whitewashing the entire internet...


> Which brings us to the elephant in the room: what are Google's motives behind this (clearly intentional) change?

They are maximising ad revenue, not the search relevance/usefulness.


I was looking forward to seeing someone else share this opinion. So Google's behavior is driving some users away; I'm wondering why others are sticking with it. I propose that in the course of professional contact we should strive to avoid the use of "google" as a verb. Yes, I know it's not slick to say "perform a search using a search engine" instead of "google it", but it starves a mentality; I think it would disconnect the G-word from being the perceived face of the internet. The whole point is that a monopoly eventually gets out of hand and starts screwing its users, to its own benefit, due to the largesse of the users. If Google is to improve itself, we the users have to force it to, by ignoring it and going elsewhere. This, I think, starts by re-realizing, as a herd, that there are choices other than the Alaughabet search engine [aka Google].


> I propose that in the course of professional contact we should strive to avoid use of google as a verb.

Stopped using "google" as a verb a long, long time ago, in favor of just saying "search". I don't think that's ever confused anyone.


One possibility is that Google hasn't gotten worse, but the spammers have come up with new techniques that Google hasn't adapted to.


Maybe bad organic results lead to more ad clicks.


Most definitely. But the folk at Google are very smart, very rich, and already run the most lucrative ad platform in the world. Wouldn't hamstringing their flagship product for the sake of a few extra $B/yr harm them in the long run as more and more users switch to other search engines? They had to have considered that and made the change anyway. What's the endgame? I don't feel it's more ad clicks.


What makes you think they wouldn't? Everything else Google does seems to be in the interest of short-term profits. Look at all the great products they've shut down simply because they weren't all that profitable.

It's my opinion that a large portion of the websites on the front page of any search (quora and pinboard anyone) are completely bought and paid for.


I think the endgame ends up very much the same every time this sort of thing happens: a corp gets good, people like it, then they get rich and take on investors. When stocks and investors get involved, there is an expectation of an ever-increasing >RATE< of profit. If that rate decreases, the stock drops, and if this goes on long enough, the corp becomes so interested in maximizing profits over a shrinking timeslice that it basically takes everything and gives nothing in return. That is the point at which it is no longer a service, and the exodus begins. [MySpace]


Maybe you had this problem before, but your expectations grew faster than the technology? Can you think of something from your search history and find anything that other search engines found but Google failed to?


Their keyboard predictions have gone from "OK" to "Amazing, we live in the future", and over the past couple years to "of course I didn't mean 'aaAAaAAnd', wtf were you thinking".

I frequently suspect they're starting to optimize more for $ than they were before, and ML just gives them more ways to make that number go up another % or so... but it often comes with impossible-to-predict and wildly inhuman edge cases. It's a pretty common trend when companies start focusing on small number increases - each A/B test shows improvement, but the product as a whole worsens and it drives people away in time.


About 2-3 months ago they basically nuked Youtube's search and recommendation. This was associated with some bad press about those features coming up with "harmful content" like unapproved radical politics & conspiracy theories. Now you basically see mostly curated front-page stuff plus some user stuff that had probably never come up in search before (e.g. a fairly common search term will come up with videos that are a decade old and only have 5k views). Maybe changes in Google search are related?


IMO, Youtube changed for the better. It used to focus on the controversial and current; now it focuses on curated and evergreen content. Exactly the kind of thing people in this thread are missing from Google Search.

Maybe some similar change is coming to Search.


Yep, I've noticed a lot more commercial results than before. To find something relevant I often have to dig deep, especially if what I'm looking for is a little bit obscure. I'm glad you mentioned it.


Yes! This morning I was not finding exactly what I was looking for on DDG and fell back to Google, and the results were quite noticeably worse.

To me they started spiraling down when they started to give too much power to designers. Form over content is a terrible idea for a search engine ...


Google Images is a partial example, but this happened a while ago. It appears what they do is use ML to classify what is in the image, and then show images that fit those categories. It is useless now for checking things like, did this logo designer you hired off Upwork/Fiverr/etc just steal someone else's design.

Aspiring science fiction authors, or Neal Stephenson, should write a novel about a world where ML tuned models optimize everything to be just good enough not to churn customers while maximizing margins. (Also applicable to non-profit items like politicians and universities)
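For the "did this designer steal the logo" use case, exact and near-duplicate lookup doesn't actually need ML classification; a perceptual hash such as average-hash is the classic technique. Below is a dependency-free sketch operating on an 8x8 grayscale grid (real implementations first resize the image down to 8x8 and then compare hashes by Hamming distance):

```python
def average_hash(pixels):
    """pixels: an 8x8 grid of grayscale values (0-255).
    Each bit of the hash records whether a pixel is above the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return sum(1 << i for i, p in enumerate(flat) if p > mean)

def hamming(a, b):
    """Number of differing bits; a small distance means near-duplicate."""
    return bin(a ^ b).count("1")

# A toy "logo", a lightly recompressed copy, and a genuinely different image.
logo = [[(r * 8 + c) * 4 for c in range(8)] for r in range(8)]
copy_of_logo = [[min(255, p + 2) for p in row] for row in logo]
recolored = [[255 - p for p in row] for row in logo]
```

The copy hashes to (nearly) the same value as the original, while the recolored image lands far away, which is how a reverse-image search can flag stolen designs without any classifier.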


Google Images still checks for exact matches. The ML stuff is an extra.


Can you give an example from your search history? How can you quantify that results got worse?


I've had this problem recently. I can craft a search for something just slightly obscure and specific that should, nonetheless, have had plenty of hits on the "old web", let alone now on the many-times-larger web. But "no pages found". Loosen up the search and it's nothing but Google-friendly blogspam that isn't remotely related to what I'm trying to find. I call bullshit.


Loosen it up? You mean google didn't automatically remove your keywords for you?


Heh, oh yeah, tons of that, usually the ones most relevant to narrowing the search beyond "everything on the Web". Thanks, Google.

So then I do the quotes thing, especially quoting phrases that 100% for sure must exist on some web pages, along with all my other keywords and pretty soon I'm at "no pages found". Pull back just a little, and it's page after page of entirely unrelated-to-what-I-want blogspam.


https://www.google.com/search?q=site%3Awww.gnoosic.com+Metal...

Looks like only page 6 is indexed for some reason. The site owner would be able to check the webmaster tools on Google to see why.


Search console isn't really helpful in many cases. Unless there's an error, it'll probably say "crawled but not indexed", which gives you no idea why they didn't include it.


there are 3 parties:

the end user searching for the content

the webmaster or author of the content

the search provider

If I'm searching for something that I know exists and I can't find it, there is no excuse. The search provider failed to do its job.

There is no "but the webmaster should have done this and that." He was hit by a bus 10 years ago, and we should be happy the content is still available.

A good search provider would link a vanished website to archive org if the content is exactly what the customer wanted.

Long, long ago, when posting interesting links in comments didn't trigger commercial hysteria, people would cite bits of text and link to the full text. Later this became simply citing a chunk of text. I used to drop a few lines from the citation into the search engine and find the original work.

Just look!

https://www.google.com/search?q=Looks+like+only+page+6+is+in....

As I'm writing this, there are exactly 45 search results above the one that should have been displayed.

There is no excuse here, like HN not ranking high enough: they did index the page, and the other results didn't match the query better.

If we do this with 4 exact lines from a less popular site it will end up some place on page 20 of the search results.

Another example, I really don't care for indexing but here is an article that I always (jokingly) refer to as my greatest work.

The exact title:

https://www.google.com/search?q=%22the+wrath+of+the+book%22

A really weird result. Safe to say nothing matching is there.

The first many words from the text:

https://www.google.com/search?q=I+think+someone+%28you%29+sh...

It doesn't find it.

Then we check if it is even indexed...

https://www.google.com/search?q=http%3A%2F%2Fblog.go-here.nl...

And there it is! Why does it even crawl the page?

It also lists websites that have the number 8616 on them and ones with both the word "blog" and "here" in the text.

Am I not supposed to laugh?


Probably because the site is not HTTPS, and Google's ranking takes HTTPS into account: https://www.sangfroidwebdesign.com/search-engine-optimizatio...


Page 5 seems not to be indexed. Everything on the other pages can be found with Google.

you can force the site with "site:... ": https://www.google.com/search?q="metallica+only+played+2+son...

It doesn't find page 5 with these terms, but finds page 6.

There is probably an issue within the page 5 itself.


From what I can tell, there are no links anywhere on the site to that particular page; you have to know the exact term and search for it: http://www.gnoosic.com/discussion/

How is Google supposed to find that out?!


That's a good point. Googlebot probably wouldn't try out combinations in the search box, so unless the site owner provides a sitemap, Google wouldn't know about the entries.
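Generating such a sitemap for orphan pages is straightforward; here is a minimal sketch (the URLs are the thread's example pages, and the output follows the sitemaps.org format):

```python
from xml.sax.saxutils import escape

def build_sitemap(urls):
    """Render a minimal sitemaps.org XML file listing every page,
    including pages that no internal link points to."""
    entries = "\n".join(
        f"  <url><loc>{escape(u)}</loc></url>" for u in urls
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        f"{entries}\n</urlset>"
    )

sitemap_xml = build_sitemap([
    "http://www.gnoosic.com/discussion/metallica__5.html",
    "http://www.gnoosic.com/discussion/metallica__6.html",
])
```

Submitting such a file (or referencing it from robots.txt) is the standard way to tell crawlers about pages they can't discover by following links.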


How are you determining that nothing links to this? Google?


No, I'm browsing the site and I'm unable to find any links to those band-specific pages.


Seeing Metallica on HN makes me feel much more welcome :)


Well, search may have bugs (or undocumented features) too. I have googled content from other pages on this site (related to Metallica), page 6 for example: https://www.google.com/search?q=%22Listen+up+you+fags+metall...


I don't think the article's premise is that Google axed all content older than 5 years or so, but that it gradually discards old unique content.

Which goes against the original mission of Google to "organize the world's information and make it universally accessible".

A "bug" could be an option, but I don't expect that to be the reason. It's too easy to find examples of forgotten content, and I don't think a bug of that magnitude in Google's core business would go unnoticed.


>> Google's core business

Which core business are you referring to?


Search. It still forms a major part of their business, through direct ad revenue but also by redirecting traffic to other Google products (e.g. Maps, YouTube).


Interestingly, even though DuckDuckGo finds the post, Bing doesn't seem to.



This is reminding me of the meta search engines that consolidated results from multiple sources. I haven't used one of those in probably 15 years.


Not showing up here. Not even if I add quotes.


I tried with quotes and it doesn't show up, ironically. It must be without quotes.

It forced me to solve a bunch of CAPTCHAs too.


I do see it on Bing.

Also on the "Million Short" search engine mentioned by kickscondor:

https://millionshort.com/search?keywords=%22Metallica%20only...

I'd never seen that one before. Do they have their own crawler?


Thanks a lot for the example!


Part of the problem is that their algorithm has become weighted against blogs and personal websites.

> Rumors spread that large link pages (for surfing) might be considered “link farms” (and yes on SEO sites they were but these things eventually trickle down to little personal site webmasters too) so these started to be phased out. Then the worry was Blogrolls might be considered link farms so they slowly started to be phased out. Then the biggie: when Google deliberately filtered out all the free hosted sites from the SERP’s (they were not removed completely just sent back to page 10 or so of the Google SERP’s) and traffic to Tripod and Geocities plummeted. Why? Because they were taking up space in the first 20 organic returns knocking out corporate and commercial sites and the sites likely to become paying customers were complaining.

https://ramblinggit.com/2018/08/when-the-social-silos-fall/

SEO seems to have become a huge obstacle course that smaller websites can't play.


You're jumping from describing observable results to a state of mind or motive which you can't observe.

> Then the worry was Blogrolls might be considered link farms so they slowly started to be phased out. Then the biggie: when Google deliberately filtered out all the free hosted sites from the SERP’s...

That's all observable fact.

> Why? Because they were taking up space in the first 20 organic returns knocking out corporate and commercial sites and the sites likely to become paying customers were complaining.

I think the more reasonable, less diabolical motive was that the blogs and free hosted sites were largely link farms that no one wanted to visit.

It sucks for the few legitimate pages on those platforms, but the legitimate page was the rare gem in a minefield of automated copies of other blogs, just with SEO links and ads inserted.

It's like a comments section: without moderation or captchas or both, a "thriving local community" on, say, a small town news site can be overwhelmed by automated pharmaceuticals spam. Then the newspaper kills the comment section, not out of any malice towards the original community but because they don't want to deal with the spam.

And yeah, dealing with spam and black hat SEO does take resources. If you (or worse, your chosen blog host) don't keep the weeds down, soon your pasture will be overrun and burned off.


I absolutely agree with you that whether Google is intentionally diabolical or not is up in the air. My reason for quoting Brad there is to succinctly recount a history in which Google has been a menace (deliberate or not) to individual blogs and websites. Blogrolls were absolutely a great way to discover new blogs and were hardly "link farms"; they were an incredibly valuable resource. (An equivalent to modern friend lists.)

Where I don’t agree with you is in the portrayal of the Web as largely comprised of link farms with “few legitimate pages”. I spend a lot of my time cataloging the hidden corners of the Web, and it is mostly individuals working on their personal Web projects. Spam is simple to identify (much more so than ‘clickbait’), and the reason people don’t read personal websites any more isn’t that interesting and mind-blowing projects on the Web are too rare. (I don’t have statistics to back this up, but I feel like they are more common on the Web than on social media.)


> I spend a lot of my time cataloging the hidden corners of the Web and it is mostly individuals working on their personal Web projects.

That sounds interesting. Do you have a list of some interesting projects that you're willing to share?


I catalog my findings on my blog: https://www.kickscondor.com/ and I have a directory of my favorites: https://href.cool/.

Thank you for asking. If you know of any sweet links, pass them along!


Awesome, thanks for sharing!


I wish they would filter out Pinterest by default (instead of adding -pinterest), they're worse than old Blogrolls.


It's not Google's fault this time.

The problem is that blogspam is now a (legitimate) industry much bigger than Google can manage.

Google Search became a playground for marketing firms to dump content made by low-paid freelancers, with algorithmically chosen keywords, links and headers. It's SEO at large scale. Everything is monitored via analytics and automatically posted to Wordpress. Every time Google tweaks its algorithm to catch it, they're able to A/B test and then change thousands of texts all at once.

Personal blogs can't even dream about competing with that.

In fact, those companies are actively competing with personal blogs by themselves: via tools like SEMRush and social media monitoring, they know which blogs are trending and use their tools to produce copycat content re-written by freelancers and powered by their SEO machine.

I know a startup that is churning out ten thousand blogposts per day on clients' blogs, each costing 2 to 5 dollars for a freelancer to write according to algorithmically defined parameters.

Just wait until they get posts written via OpenAI-style machine learning: the quality will be even lower.

Not only that: there's no need for black hat SEO anymore. Blogposts from random clients have links to other clients' blogs, algorithmically generated to maximize views and satisfy Google's algorithm. They have a gigantic pool of seemingly unconnected blogs to link to, so why not use it.

The irony is that companies buy this kind of blogspam to skip paying AdSense. Why pay when you can get organic search results? So not only are they damaging the usefulness of the SERP, they're directly eating into Google's bottom line. These blogs also have ZERO paid advertising inside them, since they're advertising themselves.

That's the reason Bing, DuckDuckGo and Yandex still have "old web" results.

That puts Google in a very difficult position and IMO they're not wrong to fight it.


Well, I disagree. (Though I think your record of things is correct!) Certainly if you look at this as a bot war then Google's actions make sense: we need our bots to outsmart the 'bots' (human bots even!) that are writing blogs.

But look at it another way: you have lots of humans writing - and it's all of varying quality. Why not let the humans decide what's good? The early Web was curated by humans, who kept directories, Smart.com 'expert' pages, websites and blogrolls that tried to show where quality could be found. Google's bot war (and the idea that Google is the sole authority on quality) eliminated these valuable resources as collateral damage.


I agree with you.

Maybe the problem is that PageRank (or whatever they call it these days) has run its course. I mean, it's supposed to gauge "what humans think is good", but it's failing miserably. It's indeed time for a more curated, artisanal web.


PageRank is predicated on an assumption that most pages (and thus, most links) are created/curated by humans. This was true when it was invented, but appears to be less likely now.

What gives me pause here is all the anecdotes in this thread about other engines getting results right. If the real answer is "PageRank has been successfully flooded by bots", then everyone would have bad results.

What I suspect, off nearly no evidence, is that Google is using ad tracking to inform a notion of search relevancy. My nearly unjustified belief is that that system is the one being flooded by bots.
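For reference, PageRank in its original published form is plain power iteration over the link graph; this is a sketch of the 1998 algorithm, not whatever Google runs today. The toy graph illustrates the assumption under discussion: scores come entirely from who links to whom, which is exactly why bot-created links can distort them:

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: page -> list of pages it links to.
    Returns a rank per page, summing to 1.0."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Everyone gets the teleport share, then rank flows along links.
        new = {p: (1.0 - damping) / n for p in pages}
        for p, outs in links.items():
            if not outs:  # dangling page: spread its rank everywhere
                for q in pages:
                    new[q] += damping * rank[p] / n
            else:
                for q in outs:
                    new[q] += damping * rank[p] / len(outs)
        rank = new
    return rank

# A cycle of well-linked pages plus one page nobody links to.
graph = {"a": ["b"], "b": ["c"], "c": ["a"], "spam": ["a"]}
```

In this graph "spam" only ever receives the teleport share, while "a" also collects rank from "c" and "spam"; the scores are a pure function of the link structure, human-made or not.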


You can see some evidence that suggests it when you search for a specific piece of software or an ebook to download.

Piracy is gone, but you will find hundreds of automatically generated credit card phishing sites full of Google Ads, sometimes promising pirated versions but serving a trojan, sometimes showing a credit card form. Some of them are on the first page, sometimes before legitimate websites.


> IMO they're not wrong to fight it.

But if their efforts in fighting it are a large part of the reason that Google search results are getting downright bad, then they're wrong in how they're fighting it.


I agree with you.

What I mean is: I don't think their fight is misguided or evil this time; they're trying to keep the result pages usable for end users. They're just doing a terrible job of it. (Or: they're doing a worse job than the spammers.)


>It's not Google's fault this time.

Isn't Google responsible for making Internet advertising accessible and widespread? They developed and launched AdWords (2000) and AdSense (2003).


> SEO seems to have become a huge obstacle course that smaller websites can't play.

Absolutely right. I recently started a blog, and was disheartened to learn that I have to sign up for accounts with several search engines, conform to their standards and rules, give them a bunch of data... and still sometimes have mysterious issues with indexing with no real recourse. How much time and effort do I really want to spend to play the SEO game? I have a job, projects, and hobbies; I don't have the time or patience to play their game of "let's fuck with things randomly until you get indexed and ranked higher". That was fun for a few hours, but I'm done with it.


If you decide to start a blog again, please contact me - I will list you in my monthly "href hunt" - a raw dump of newly discovered sites. And I can point you to directories like personalsit.es that list blogs.

And, of course, consider having a blogroll of the sites you follow, which is all of our little way of contributing to the effort of finding each other. :)


I plan on posting some more soon; work just got crazy for a bit. And that's a good idea, I should provide links to blogs I follow!


That's exactly what it is, and Google's also incentivizing many low-quality sites to engage disproportionately in SEO to boost their Google Adsense earnings too.


A friend spoke to an SEO analyst just yesterday, and it seems the counterplay is to add "recency" to your posts.

If you have an older post that's great but not changed, it'll become less prominent. So go in, edit in some changes, and now it's fresh and ready to be indexed prominently again.

If this is how it goes, I guess it helps in a way. The articles we care about get attention and don't drop off. But there's so much of the old web we might lose in the haystack.


The first paragraph of the article mentions the story of Tim Bray [0], which is exactly about this: Google forgetting an article that did not change location.

[0]: https://www.tbray.org/ongoing/When/201x/2018/01/15/Google-is...


Yesterday I noticed that Google Scholar forgot one of my articles from 2018, on arXiv. See: https://scholar.google.com/scholar?q=arXiv%3A1811.04960 Google Scholar is not the same as Google Search, which can still find it: https://www.google.com/search?q=arXiv%3A1811.04960 (for how long, I have no idea). The article was at the same link the whole time, and arXiv is very reputable.


I also noticed that all our scholarly articles are gone from Google Scholar. The only thing there is our two highly cited books. https://scholar.google.ca/scholar?hl=en&as_sdt=0%2C5&q=site%...

We've come to rely on Google too much, so much that if you are not on Google you don't exist. That's a problem with researchers that are looking for articles to cite.


Should somebody start a site for collecting "scholar dropouts"? An article qualifies as a scholar dropout if:

- it was previously available on Google Scholar

- it cannot be retrieved, or the search on Google Scholar gives a misleading result (for example it gives another article, as explained in [1])

Please help to make a list of scholar dropouts! Thank you.

[1] https://news.ycombinator.com/item?id=19604722 HN comment with evidence

[2] https://news.ycombinator.com/item?id=19604955 HN reply with more evidence


Is this recent? In my case I noticed it yesterday.


The articles are still on Google Search though: https://www.google.com/search?q=site%3Arepo.risat.org

Fingers crossed they don't get dropped from the main index too.


I also just noticed it. No idea when the rest of the papers were dropped.


Today I sent a message to Google Scholar using this form: https://support.google.com/scholar/contact/general


>The way their algorithm most likely deals with this is a mixture of domain rank + tenure... how long has this copy of this article existed on this domain, and can we be sure this is the original copy?

This rationalization doesn't change the fact that it's increasingly hard or impossible to find certain things on Google, that they are effectively biased against certain types of websites and certain types of pages (even when the content is perfectly good), and that other search engines seem to be able to deal with these issues much better.


"Google is not forgetting the old web unless we see evidence of content disappearing from the index that have been consistently hosted at the same domain & URL since their original posts."

I can, very loosely and anecdotally, confirm.

My personal website has been online for about 20 years, and I just picked some deep strings of text and searched for them, and Google has the whole thing indexed just fine...


Google has weird rules and inconsistent indexing. I recently published a two-part article on using GraphQL/Apollo with React and Rails; it never indexed the first part (the GraphQL/Rails bit) but did index the second part (React and Apollo). And in fact, searching with "graphql rails react apollo" still doesn't show any results for this page on Google despite the page ostensibly being indexed, but it shows up on DuckDuckGo. And looking over the results on Google, only ~4 are actually relevant to the topic, so it's not like good content is being shown.


I tested it with an article on my own website from 2003.

I first posted it on

http://jeenaparadies.de/artikel/webdesign

then I had a 301 redirect there for a couple of years to

http://jeenaparadies.net/artikel/webdesign

until I stopped paying for the .de domain. About 5 years ago I made another 301 redirect to

http://paradies.jeena.net/artikel/webdesign

which is still in place. DDG finds it, but Google doesn't, and actually neither does Bing.
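A minimal Python sketch of why the dropped hop matters: a crawler walking 301s only reaches the current URL if every hop still answers. The response table below is hypothetical, mirroring the chain above, with the lapsed .de domain modeled as simply missing:

```python
def follow_redirects(url, responses, max_hops=10):
    """Walk a chain of permanent redirects, returning the hops a crawler sees.

    `responses` maps URL -> (status, location-or-None); a URL missing from
    the map models an expired domain that no longer answers at all.
    """
    chain = [url]
    for _ in range(max_hops):
        status, location = responses.get(url, (None, None))
        if status in (301, 308) and location:
            url = location
            chain.append(url)
        else:
            break
    return chain

# Hypothetical state after the .de domain lapsed: only the .net hop
# still redirects, and the final host serves the page directly.
live = {
    "http://jeenaparadies.net/artikel/webdesign":
        (301, "http://paradies.jeena.net/artikel/webdesign"),
    "http://paradies.jeena.net/artikel/webdesign": (200, None),
}

# Old links into the .de domain dead-end immediately, so a crawler
# starting there never discovers the current location.
print(follow_redirects("http://jeenaparadies.de/artikel/webdesign", live))
print(follow_redirects("http://jeenaparadies.net/artikel/webdesign", live))
```

Once the first hop in the chain expires, every inbound link and index entry pointing at it loses the trail, no matter how correct the later redirects are.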


So as with all other open systems, spam is destroying the web.


It seems like the opposite actually -- spam is destroying Google.

They're so big that it's worth blackhats spending significant resources to game their algorithm. That induced them to implement a spam filter which is now discarding the ham along with the spam.

Which means that smaller search engines that aren't being targeted by spammers are now giving better results. That is a major long-term problem for Google if they can't avoid throwing the baby out with the bathwater like this.

People only use Google because it has historically had the best results. They'll get some way on inertia now, but that doesn't last forever. They need to fix this or they're ultimately in trouble, and we could be heading for a landscape where being a search engine above a threshold size is a liability.


IME, it's not blackhats anymore causing the problem. It's (legitimate, but shady) marketing agencies and startups handling thousands of customers and with deep pockets to do SEO research.


I count those agencies as a variety of blackhat.


Blackhat is as blackhat does; it makes no difference how or why you screw users in some way by failing to be forthright and candid. If you do it constitutively, that's blackhat.


The web is fine, and search is fine. It's specifically Google search that's being destroyed by spam.

It's odd to put forward the hypothesis that DuckDuckGo is now better at search (aggregation) than Google is at search. But that seems to be where we have landed.


I think it may be a simple consequence of the fact that Google Search is increasingly less of a searching engine and more of an answering engine.

I think Google has been explicit about this (I may be wrong, but I seem to remember thinking about this because Google themselves said it). Essentially, I believe, they are no longer concerned about being a way to navigate all the material found on the internet. Instead, they are concerned with answering the question posed by each search attempt.


That's exactly it.

A few years ago they made a push to answer questions to the point it was in their product description on their "how Google search works" page. To quote it exactly, it used to say their objective is to "return timely, high-quality, on-topic, answers to people's questions."

And that's kind of the whole problem and why there is space for a search that actually returns results from the web in a clear and logical way.


In that case, they have a branding problem, and should rename Google Search to Google Answers.


> I think it may be a simple consequence of the fact that Google Search is increasingly less of a searching engine and more of an answering engine.

I've been thinking about this, and it seems very plausible to me. Which means that Google Search isn't really "search" anymore -- which explains why it's become so bad at that!

Too bad. I remember when Google had the best search engine going. It was a real game-changer. Those days are long gone.


Does Google work if my question is "what's a good article about X"? I'm willing to modify my search terms to speak Google's language.


In my experience, Google works better if you ask it questions like that. Not good enough, especially if you're looking for something specific and technical, but better.


Yeah, but how was DDG able to show "the new original"?


DDG's primary search is actually Bing IIRC.


DDG isn't only using Google, it uses other search engines too.


Mostly Bing.


They even have a tool just for this: canonical URLs. It lets websites specify which version is the source/canonical version, and avoids indexing the old copy.
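For what it's worth, the hint is just a `<link rel="canonical">` tag in the page head, which any crawler can read. A minimal Python sketch of extracting it with the standard library (the page markup here is hypothetical):

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Collect the href of the first <link rel="canonical"> tag seen."""
    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag == "link" and self.canonical is None:
            d = dict(attrs)
            if d.get("rel") == "canonical":
                self.canonical = d.get("href")

def find_canonical(html):
    parser = CanonicalFinder()
    parser.feed(html)
    return parser.canonical

page = ('<html><head>'
        '<link rel="canonical" href="https://example.com/original-post">'
        '</head></html>')
print(find_canonical(page))  # https://example.com/original-post
```

A repost can point its canonical tag at the original's URL, telling search engines which copy should be indexed and ranked.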


If the originally indexed copy no longer exists, Google shouldn't down-rank a reposted version!



