Jan. 8th, 2017

marahmarie: Sheep go to heaven, goats go to hell (Default)

ETA: after posting I did a bit more Googling and found this and this, which both say wordpress.com has turned on AMP for all users by default, while the latter page states there is indeed an opt-out (yay, because actually, I wouldn't let Google cache Anti-AOL if I could figure out how to stop that while still allowing indexing so people can find it).

ETA on the ETA: but apparently, even opting out won't remove the content right away. The fastest way I can think of is to opt out (which I did, already), then edit the page to, say, be completely blank, save it, and re-visit the AMP URL; since it fetches the latest version, that should remove the content, if not the page itself, from AMP servers. Of course I'm not going to do that, so oh, well. I have to sit back and just let someone else control it until they decide not to anymore.

What I don't like is that the AMP version of my content is hosted on Google's servers, which in this case means Wordpress is deciding where my content should live without my input or permission. While I haven't looked through a Wordpress.com ToS or EULA in forever (maybe never), I wonder if their farming out my content to Google's CDN servers violates it, because I never said they could.

Also, clean URLs, which one article I saw (can't find it) said they promised to do "when possible", my foot. Not that it matters...clean URL or dirty (and my God, the one seen below is a mess), I don't want my content on their servers. Also-also, this:

To take advantage of the Google AMP Cache, an AMP URL must be accessed directly from the cache using the AMP Cache URL format. Each time a user accesses AMP content from the cache, the content is automatically updated, and the updated version is served to the next user once the content has been cached.

Sounds like they took a page from AOL's playbook and said, "Hey, let's download the entire Internet!" except they're downloading it only after insisting the source gets rewritten by content creators or CMS owners, so they can serve ads from it even faster.

AOL literally wrote the book on "Hey, let's download the Internet!". Not original thinking. But caching the entire Internet to serve super-speedy ads from it (which is probably the entire point of AMP, let's not kid ourselves)? All Google, man, all the time - AOL missed the boat on that, completely.

Think this is resolved, for now (see ETA above) - thanks for reading!

...without my knowledge or permission, an action which tends to bother me (previous examples that have drawn my ire: afterdawn.org (one user scraped a post on Anti-AOL for a forum post) and a post about kevinshome.org* (which was inserted through JavaScript into the HTML of a post on another person's blog, which had the awesome effect of stealing my page rank for the search terms it corresponded to)). The latter reciprocated my call-out by claiming a domain name with my name in it for the next four years, apparently so I couldn't have it for myself.

Today when I logged into Wordpress.com to do some image uploading (about the only reason I log in, anymore) I checked my stats and saw a referrer from: https://cdn.ampproject.org/v/s/intoolate.wordpress.com/2009/06/30/aol-customer-service-phone-numbers-and-contact-info-updated-3-03-2010/amp/?amp_js_v=6#origin=https%3A%2F%2Fwww.google.com&exp=a4a%3A0&channelid=0&cid=1&dialog=0&prerenderSize=1&visibilityState=prerender&paddingTop=54&history=1&p2r=0&horizontalScrolling=0&csi=0&storage=1&viewerUrl=https%3A%2F%2Fwww.google.com%2Famp%2Fs%2Fintoolate.wordpress.com%2F2009%2F06%2F30%2Faol-customer-service-phone-numbers-and-contact-info-updated-3-03-2010%2Famp%2F.

Long link, huh.
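
For anyone who wants to pick a referrer like that apart, here is a minimal sketch in Python using only the standard library. It rests on a reading of the URL above, not on anything confirmed by Google or Wordpress: the path after cdn.ampproject.org is taken to be a type letter, then an optional "s" for https, then the original host and path, and everything after the "#" is taken to be a set of viewer parameters. The function name unpack_amp_referrer is made up for illustration.

```python
from urllib.parse import urlsplit, parse_qs


def unpack_amp_referrer(referrer):
    """Split an AMP cache referrer into (type letter, origin page URL, viewer params).

    Assumes the path looks like /<type>/[s/]<host>/<path>, as in the referrer above.
    """
    parts = urlsplit(referrer)
    segments = parts.path.lstrip("/").split("/")
    content_type = segments[0]  # "v" in the referrer above
    rest = segments[1:]

    # Assumption: a lone "s" segment means the origin page is served over https.
    if rest and rest[0] == "s":
        scheme, rest = "https", rest[1:]
    else:
        scheme = "http"
    origin_url = scheme + "://" + "/".join(rest)

    # The fragment is laid out like a query string, so parse_qs can decode it.
    viewer_params = {key: values[0] for key, values in parse_qs(parts.fragment).items()}
    return content_type, origin_url, viewer_params


# Abridged version of the referrer above (most fragment parameters left out).
referrer = (
    "https://cdn.ampproject.org/v/s/intoolate.wordpress.com/2009/06/30/"
    "aol-customer-service-phone-numbers-and-contact-info-updated-3-03-2010/amp/"
    "?amp_js_v=6#origin=https%3A%2F%2Fwww.google.com&visibilityState=prerender"
)
kind, origin_url, params = unpack_amp_referrer(referrer)
print(kind)
print(origin_url)
print(params)
```

Read that way, the mess boils down to the /amp/ version of the post on intoolate.wordpress.com, wrapped in a Google viewer that was prerendering it.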

To be clear, this was a referrer, which means the visitor in question (assuming, as I might, that it wasn't a bot) was coming into my blog from the Google AMP page, not exiting out to it. The page is my content, mirrored from end to end. All links in the scrape seem to point back to my blog, but still. According to DTWhois the site belongs to Google. Playing with the URL to find different pages of my blog hosted on the domain hasn't worked so far, but then again, my blood pressure is a little too high right now to play with the URL too much.

So, questions:

  1. Who exactly is scraping my page? Or is this the work of bots?
  2. All the documentation I see seems to indicate website owners create their own AMP pages. I didn't create that page. So, did Wordpress create it? [ding ding ding we have our winner] Or did Google, or someone else? If so, how do they do that without being me, and why would anyone do that?
  3. Is Google creating another entire web cache/new index out of AMP pages [yep], or is some individual doing this to some or all web content creators?
  4. How do I get this page taken down? [WP opt-out allowed]
  5. How do I find out if other pages of mine are on their servers?
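
On question 5, one rough way to go hunting is to build the cache URL you would expect for each post and then visit it. The Python sketch below only builds the URLs; it does not fetch anything. It assumes the pattern seen in the referrer holds for other posts: the AMP version lives at the post URL plus "amp/", and the cache address is cdn.ampproject.org, a type letter, "s/" for https, then the host and path. The default type letter "c" (for a plain cached document rather than the "v" viewer seen in the referrer) is also an assumption, and the function name candidate_amp_cache_url is hypothetical.

```python
from urllib.parse import urlsplit


def candidate_amp_cache_url(post_url, content_type="c"):
    """Guess the AMP cache URL for a post, following the referrer's pattern.

    This only constructs a URL; whether it resolves can only be found by visiting it.
    """
    parts = urlsplit(post_url)
    path = parts.path if parts.path.endswith("/") else parts.path + "/"

    # Assumption: the AMP version of a post lives at <post URL>/amp/.
    amp_path = path + "amp/"

    # Assumption: an "s/" segment marks an https origin, as in the referrer.
    secure = "s/" if parts.scheme == "https" else ""

    return "https://cdn.ampproject.org/" + content_type + "/" + secure + parts.netloc + amp_path


post = (
    "https://intoolate.wordpress.com/2009/06/30/"
    "aol-customer-service-phone-numbers-and-contact-info-updated-3-03-2010/"
)
print(candidate_amp_cache_url(post))       # guessed "c" form
print(candidate_amp_cache_url(post, "v"))  # form seen in the referrer, minus its query and fragment
```

A URL that fails to load proves little either way, since the cache may only hold pages that someone has already reached through Google.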

I could easily serve as the literal poster child for confusion right now.

*Edited after posting to correct info on who scraped what.