
Clone wars: Any unique value here? or just duplicates? (old, probably outdated)

Warning: This page is pretty old and was imported from another website. Take any content here with a grain of salt.

1000 monkeys at typewriters might come up with Shakespeare - but it’s much easier to just copy + paste. Duplicate content is a fact of the web - it’s almost impossible to avoid it within your own website and even worse when someone else decides to play “monkey at a typewriter” and copies your work.

Google handles duplicate content (including partially duplicated content) with a smart algorithm: it picks the “best” version (in its opinion) and mostly ignores the rest. There is generally no penalty for duplicate content, but you won’t get much value out of it either. If it’s your content that’s duplicated (on your site or elsewhere), then you can only hope that Google chooses the right page to display.

Either that or you actively tell Google which page you want displayed.

Adam Lasnik gives us some great “common sense” information (archive.org) on how to nudge Google toward making the right choice, but how do we know whether we have to deal with it at all? Do you know which of your pages are related and could be identified as duplicates? Do you know if your page has been duplicated by other people, even just snippets of it?

A simple yet effective way to check for duplicate content uses a technique we have all used – take a line of text and search for it. The fancy name for it is “shingle analysis”: the text is broken into overlapping pieces, each with the same number of words.

For example, “mary had a little lamb” could be broken up into 3-word shingles:

  • mary had a
  • had a little
  • a little lamb

Each of those shingles can be used to search within a known group of documents (eg your own site) or the whole web. With 3-word shingles it is of course hard to find real duplicates; they match many different documents and texts. We can target much better with 10-15 word shingles (10 to 15 consecutive words from the document). If a single 10-word shingle matches, we can assume the text was most probably copied. If 80% of our 10-word shingles match, we can assume the pages share content.
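To make this concrete, here’s a minimal sketch of shingle generation and matching in Python. This is my own illustration, not the code behind the tools mentioned below, and the lowercased whitespace tokenization is an assumption – a real tool would also normalize punctuation.

```python
# A minimal sketch of shingle analysis. Tokenization is a simple
# lowercased whitespace split (an assumption, not how any specific
# tool does it).

def shingles(text, size=10):
    """Return the set of all runs of `size` consecutive words."""
    words = text.lower().split()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}

def overlap(text_a, text_b, size=10):
    """Fraction of text_a's shingles that also appear in text_b."""
    a, b = shingles(text_a, size), shingles(text_b, size)
    return len(a & b) / len(a) if a else 0.0
```

With a size of 3, `shingles("mary had a little lamb", 3)` returns exactly the three shingles listed above; an `overlap` of 0.8 with 10-word shingles would be the “pages share content” case.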

There are two neat things we can do with shingle analysis:

1. Find duplicate content within a suspected group of pages (within a site or for other pages)

A simple shingle analysis tool (archive.org) from my site at oy-oy.eu lets you compare up to 10 URLs and displays the results in a grid.

A sample report could look like this:

In this example I took a (great) blog: its main page, which displays snippets of the posts, and a few of the current posts. It’s easy to see where one of the posts was shown in full on the main page. The problem is that the main page will almost always have more value than the post page, leaving the main page ranking for the content of the post (instead of the post itself). The content on the main page will of course change over time, so a visitor who wants the post may be shown the main page instead (without the post being listed anymore). What the visitor wants to see is the post, not the main page. A better strategy is to make sure that a post is never shown in full on the main page – always use snippets.
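For completeness, here’s a sketch of how such a pairwise comparison could work, reusing the `shingles()` helper from the sketch above. The documents are passed in as plain text here; the real tool fetches the URLs and strips the HTML first.

```python
# A sketch of a pairwise comparison, reusing shingles() from the
# earlier sketch. Input is {name: plain text}; URL fetching and HTML
# stripping are left out for brevity.
from itertools import combinations

def compare_documents(docs, size=10):
    """Print the shared-shingle percentage for every pair of documents."""
    sets = {name: shingles(text, size) for name, text in docs.items()}
    for a, b in combinations(sets, 2):
        shared = sets[a] & sets[b]
        smaller = min(len(sets[a]), len(sets[b])) or 1
        print(f"{a} <-> {b}: {len(shared) / smaller:.0%} shared shingles")
```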

2. Find duplicated content online

Good content is hard to find (and easy to copy). The simplest way to find duplicates is - as mentioned above - to manually search for snippets from your pages. However, that can take a bit of work and you never know if you’re searching for the right snippet.

I created another simple tool to find duplicate content through the Google or Yahoo search APIs. The Google version (archive.org) requires a Google SOAP search API key, the Yahoo version (archive.org) doesn’t require anything additional. (Both seem to have short outages where the search engine times out, and they take a while to check the content.)

It does a shingle-level search for the content you specify (either a URL or just a text snippet) and identifies exact or near duplicates based on 12-word shingles.

Just enter your URL or the text, enter your own domain name (to exclude it from the tests) and let it run. It will create a report with 4 blocks (a rough sketch of the logic follows the list):

A short summary of the shingles tested and their uniqueness.

URLs which share content with the content we tested. These are the URLs you need to check if you want to enforce your content ownership (eg with a DMCA request (archive.org)).

The same information on a domain name basis (in case several URLs from the same domain were copying the content).

The content itself marked up to display unique (green) or duplicated (red) shingles.
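As promised, here’s a rough sketch of the underlying logic. `search_phrase()` is a hypothetical stand-in for a search API that takes an exact-phrase query and returns result URLs – the original tools used the Google SOAP and Yahoo search APIs, and this is not how either of them was actually called.

```python
# A rough sketch of the report logic. search_phrase() is a hypothetical
# callable that takes an exact-phrase query and returns result URLs;
# it stands in for whatever search API is available.
from collections import Counter
from urllib.parse import urlparse

def duplicate_report(text, own_domain, search_phrase, size=12):
    """Check each shingle against a search API and summarize the matches."""
    all_shingles = shingles(text, size)  # helper from the earlier sketch
    duplicated, found_urls = set(), []
    for shingle in all_shingles:
        urls = [u for u in search_phrase(f'"{shingle}"')
                if urlparse(u).netloc != own_domain]  # exclude our own site
        if urls:
            duplicated.add(shingle)
            found_urls.extend(urls)
    # Block 1: summary of the shingles tested and their uniqueness.
    total = len(all_shingles) or 1
    unique_pct = 100 * (1 - len(duplicated) / total)
    print(f"{len(duplicated)} of {total} shingles found elsewhere "
          f"({unique_pct:.0f}% unique)")
    # Block 2: the URLs that share content with the tested content.
    for url in sorted(set(found_urls)):
        print(url)
    # Block 3: the same information rolled up per domain.
    for domain, hits in Counter(urlparse(u).netloc for u in found_urls).most_common():
        print(f"{domain}: {hits} matching shingles")
    # Block 4 would mark these shingles up in the original text.
    return duplicated
```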

There are two main uses for this tool:

  1. determine how unique the content on a page is (was it a good idea or has everyone else already done the same?)
  2. find sites that have copied or quoted your content (pick popular pages)

Additionally, it can be used on other people’s sites as follows:

  • determining the “value” of an affiliate page - is it using unique snippets or is it just copying the same feed as everyone else?
  • determining the uniqueness of a page - is it providing unique value?

Adam Lasnik mentions (archive.org) the following with regard to the “+30 penalty”:

Is my site providing unique and compelling content?

Would most consumers find my site to be more useful than others in this space?

To me, that sounds like a simple application of shingle analysis on Google’s side. Is the page unique - how many non-unique shingles does it have? Does the page provide additional value - how many unique shingles does it have?
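If that guess is right, the scoring could be as simple as the following sketch – pure speculation on my part, again using the `shingles()` helper from above:

```python
# Pure speculation: a uniqueness score in the spirit of the questions
# above. `duplicated` is the set of shingles already found elsewhere
# (eg as returned by duplicate_report() in the sketch above).
def uniqueness_score(text, duplicated, size=12):
    """Fraction of a page's shingles that appear nowhere else."""
    own = shingles(text, size)
    return len(own - duplicated) / len(own) if own else 0.0
```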


Comments / questions

There's currently no commenting functionality here. If you'd like to comment, please use Mastodon and mention me ( @hi@johnmu.com ) there. Thanks!
