So you have Hadoop up and running and data loaded. Your data scientists and softwareRead More
The hard work done by SEO you probably didn’t know about
You‘ve probably heard of search engine optimisation (SEO) but do you know how it works? I’ll discuss one of the techniques used and show an example. Note that some mathematical concepts are mentioned but I’ve tried to keep that to a minimum!
If I were to search for “lasagne”, it’s likely that I’m looking for a recipe and through a technique called latent semantic analysis, search engines know this. What exactly makes it know this though?
Latent semantic indexing (LSI) is just one of the techniques used SEO from the field of natural language processing. It was introduced after people tried to cheat ranking algorithms like Google’s page-rank by filling their web pages with an abundance of popular keywords. But too many keywords with little relation actually don’t produce good content and will be ignored by LSI.
Mathematically, it’s based on producing a term-document matrix to represent the data and dissecting it to uncover hidden attributes. The “term-document” matrix basically sums the words that appear in a set of documents. For our SEO topic let’s take some text extracts from two websites:
extract1 = “The ultimate homemade lasagne made with beef and pork.”
extract2 = “A classic beef lasagne recipe made to perfection.”
extract3 = “Romeo and Juliet”
extract4 = “Juliet: O happy dagger!”
extract5 = “Romeo died by the dagger”
The first two were taken from a recipe website and the other three were from another which had Romeo and Juliet quotes. The term term-document matrix for this would be:
|Term||Extract 1||Extract 2||Extract 3||Extract 4||Extract 5|
Where the rows represent the terms, the columns the documents and the values are the number of times a word appears in a document. Common words like “the”, “and” etc. are generally excluded.
A column of this matrix is known as a vector and what we want to find is the unique set of these vectors (AKA the basis) that can represent all the other vectors in our matrix. For example:
|Term||Extract 1||Extract 2|
You can represent the document 2 vector by multiplying document 1 by 2 so there’s no need to classify these separately.
So why do we want these? Well they’re useful for classifying traits. Take a (probably) more relatable example of Netflix viewership between 2 groups:
|Term||Group 1 (Europe)||Group 2 (USA)|
|House of cards||0||12|
|Rick and Morty||7||0|
Where the numbers represent how many in a group have watched that series. Their viewing behaviour has no overlap and we can define two different viewer profiles from this. i.e. Netflix wants to know the types of viewers it has. In our SEO word/document example we want to know which words belong to a particular website.
We could get one of these profiles for each website meaning that they’re all very different but usually it reduces to a few especially with a large number of websites. The examples above are fairly small and easy to visually inspect. To be more rigorous with larger matrices, in mathematics there’s a technique called singular value decomposition (SVD) that can help us.
Cutting it with SVD
Rather than inspecting everything in one large matrix, SVD separates it into three components each defining a unique relationship. If you’re familiar time series analysis, think of it as decomposing a time series into the trend, seasonal and remaining noise components.
Check the python code for the SVD results but they are of the form:
M = U * S * VT (where M is the original matrix)
|Features||Term 1||Term 2||…||Term n|
The U matrix shows the relationship between the words and the features that we’re trying to uncover.
|Features||feature 1||feature 2||…||feature n|
The S shows the strength of the features hence why only the diagonal values are populated with non-zeroes. This is where you’d see those keyword filling websites failing because their behaviour of “a bit of everything” is probably not a very common occurrence i.e. not a very strong feature.
|Features||Extract 1||Extract 2||…||Extract n|
Lastly the Vt matrix showing the hidden feature to website relationship.
Removing the noise
Let’s look at the strength of a our features in the matrix S. The idea is that if the feature is not strong enough then it’s probably fine to omit it e.g. it could be an anomaly, an extra website that got included unexpectedly. Generally speaking you’re looking for a sharp drop in this diagonal set of values as they’re ordered, to cut off the noise.
Our S is:
|Features||feature 1||feature 2||feature 3||feature 4||feature 5|
In our case I’d say the values after the second one are too small to be significant. Now with this reduction we can represent the terms and documents as their own vectors (you’ll notice that the SVD result defines these in relation to the features). This can be done by multiplying the strength of the features S by either the term or document vectors:
U2 * S2
for the term vectors and:
S2 * VT2
for the document vectors. Where the 2 is referring to the features we’ve kept.
Grouping and SEO
So what can we do with these? Well we can now geometrically plot these and see how a search query can relate to it. For example if we query the words
query = (word_1 + word_2)/2
The division is to determine the center or centroid of the two words as the query for them both is going to sit between their individual coordinates. In our case I search for “lasagne” and “recipe”.
Plotting this gives:
We can see that our query is likely to fall near the extract 1 and 2 which is what you would expect if you’re looking for a lasagne recipe. This not only applies to the context of the webpage and overall site but also to the links. Examining the links (called slugs) first rather than the whole website would severely decrease the size of the matrix.
If you look at the link for this blog post you’ll see that it contains keywords only and doesn’t exactly match the title I’ve given it, that’s basically removing the common words and only putting the focus words which it would count for the matrix.
And that’s the gist of it.
You could apply a clustering algorithm (e.g. k-means for distance based metrics and expectation maximisation for probability distribution approaches) to these vectors would establish the groups more definitively especially with more complicated data.
This can be further developed by looking for synonyms. For example SEO will know that coriander and cilantro refer to the same thing and if I search for either, I should get the same results. To do this we multiply our reduced SVD results to get a reduced version of the original matrix:
M = U2 * S2 * VT2
Now we multiply this by itself flipped on its side. This returns the relationships between terms over the documents where high values indicate a relationship of some sort:
M * M2
You can certainly find more material on the web to do more sophisticated things but I’ve linked my python code give you a head start!