To begin with we shall be using hadoop for handling the colossal data.
The procedure involves considering each document and eliminating its stop words,(stop words are redundant words such as a,the,an,which etc..A list of stop words can be obtained from here) .Then we can use map reduce algorithm in hadoop distributed file system to identify the TFIDF associated with each word in a document
What is TFIDF ?
Well its composed of TF (Term frequency)and IDF(Inverse Document frequency)
The term frequency indicates the how many times a term(word ) occurs in a document.And the document frequency (DF in IDF, Also IDF=1/DF) indicates how many times the word occurs in the document.. However actually they give you the term count and document count for the word.To compute the frequency just divide the term count by total number of words in that document
(number of times W occurs in document D)
OR TF for word W= ------------------------------------------------------------------
(Total number of words in document D)
and IDF would be
(number of times W occurs in all documents)
DF = ------------------------------------------------------------------
(Total number of words in all documents)
So you can agree that higher the TF and lower the DF ,the word is important and can be used to analyze the characteristics of the document.Say a word w which occurs many times in a single document D but less in all other documents then this word can be used statistically to identify D.
So TFIDF would be TF * log(1/DF)
Log is logarithm to base 10
_
A_________
| w |
| w |
| w |
|_______w|
B__________
| w |
| |
| |
|_______|
C________
| w|
| |
| |
|_______|
D__________
| |
| |
| |
|_______|
Consider the above example where document A has a lot of W words the same W is rarely present in other document then it has a lower DF, and higher TF
Thus the TFIDF for the word W in document A is high.
So what we do here is identify the tfidf of all the word|document in corpus .Set a threshold for the TFIDF and consider those words whose tfidf is above the threshold.We can identify the cosine similarity between two documents from these TFIDF
Build a graph of the corpus wherin each node is a document and the edge between them are weighted based on cosine similarity .Higher the weight more similar are the nodes.
(Doc a)------<cosine similarity(a,b)>-----(Doc b) ..etc
Cosine-similarity between two documents a,b= sum ( TFIDF of a * TFIDF b)
Use graph clustering technique to cluster similar documents.We shall implement
this paper for clustering.
The technique involves randomly move about the graph initially and intialise the nodes by random labels , then we shall set a threshold on the weight of the edge between nodes , or we remove weak links.And we shall recursively analyze for a given node what label does majority of its neighbor are pointing to.We shall then assign this label to the node, we go on about like this until the labels assigned to the nodes don't change.
Most importantly this method automatically identifies the cluster size unlike k-means where we choose k. OR in this case k is identified by the algorithm not us.
Get the top tfidf keywords for the documents (using Hadoop)
Use majorclust to cluster the documents
When this is done use a php script or so. To use this data to provide the document/webpage relevance.
Here each document's name is a url as in url1,url2 etc
Get the tfidf hadoop implementation here, Edit this particular code
1. Add more stop words use this , Look at WordFrequenceInDocument class and observe googleStopwords.add("word");.. use the list of stopwords in the given file and copy it here.
2.Write the output to a file and parse it to obtain the file format as shown in the pics earlier so the python script can be run as such
Majorclust for clustering docs are implemented using this python script,(got it from stackoverflow response and edited as per requirement) see here
To illustrate the entire process:
And the architecture for implementation would be: