Thursday, 6 December 2012

Document similarity using Hadoop

This article describes a way to measure document similarity, which can be used to recommend similar web pages, news feeds, articles and so on from the wider web.
To begin with, we shall use Hadoop to handle the colossal volume of data.

The procedure takes each document and eliminates its stop words (stop words are very common words such as a, the, an, which and so on; a list of stop words can be obtained from here). We can then run a MapReduce job over the documents stored in the Hadoop Distributed File System (HDFS) to compute the TFIDF associated with each word in a document.
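
As a rough illustration of the step above, here is a small Python sketch that strips stop words and then counts terms per document in a map/reduce style. It runs in a single process rather than on Hadoop, and the tiny STOP_WORDS set, the function names and the two sample documents are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Tiny illustrative stop-word list (assumption); a real list is much longer.
STOP_WORDS = {"a", "an", "the", "which", "is", "of", "and"}


def tokenize(text):
    """Lower-case the text, split it into words and drop stop words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def map_phase(doc_name, text):
    """Emit ((word, document), 1) pairs, the way a mapper would."""
    for word in tokenize(text):
        yield (word, doc_name), 1


def reduce_phase(pairs):
    """Sum the counts for every (word, document) key, the way a reducer would."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)


# Hypothetical sample documents, named by URL.
docs = {
    "url1": "The cat and the hat",
    "url2": "A cat is a cat",
}
pairs = [p for name, text in docs.items() for p in map_phase(name, text)]
term_counts = reduce_phase(pairs)
print(term_counts)
```

On a real cluster the map and reduce functions would run as separate Hadoop tasks and the shuffle between them would be handled by the framework; the sketch only shows the shape of the data.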

What is TFIDF?
It is composed of TF (term frequency) and IDF (inverse document frequency).
The term frequency indicates how many times a term (word) occurs in a document. The document frequency (the DF in IDF, where IDF = 1/DF) indicates how many times the word occurs across all documents. Strictly speaking, those raw numbers are the term count and the document count for the word. To turn a count into a frequency, divide it by the total number of words: for TF, divide the term count by the total number of words in that document.

                  (number of times W occurs in document D)
TF for word W =  ------------------------------------------
                  (total number of words in document D)

and DF (whose inverse gives IDF) would be

                  (number of times W occurs in all documents)
DF for word W =  ---------------------------------------------
                  (total number of words in all documents)

So a word with a higher TF and a lower DF is more important, and can be used to characterise the document. Say a word W occurs many times in a single document D but rarely in all other documents; then W can be used statistically to identify D.

So TFIDF would be TF * log(1/DF),
where log is the logarithm to base 10.

A
+-----------+
|  w        |
|         w |
|   w       |
|         w |
+-----------+

B
+-----------+
|  w        |
|           |
|           |
|           |
+-----------+

C
+-----------+
|         w |
|           |
|           |
|           |
+-----------+

D
+-----------+
|           |
|           |
|           |
|           |
+-----------+

Consider the example above, where document A contains the word W many times while the same word is rare in the other documents. W then has a lower DF and, in A, a higher TF.
Thus the TFIDF for the word W in document A is high.
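
To make the arithmetic concrete, here is a small Python sketch that applies the TF, DF and TFIDF definitions given in this article to four toy documents shaped like A, B, C and D. The function name tfidf and the filler words p to v are assumptions for the example; note that DF here follows this article's word-count definition, not the more common "number of documents containing the word".

```python
import math
from collections import Counter


def tfidf(docs):
    """docs maps a document name to its list of words (stop words removed).

    Returns {(word, document): score} using this article's definitions:
      TF    = count of word in document / number of words in document
      DF    = count of word in all documents / number of words in all documents
      score = TF * log10(1 / DF)
    """
    total_words = sum(len(words) for words in docs.values())
    corpus_counts = Counter(w for words in docs.values() for w in words)
    scores = {}
    for name, words in docs.items():
        if not words:
            continue  # an empty document contributes no scores
        for word, count in Counter(words).items():
            tf = count / len(words)
            df = corpus_counts[word] / total_words
            scores[(word, name)] = tf * math.log10(1 / df)
    return scores


# Toy corpus: "w" is frequent in A, rare in B and C, absent from D.
docs = {
    "A": ["w", "w", "w", "w", "p", "q"],
    "B": ["w", "r", "s", "t", "u", "v"],
    "C": ["w", "r", "s", "t", "u", "v"],
    "D": ["r", "s", "t", "u", "v", "p"],
}
scores = tfidf(docs)
for doc in ("A", "B", "C"):
    print(doc, round(scores[("w", doc)], 4))
```

The score of w comes out highest for A, which matches the intuition in the example: the word is concentrated in one document.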

So what we do here is compute the TFIDF of every (word, document) pair in the corpus, set a threshold for the TFIDF, and keep only those words whose TFIDF is above the threshold. From these TFIDF values we can compute the cosine similarity between two documents.

Build a graph of the corpus in which each node is a document and the edges between nodes are weighted by cosine similarity. The higher the weight, the more similar the two nodes are.

(Doc a)------<cosine similarity(a,b)>-----(Doc b) ..etc

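Cosine similarity between two documents a and b = sum(TFIDF of a word in a * TFIDF of the same word in b) / (length of a's TFIDF vector * length of b's TFIDF vector). The sum runs over the words the two documents share; if each document's TFIDF vector is first scaled to unit length, the division can be dropped and only the sum remains.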
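
A minimal Python sketch of this step follows. It divides the sum of products by the two vector lengths, which is the usual definition of cosine similarity, and then builds the weighted edge list of the document graph. The function names, the min_weight parameter and the sample TFIDF values are assumptions for illustration.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """vec_a and vec_b each map words to TFIDF scores for one document."""
    shared = set(vec_a) & set(vec_b)
    dot = sum(vec_a[w] * vec_b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a document with no scored words is similar to nothing
    return dot / (norm_a * norm_b)


def build_graph(doc_vectors, min_weight=0.0):
    """Return {(doc_a, doc_b): weight} for every pair above min_weight."""
    names = sorted(doc_vectors)
    edges = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            weight = cosine_similarity(doc_vectors[a], doc_vectors[b])
            if weight > min_weight:
                edges[(a, b)] = weight
    return edges


# Hypothetical TFIDF vectors for three documents.
doc_vectors = {
    "url1": {"hadoop": 0.5, "cluster": 0.5},
    "url2": {"hadoop": 0.5, "cluster": 0.5},
    "url3": {"penguin": 0.9},
}
edges = build_graph(doc_vectors)
print(edges)
```

Comparing every pair of documents is quadratic in the corpus size, so for a large corpus this pairing step is itself a candidate for a MapReduce job.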

Use a graph clustering technique to cluster similar documents. We shall implement
this paper for clustering.

The technique involves moving about the graph at random and initialising the nodes with random labels. We then set a threshold on the weight of the edges between nodes, that is, we remove weak links. Next, for each node, we repeatedly check which label the majority of its neighbours carry and assign that label to the node. We carry on like this until the labels assigned to the nodes no longer change.
Most importantly, this method identifies the number of clusters automatically, unlike k-means where we have to choose k. In other words, k is determined by the algorithm, not by us.
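
The label-propagation loop described above can be sketched in Python roughly as follows. This is not the script referred to later in the article; it is a simplified illustration that assumes each node starts with its own unique label (rather than a random one), visits nodes in sorted order so the result is repeatable, and lets neighbours vote with their edge weights. The function name, the threshold and max_rounds parameters and the sample edges are all made up for the example.

```python
from collections import defaultdict


def majorclust(edges, threshold=0.0, max_rounds=100):
    """edges maps (node_a, node_b) to a similarity weight.

    Returns {node: label}. Edges at or below threshold are dropped first.
    """
    neighbours = defaultdict(dict)
    nodes = set()
    for (a, b), weight in edges.items():
        nodes.update((a, b))
        if weight > threshold:
            neighbours[a][b] = weight
            neighbours[b][a] = weight

    # Every node starts in its own cluster.
    labels = {node: i for i, node in enumerate(sorted(nodes))}

    for _ in range(max_rounds):
        changed = False
        for node in sorted(nodes):
            # Add up the edge weights pointing at each neighbouring label.
            votes = defaultdict(float)
            for other, weight in neighbours[node].items():
                votes[labels[other]] += weight
            if not votes:
                continue  # an isolated node keeps its own label
            best = max(votes.values())
            winners = sorted(l for l, v in votes.items() if v == best)
            # Keep the current label on a tie to avoid flip-flopping.
            new_label = labels[node] if labels[node] in winners else winners[0]
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break  # labels are stable, clustering is done
    return labels


# Hypothetical similarity graph: two tight groups joined by one weak link.
edges = {
    ("url1", "url2"): 0.9,
    ("url2", "url3"): 0.8,
    ("url1", "url3"): 0.85,
    ("url4", "url5"): 0.7,
    ("url3", "url4"): 0.05,
}
labels = majorclust(edges, threshold=0.1)
print(labels)
```

Nothing in the call says how many clusters to produce; the two groups fall out of the edge weights and the threshold alone.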

Get the top tfidf keywords for the documents (using Hadoop)

Use majorclust to cluster the documents

When this is done, use a PHP script or similar to turn this data into document/webpage relevance.
Say the user chooses or views a certain document/webpage; other documents from a similar context can then be offered as relevant.
Here each document's name is a URL, as in url1, url2 and so on.
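
The lookup itself is simple once every URL has a cluster label. The sketch below uses Python rather than PHP to stay consistent with the clustering script; the function name similar_documents and the sample labels are assumptions for illustration.

```python
def similar_documents(viewed, labels):
    """Return the other documents that share a cluster label with `viewed`.

    labels maps a document name (a URL here) to its cluster label.
    """
    if viewed not in labels:
        return []  # unknown page: nothing to recommend
    target = labels[viewed]
    return sorted(doc for doc, label in labels.items()
                  if label == target and doc != viewed)


# Hypothetical clustering output.
labels = {"url1": 1, "url2": 1, "url3": 1, "url4": 4, "url5": 4}
recommended = similar_documents("url2", labels)
print(recommended)
```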

Get the TFIDF Hadoop implementation here, then edit that code:
1. Add more stop words (use this). Look at the WordFrequenceInDocument class and observe the googleStopwords.add("word"); calls. Take the list of stop words in the given file and add them there.
2. Write the output to a file and parse it into the file format shown in the pics earlier, so that the Python script can be run on it as is.

MajorClust for clustering the documents is implemented using this Python script (taken from a Stack Overflow response and edited as required); see here.



To illustrate the entire process:
And the architecture for implementation would be:

Sunday, 11 November 2012

How to make a web server

Any embedded device can be connected to the cloud and provide its own service. So, after choosing your hardware, such as an ARM kit or even an FPGA, the simplest method is to port an operating system onto it.

Check link for porting Linux on ARM9
Check link for porting FreeRTOS on ARM7
Check link for porting RT-Linux on PowerPC (FPGA)


Once you have ported an OS, you have to enable its Ethernet facilities.

Then you can run a socket-based program (check tutorial). With this program running, all you have to do is access your machine from any other PC connected to the same Ethernet network by specifying its IP address.
That's it: when a client connects to the server, it can execute any task defined by your program. (Note that the socket only provides the means of connection; the program must also process the data.)

If I define the port as 80 and provide the service of handling web pages or HTML content, then I have developed an HTTP server.
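
As a hedged illustration of that idea, here is a minimal socket-based HTTP server in Python together with a client that fetches one page from it. It is a sketch, not production code: it answers every request with the same fixed page, it binds to an OS-chosen free port on 127.0.0.1 instead of port 80 (which normally needs root privileges), and all names in it are made up for the example. On a board, C would be the more likely language, but the socket calls follow the same sequence.

```python
import socket
import threading

PAGE = b"<html><body><h1>Hello from the board</h1></body></html>"


def handle_client(conn):
    """Read one request and answer every path with the same small page."""
    with conn:
        conn.recv(4096)  # the request is read but not parsed in this sketch
        header = (
            "HTTP/1.1 200 OK\r\n"
            "Content-Type: text/html\r\n"
            f"Content-Length: {len(PAGE)}\r\n"
            "Connection: close\r\n\r\n"
        )
        conn.sendall(header.encode("ascii") + PAGE)


def serve(server, max_requests):
    """Accept max_requests connections, then stop."""
    for _ in range(max_requests):
        conn, _addr = server.accept()
        handle_client(conn)


# Port 0 asks the OS for any free port; a real web server would bind port 80.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]

worker = threading.Thread(target=serve, args=(server, 1), daemon=True)
worker.start()

# Another machine on the network would connect to the board's IP address here.
with socket.create_connection(("127.0.0.1", port), timeout=5) as client:
    client.sendall(b"GET / HTTP/1.1\r\nHost: 127.0.0.1\r\n\r\n")
    reply = b""
    while True:
        chunk = client.recv(4096)
        if not chunk:
            break  # the server closed the connection after replying
        reply += chunk

worker.join(timeout=5)
server.close()
print(reply.decode("ascii"))
```

The socket calls only move bytes; what makes this an HTTP server is that the program answers in the HTTP format, which is the point made in the note above.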

For an FTP server (check sourcecodes) you have to use port 21.


Porting FreeRTOS on ARM7

Start by downloading a FreeRTOS kernel, in this case V5.3.0, but feel free to use your favourite version from here; it is completely free. The ARM7 board used is the KEIL MCB2100.

We shall also need Flash Magic from here and the Keil uVision Integrated Development Environment.

Start the uVision IDE (I have used uVision 4).

Open the project.
Select the RTOS Demo from the FreeRTOS folder.
Go to Project in uVision and select the device.
Then open the options for the target.
Here you may have to set the crystal frequency and other options (check the data sheet for your ARM); in my case the frequency is 12 MHz.
In the Output tab, select Create HEX File.

Now choose Build Target in the Project tab. This creates the hex file in your directory.
Start Flash Magic
     Communication settings
        Select Device (mine is LPC2129)
        COM port: COM1 (or another)
        Baud rate: 115200 (your choice)
        Interface: none
        Oscillator: 12 (MHz)

Erase All Flash

Select Hex File

Set "Verify after programming"

Start it!

That's it, FreeRTOS is on the ARM7, and you are greeted by blinking LEDs.
Before generating the HEX file you can edit the FreeRTOS source code. You can even define your own task in its main program, so you can run just about any application.

Monday, 15 October 2012

Security in mobile computing

I present here a report on security in mobile computing, written after an in-depth analysis of several research papers.
A summary is available as a PDF for free download here.

In the given PDF, consider trust-based security. A detailed explanation is presented below.


"John wants to access services such as a printer but is not authorised, so he asks Susan by sending a request and his credentials. Susan sends a certificate and may impose constraints on the request (say, time restrictions on access to the printer). This data is then sent to the security agent, which services the request.
This can be used to secure smart spaces.

A role is assigned to each user (software or individual); the role is associated with security policies and authority.
There is a hierarchical layer of security agents.
An authorised user can delegate authority to another user by a signed assertion; these signed assertions are validated by the agents and the request is serviced.
Delegation chain: if an entity delegates authority to a malicious module, it can lose its delegation capability.
"

Now for the context-centric security

"The user application has a corresponding mobile agent (MA) proxy. The MA is software with data that can move from one environment to another; it can thus save the session state and move between networks, allowing the user to roam and connect seamlessly to different networks.
Ubiquitous context-based security middleware (Ubicosm) allows the user or service provider to specify security capabilities and requirements as metadata.

Ubicosm also appraises the resource availability and state in the new environment.
Together with the metadata, it provides a visible window comprising only those resources which the MA can access and which are validated. The metadata also allows the MA to identify other MAs with the required security and other features, and to interact with them.
Changing security requirements are reflected as changes in the metadata, with no change in the code.
"

Sunday, 14 October 2012

Porting Linux on ARM 9

We shall consider the TS-7300 ARM9 board from Technologic Systems. The TS-7300 allows us to port using an SD card.

Problems of porting Linux on ARM7 LPC2100
Please note that porting Linux onto boards like the ARM7 MCB2100 is not feasible with generic methods, as Linux relies on a memory management unit (MMU), which is missing on that board (MCB2100).
Additionally, fitting the kernel image into the flash, which typically has a size of 256 KB, is not possible for Linux either. Pretty bulky!

To begin porting on TS-7300

First we shall partition the SD card
1. Download the SD image from here.
2. Extract the SD image and insert the SD card into an SD card reader (you can use the card reader built into your system, as found in most laptops, or you may have to acquire one).
3. Format the SD card; this will delete all partitions (back up any data already on it if required). For formatting, try the GParted tool; install it with:
$>sudo apt-get install gparted
4. Identify the SD card in your system.
$>sudo fdisk -l
lists the partitions on your machine. The SD card shows up as /dev/sdxx, where xx stands for the letter and number assigned to it.
5. A bootable SD card can then be created using
$>sudo dd if=/path_to_SDimage/sdimage.dd of=/dev/sdxx
Note that the dd command is a Linux nuke: it overwrites whatever disk is given in the of= field without asking, so if you do not specify the correct disk you may wipe your system disk. Happened to me! In case you did not heed the warning, not to worry, check this.
6. Minicom is something like HyperTerminal in Windows; it lets you work with the serial port. To install it:
$>sudo apt-get install minicom
7. Insert the SD card into the TS-7300 and connect the TS-7300 to your system using an RS232 cable. A serial-to-USB adaptor may be required if your system does not have a DB connector for the serial port.
8. Switch on the TS-7300 and use minicom to access the serial port
$> minicom -s
shows a configuration window

To provide the appropriate serial port, you have to find which serial port is connected to the TS-7300. For that,
$>dmesg | grep ttyUSB
lists all the USB serial terminals, in case you are using the serial-to-USB adaptor. However, if you are using a plain serial cable directly, then
$>dmesg | grep ttyS
lists the serial ports. Identify your serial port and enter it in the Serial Device field of minicom. For example, the screen grab above indicates /dev/ttyUSB0 as the USB port to which the serial cable is interfaced.
Also set the desired data rate.
9. Choosing Exit in minicom establishes the link between the host machine and the board. At this point you may see the board booting up!
Enter
$> uname -r
in the minicom terminal; this shows the version of the kernel on the board. We have ported an OS!!
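
If you prefer not to read through dmesg output in step 8, the candidate serial devices can also be listed with a few lines of Python on the host machine. This is only a convenience sketch: the function name is made up, and the two device patterns are assumptions based on the /dev/ttyUSB and /dev/ttyS names mentioned above.

```python
import glob


def candidate_serial_ports(patterns=("/dev/ttyUSB*", "/dev/ttyS*")):
    """Return the device files that match the usual Linux serial port names."""
    ports = []
    for pattern in patterns:
        ports.extend(sorted(glob.glob(pattern)))
    return ports


ports = candidate_serial_ports()
print(ports)
```

A device showing up in this list only means the device file exists; dmesg is still the way to confirm which one belongs to the adaptor you just plugged in.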

This link is very useful for developers; the repository also provides various flavours of OS and tools.

A filesystem may be needed to develop device drivers. If a filesystem is not already provided by the SD image, one needs to be created after partitioning the SD card.

The C program which provides the functionality needs to be cross-compiled and placed in user space, i.e. the filesystem, by copying the cross-compiled executable (cross-compiling the C code produces an executable) onto the SD card.

Acquire the cross compiler by using
$>wget http://code.google.com/p/princess-alist/downloads/arm-linux-gcc-4.3.2.tgz

Extract it at /usr/local/bin and set the environment variables with
$>export PATH=$PATH:/usr/local/bin/4.1.1-920t/bin/
$>export ARCH=arm

$>export CROSS_COMPILE=arm-linux-

Cross compile by
$>arm-linux-gcc -o objectname filename.c

Transfer the objectname file to the SD card; after booting, you can execute it on the ARM9 using ./objectname

Check out an LED driver, link
 
This is sufficient to get you started on ARM!

We have just ARMed the Penguin !

PS: If you face any problems, describe them in the comment field.