

From Corpora to Matching


Making effective use of the Internet is increasingly about creating better and more intelligent applications and search engines. Here is a brief introduction to how search engines work:

01) Define the corpus, search space/data;
02) Separate the corpus into documents;
03) Generate features for each document;
04) Generate a representation of each document;
05) Study the feature/vector space;
06) Cluster documents;
07) Reduce dimensionality;
08) Accept input queries;
09) Find the cosine of the angle between each document vector and the query vector;
10) Find the sought document column, i.e. the one closest to the query;
11) Output results to the user in some way.

Each document in a corpus (database) is described by a set of keywords called index terms. We assign weights to index terms according to their relevance (frequency of occurrence, for instance). This is how we create the index that we can then search.

Corpus preparation:
Web pages of interest are analysed and cleaned by removing hypertext tags and any other markup. Pages are then broken down into documents, and each document is scanned for words/terms of interest: those which make a document unique, not standard words.
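
A minimal sketch of this cleaning step, using only the Python standard library. The stop-word list, the class name TextExtractor and the function name clean_page are illustrative assumptions, not part of any particular search engine; a real system would also stem the terms and use a far longer stop-word list.

```python
from html.parser import HTMLParser
import re

# Hypothetical stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "on", "for"}


class TextExtractor(HTMLParser):
    """Collects the text between tags and discards the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def clean_page(html):
    """Strip markup and return lower-case terms with stop words removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    words = re.findall(r"[a-z]+", text.lower())
    return [word for word in words if word not in STOP_WORDS]


page = "<html><body><h1>The Wandle</h1><p>Mills on the river Wandle.</p></body></html>"
terms = clean_page(page)
print(terms)
```

The resulting list of terms is what the later steps count and weight.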

Extract terms of interest:
Bear in mind that terms of interest must be invariant, that is, characteristic of a document rather than generic and easy to find in any corpus/document. The idea is to find a signature for each document.

Build term-by-document matrix:
The search space has N dimensions, one per chosen term/feature, and each document is a point in this N-term space. This allows conceptual/semantic searches.

Each document becomes a column vector and each row represents a term. Each row records the frequency of a term across the analysed corpus. At first we simply build the matrix by counting the terms in each document.
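
As a rough illustration, the counting step might look like the sketch below. The function name build_term_document_matrix and the three tiny documents are made up for the example; the documents are assumed to be already cleaned and split into terms.

```python
def build_term_document_matrix(documents):
    """Return (terms, matrix) where matrix[i][j] counts term i in document j."""
    terms = sorted({word for doc in documents for word in doc})
    row_of = {term: i for i, term in enumerate(terms)}
    matrix = [[0] * len(documents) for _ in terms]
    for j, doc in enumerate(documents):
        for word in doc:
            matrix[row_of[word]][j] += 1
    return terms, matrix


# Hypothetical documents, already reduced to their terms of interest.
docs = [
    ["river", "mill", "river"],
    ["mill", "museum"],
    ["museum", "river", "heritage"],
]
terms, matrix = build_term_document_matrix(docs)
for term, row in zip(terms, matrix):
    print(f"{term:10s} {row}")
```

Each printed row is one term and each position in the row is one document, so reading down a position gives that document's column vector.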

Compress the matrix:
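There are two basic techniques: Compressed Row Storage (scans the matrix row by row) and Compressed Column Storage (scans the matrix column by column). Both use three arrays: the non-zero values, their column (or row) indices, and pointers marking where each row (or column) starts.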
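
The row-by-row variant can be sketched as follows. The function names to_crs and crs_get and the array names values, col_index and row_ptr are assumptions chosen for readability; the column-by-column variant is the same idea with rows and columns swapped.

```python
def to_crs(matrix):
    """Compress a dense matrix row by row into three arrays."""
    values, col_index, row_ptr = [], [], [0]
    for row in matrix:
        for j, value in enumerate(row):
            if value != 0:
                values.append(value)      # the non-zero entry itself
                col_index.append(j)       # the column it came from
        row_ptr.append(len(values))       # where the next row starts
    return values, col_index, row_ptr


def crs_get(values, col_index, row_ptr, i, j):
    """Look up entry (i, j) in the compressed form."""
    for k in range(row_ptr[i], row_ptr[i + 1]):
        if col_index[k] == j:
            return values[k]
    return 0


dense = [
    [0, 0, 1],
    [1, 1, 0],
    [0, 1, 1],
    [2, 0, 1],
]
values, col_index, row_ptr = to_crs(dense)
print(values, col_index, row_ptr)
```

The saving grows with the size of the corpus, because most terms do not occur in most documents and only the non-zero counts are stored.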

Normalise the matrix:
Normalisation means transforming the column vectors into unit vectors, i.e. vectors of unit length.

Unit document vectors contain the frequencies of terms rescaled to unit length. The normalisation is applied because the semantic content of a document is generally determined by the relative frequency of terms.
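
A small sketch of column normalisation, assuming the usual Euclidean length (the text does not say which norm is meant). The function name normalise_columns and the sample counts are illustrative only.

```python
import math


def normalise_columns(matrix):
    """Scale every column of a term-by-document matrix to unit Euclidean length."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    result = [[0.0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        length = math.sqrt(sum(matrix[i][j] ** 2 for i in range(n_rows)))
        if length == 0:
            continue  # a document with no terms stays a zero vector
        for i in range(n_rows):
            result[i][j] = matrix[i][j] / length
    return result


# Two terms (rows) and two documents (columns); the counts are made up.
counts = [
    [3, 0],
    [4, 2],
]
unit = normalise_columns(counts)
print(unit)
```

After this step a long document and a short one with the same proportions of terms end up as the same vector, which is the point of normalising.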

Singular Value Decomposition:
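This factors a matrix into three matrices. Two are orthogonal and hold the singular vectors: the new dimensions. The third is diagonal and holds the singular values, that is, the spread of the corpus along these new dimensions. For a symmetric, positive semi-definite matrix (such as the term-by-document matrix multiplied by its own transpose) the two orthogonal matrices are identical, their columns are the eigenvectors, and the diagonal entries are the eigenvalues.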
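
In standard notation the decomposition can be written as below. The symbols A (the term-by-document matrix), U, V, the diagonal matrix Sigma and the rank r are the usual textbook names and are assumed here; they do not appear in the text above. The second line shows why eigenvectors and eigenvalues come into play: the symmetric products of A with its transpose share their eigenvectors with the singular vectors of A.

```latex
A = U \Sigma V^{\mathsf{T}}, \qquad
U^{\mathsf{T}} U = I, \quad V^{\mathsf{T}} V = I, \quad
\Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_r), \quad
\sigma_1 \ge \dots \ge \sigma_r > 0

A A^{\mathsf{T}} = U \Sigma^{2} U^{\mathsf{T}}, \qquad
A^{\mathsf{T}} A = V \Sigma^{2} V^{\mathsf{T}}
```

So the columns of U are eigenvectors of A A^T, the columns of V are eigenvectors of A^T A, and the eigenvalues of both products are the squared singular values.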

A geometric interpretation:
The corpus is first formatted and stemmed, and is then stored in a compact term-by-document matrix. Each column of this matrix is then normalised so that it gives the relative frequency of terms in a document.

The term-by-document matrix is then decomposed to calculate eigenvalues and eigenvectors. The eigenvectors represent a new Cartesian coordinate frame spanning the same search space, but they indicate the most important dimensions/axes along which the documents mainly lie. The eigenvalues quantify the spread of documents along these new axes.
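
To make the geometric picture concrete, the sketch below finds only the single most important axis, using power iteration on the product of the matrix with its transpose. This is a swapped-in, simpler technique than a full decomposition, chosen because it needs no libraries. The function name dominant_axis and the sample matrix are assumptions, and the method relies on the matrix holding non-negative counts so that the all-positive starting vector is not orthogonal to the answer.

```python
import math


def dominant_axis(matrix, iterations=200):
    """Power iteration on A * A^T for a term-by-document matrix A (rows are terms).

    Returns (eigenvalue, unit eigenvector) for the largest eigenvalue.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    vector = [1.0 / math.sqrt(n_rows)] * n_rows
    eigenvalue = 0.0
    for _ in range(iterations):
        # projections = A^T v: how far each document lies along the current axis
        projections = [sum(matrix[i][j] * vector[i] for i in range(n_rows))
                       for j in range(n_cols)]
        # product = A (A^T v)
        product = [sum(matrix[i][j] * projections[j] for j in range(n_cols))
                   for i in range(n_rows)]
        eigenvalue = math.sqrt(sum(x * x for x in product))
        if eigenvalue == 0:
            break  # the matrix is all zeros, so there is no axis to find
        vector = [x / eigenvalue for x in product]
    return eigenvalue, vector


# Two terms (rows), three documents (columns); the counts are made up.
# Documents 0 and 1 use only the first term, document 2 only the second.
A = [
    [2, 2, 0],
    [0, 0, 1],
]
eigenvalue, axis = dominant_axis(A)
print(eigenvalue, axis)
```

Here the documents spread far more along the first term than the second, so the dominant axis points along the first term and its eigenvalue is much larger than the other one.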

Queries:
Queries must be based on the features/terms defined in the term-by-document matrix. Matching in a vector space such as this is implemented by multiplying the query vector by the term-by-document matrix, i.e. matching a query vector q against the document columns of the matrix.
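
A minimal sketch of this matching step, scoring each document by the cosine of the angle between its column and the query vector. The function name cosine_scores, the four terms and the counts are invented for the example, and the matrix is left unnormalised so that the function divides by the vector lengths itself.

```python
import math


def cosine_scores(matrix, query):
    """Cosine of the angle between the query and every document column."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    query_length = math.sqrt(sum(q * q for q in query))
    scores = []
    for j in range(n_cols):
        column = [matrix[i][j] for i in range(n_rows)]
        dot = sum(q * c for q, c in zip(query, column))
        column_length = math.sqrt(sum(c * c for c in column))
        if query_length == 0 or column_length == 0:
            scores.append(0.0)  # nothing to compare against
        else:
            scores.append(dot / (query_length * column_length))
    return scores


# Hypothetical index: four terms (rows) and three documents (columns).
terms = ["heritage", "mill", "museum", "river"]
matrix = [
    [0, 0, 1],
    [1, 1, 0],
    [0, 1, 1],
    [2, 0, 1],
]

# The query is expressed in the same term space as the documents.
query = [1 if term in {"river", "mill"} else 0 for term in terms]
scores = cosine_scores(matrix, query)
ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
print(scores)
print(ranking)
```

The ranking lists document numbers from best match to worst. A query word that is not among the indexed terms contributes nothing, which is why queries must be based on the terms defined in the matrix.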

I am the website administrator of the Wandle Industrial Museum (http://www.wandle.org). The museum was established in 1983 by local people determined to ensure that the history of the valley was no longer neglected, and to enhance awareness of its heritage for the use and benefit of the community.
