This project is read-only.

Create Full Text Index - Updatable

This document is about how to create full text index for Updatable Mode

Updatable and AppendOnly model both build full text index in passive mode. The difference is that Updatable mode can add, modify and delete full text index data. AppendOnly mode can only add or delete (delete by code), it can not modify. The synchornization in Updatable mode is much more complex than that in AppendOnly mode.
If the data in table will only have add or delete operations, we suggest to use AppendOnly mode. Otherwise, user has to use Updatable mode.

Before we create full text index, we must create a database in HubbleDotNet.

We can build full text index for the tables or views in relational database after we create HubbleDotNet database. Now we use EnglishNews sample database to demonstrate how to build Updatable full text index.
image

 

Above image is EnglishNews table structure in SQL Server

Right click on Query Analyzer, Choose "Create Table on News database node

 

 

As in above image, Type in HubbleDotNet table name. This name does not need to be exactaly same as the table in relational database. We use "EnglishNews" as HubbleDotNet table name.
Then we type in full index directory and database adapter. The adapter we use is for SQL Server 2005 and later versions. The last thing to do is the connection string. We can test the connection string using "Test DB Connection String" button.

Once everything is configured, we can click "Next" Button.

Choose Index Mode

 

image

 

As in above image, we choose "Build Index from existing table" since we are in passive index mode.

Then we type "EnlighsNews" in Existing table name or View Name. Since HubbleDotNet can do index for tables and views. We can input either table name or view name.

Because we need to update this table, so we choose increamental mode as Updatable. Click Next.Configure Index Fields

 

image

 

 

As in above image, HubbleDotNet will automatically read table structure to help user to generate Updatable index table. Database table must have an Int or Bigint type ID field. The name of this field can be anything except DocId.

This field must be configured as Untokenized type, and it must be self increamental. If the index field is not int or bigint, then we must add an identity field in table or view, and build index according to that field. Click next when we finish.

Complete Index

image

 

This step will list the create statement for varification. Click Finish button to create index table.

[IndexOnly]

[DocId('Id')]

[Directory ('d:\test11\EnglishNews\')]

[DBTableName ('EnglishNews')]

[DBAdapter ('SQLSERVER2005')]

[DBConnect ('Data Source=(local);Initial Catalog=Test;Integrated Security=True')]

Create table EnglishNews

(

Id BigInt Untokenized NULL default 0 ,

GroupId Int Untokenized NULL default 0 ,

SiteId Int Untokenized NULL default 0 ,

Title NVarchar(max) Tokenized Analyzer 'EnglishAnalyzer' NULL default '' ,

Content NVarchar(max) Tokenized Analyzer 'EnglishAnalyzer' NULL default '' ,

Time DateTime Untokenized NULL default '1900-1-1' ,

URL NVarchar(max) NULL ,

ImageUrl NVarchar(max) NULL 

);

 

[IndexOnly]
[DocId('Id')]
[Directory ('d:\test11\EnglishNews\')]
[DBTableName ('EnglishNews')]
[DBAdapter ('SQLSERVER2005')]
[DBConnect ('Data Source=(local);Initial Catalog=Test;Integrated Security=True')]
Create table EnglishNews
(
Id BigInt Untokenized NULL default 0 ,
GroupId Int Untokenized NULL default 0 ,
SiteId Int Untokenized NULL default 0 ,
Title NVarchar(max) Tokenized Analyzer 'EnglishAnalyzer' NULL default '' ,
Content NVarchar(max) Tokenized Analyzer 'EnglishAnalyzer' NULL default '' ,
Time DateTime Untokenized NULL default '1900-1-1' ,
URL NVarchar(max) NULL ,
ImageUrl NVarchar(max) NULL 
);

 

 

Click Yes to start Index

image

Click rebuild to create full text index for EnglishNews 
 Updatable mode can only rebuild for empty index. If HubbleDotNet has an index record for this table, it won't allow rebuild. User must use ReuilbdWholeTable to clear all index records and re-create index table. If user wants to add or modify index table, they must use auto-Sync function. To use Auto-Sync function, please see Automatically synchornize data from table.

 

 

image

 

image

 

Once we build full text index, we can optimize it, then we can start search.

 

Search news

Sample 1

Search titles that contains any keyword in "abc news to cut",and sort by matching score

SQL 语句: select top 10 Id, Title, Score from EnglishNews where title match 'abc^5000^0 news^5000^3 to^5000^7 cut^5000^9' order by score desc

Result

image

 

The arugments after each keyword are

First argument means the weight of this keyword, eg: 5000

Second argument means the position of this keyword in search string. eg: "abc" is at position 0 in "abc news to cut". and news is at position 3 in same string

Top 10 means it only returns first 10 records

We can see the return result sorted by matching score

Sample 2

Search titles that contain all keywords in "abc news to cut", and sort by matching score

SQL

select top 10 Id, Title, Score from EnglishNews where title contains 'abc^5000^0 news^5000^3 to^5000^7 cut^5000^9' order by score desc

 

Results:

image

 

Using "Contains", we can do presice match. The records returned by Contains will be less than Match because of the AND condition in keywords.
Again, because "to" is a common word, we want the record that contains "abc news cut" and "to" gets nigher score than the record that contains "abc news cut" but not "to". This search algorithm has been used in google or baidu. They use AND relation to non-frequent keywords and OR relation to frequent keywords.

SQL

select top 10 Id, Title, Score from EnglishNews where title contains 'abc^5000^0 news^5000^3 to^5000^7^1 cut^5000^9' order by score desc
Result:

image

 

This time there has one more record at third position. This record does not have "to" keyword

The arguments after "to" keyword: to^5000^7^1

first two arguments are weight and position

Third argument is flag, 1 means or condition

Example 3

Search title contains any of "abc news to cut", order by matching score and also do statistics for GroupId field. Return most 10 group id

SQL

[GroupBy('Count', '*', 'GroupId', 10)]
select top 10 Id, Title, Score from EnglishNews where title match 'abc^5000^0 news^5000^3 to^5000^7 cut^5000^9' 
order by score desc
 

Result:

 

image

Last argument in attribute is "10", it means we return top 10 records in statistics. System will return all statistics result if it is missing.

Group By field must be untokenized index type and it can not be a char type.

this sql will return 2 data tables. First one is query result, second one is statistics result.

One step further

[GroupBy('Count', '*', 'GroupId', 10)]
[GroupBy('Count', '*', 'SiteId', 10)]
select top 10 Id, GroupId, SiteId, Title, Score from EnglishNews where title match 'abc^5000^0 news^5000^3 to^5000^7 cut^5000^9' order by score desc
image 

This example returns statistics result for GroupID and SiteId. Many e-commerce web site need more than one statistics results for same query. HubbleDotNet can return those result in one query. It is much faster than Lucene.

 

SQL for GroupBy

select top 10 GroupId, count(*) as count from englishnews where title match 'abc^5000^0 news^5000^3 to^5000^7 cut^5000^9' group by GroupId order by count desc

One Step Further

[GroupBy('Count', '*', 'GroupId,SiteId', 10)]
select top 10 Id, GroupId, SiteId, Title, Score from EnglishNews where title match 'abc^5000^0 news^5000^3 to^5000^7 cut^5000^9' order by score desc

 

image

 

Above SQL can do statistics for mutliple fields. Please Note that when grouping multiple fields, the total number of byte can not exceed 8 bytes and it can have at most 2 int type or 8 tiny int type

Last edited Sep 13, 2011 at 8:41 AM by linkspeed, version 2

Comments

No comments yet.