Create Full Text Index for Existing Table (2) - Updatable Mode

This document is about how to create full text index for Updatable Mode
Updatable and AppendOnly model both build full text index in passive mode. The difference is that Updatable mode can add, modify and delete full text index data. AppendOnly mode can only add or delete (delete by code), it can not modify. The synchornization in Updatable mode is much more complex than that in AppendOnly mode.
If the data in table will only have add or delete operations, we suggest to use AppendOnly mode. Otherwise, user has to use Updatable mode.

Before we create full text index, we must create a database in HubbleDotNet.

We can build full text index for the tables or views in relational database after we create HubbleDotNet database. Now we use EnglishNews sample database to demonstrate how to build Updatable full text index.

 

Above image is EnglishNews table structure in SQL Server

Create Index table in HubbleDotNet

Right click on Query Analyzer, Choose "Create Table on News database node

As in above image, Type in HubbleDotNet table name. This name does not need to be exactaly same as the table in relational database. We use "EnglishNews" as HubbleDotNet table name.Then we type in full index directory and database adapter. The adapter we use is for SQL Server 2005 and later versions. The last thing to do is the connection string. We can test the connection string using "Test DB Connection String" button.

Once everything is configured, we can click "Next" Button.

Choose Index Mode

 


 

As in above image, we choose "Build Index from existing table" since we are in passive index mode

Then we type "EnlighsNews" in Existing table name or View Name. Since HubbleDotNet can do index for tables and views. We can input either table name or view name.

Because we need to update this table, so we choose increamental mode as Updatable. Click Next.

Configure Index Fields

As in above image, HubbleDotNet will automatically read table structure to help user to generate Updatable index table. Database table must have an Int or Bigint type ID field. The name of this field can be anything except DocId.

This field must be configured as Untokenized type, and it must be self increamental. If the index field is not int or bigint, then we must add an identity field in table or view, and build index according to that field. Click next when we finish.

Complete Index


This step will list the create statement for varification. Click Finish button to create index table.
[IndexOnly]
[DocId('Id')]
[Directory ('d:\test11\EnglishNews\')]
[DBTableName ('EnglishNews')]
[DBAdapter ('SQLSERVER2005')]
[DBConnect ('Data Source=(local);Initial Catalog=Test;Integrated Security=True')]
Create table EnglishNews
(
Id BigInt Untokenized NULL default 0 ,
GroupId Int Untokenized NULL default 0 ,
SiteId Int Untokenized NULL default 0 ,
Title NVarchar(max) Tokenized Analyzer 'EnglishAnalyzer' NULL default '' ,
Content NVarchar(max) Tokenized Analyzer 'EnglishAnalyzer' NULL default '' ,
Time DateTime Untokenized NULL default '1900-1-1' ,
URL NVarchar(max) NULL ,
ImageUrl NVarchar(max) NULL 
);
Click Yes to start Index


Click rebuild to create full text index for EnglishNews

  Updatable mode can only rebuild for empty index. If HubbleDotNet has an index record for this table, it won't allow rebuild. User must use ReuilbdWholeTable to clear all index records and re-create index table. If user wants to add or modify index table, they must use auto-Sync function. To use Auto-Sync function, please see Automatically synchornize data from table.


Once we build full text index, we can optimize it, then we can start search.

Search News

Example 1

select top 10 Id, Title, Score from EnglishNews where title

match 'abc^5000^0 news^5000^3 to^5000^7 cut^5000^9' order by score desc

 

The arguments after each keyword segement are explained below:

First argument means the weight of the keyword. It is 5000 in this example.

Second argument means the real position in the setence. eg: "abc" at position 0, "news" should start from 3

top 10 means first 10 records.

 

Example 2

SQL

select top 10 Id, Title, Score from EnglishNews where title contains 'abc^5000^0 news^5000^3 to^5000^7 cut^5000^9'
order by score desc



We can also use "Contains" to search. It will get much less records, because "Contains" search requires match for  all four keywords

Since "to" is a common word, we want to matching with "abc", "news", "cut" first. If record contains "to", then its rank should be highter than the record that does not have "to". This search algorithm is used in google or baidu. In google search, if keywords contains stop words. It will apply AND condition for non-stop words, and OR condition for stop words. But the records that contains stop words will have highter rank than those don't have.

SQL

select top 10 Id, Title, Score from EnglishNews where title contains 'abc^5000^0 news^5000^3 to^5000^7^1 cut^5000^9' order by score desc

In this example, it returns one more record at third position. That record does not have "to" word. The arguuments after word "to" are like "to^5000^7^1"

The first two arguments are weight and position. The third argument is a flag. 1 means OR condition.

Example 3

Search title that contains any word in abc, news, to, cut, and sort by matching, and give statisitics for GroupId field. Output maximum 10 groupid.

SQL

[GroupBy('Count', '*', 'GroupId', 10)]
select top 10 Id, Title, Score from EnglishNews where title match 'abc^5000^0 news^5000^3 to^5000^7 cut^5000^9' order by score desc


The last argument 10 is optional. It means return first 10 results with most counts. If 10 is missing, it will return all statistics results.

Group By 字段必须为 untokenized 类型索引字段且不能是字符串类型。

Group By must be untokenized index field and it can not be a char type.

The execution of this sql will return 2 data tables. First table is the result of select, second is group by result.

That means it will do full text search and statistics at same time.

 

[GroupBy('Count', '*', 'GroupId', 10)]
[GroupBy('Count', '*', 'SiteId', 10)]
select top 10 Id, GroupId, SiteId, Title, Score from EnglishNews where title match 'abc^5000^0 news^5000^3 to^5000^7 cut^5000^9' 
order by score desc
 

This SQL output statistics for GroupId and SiteId. Many ecommerce sites require mutliple statistics rsult for one search.
HubbleDotNet can output multiple statistics result using one sql. 

The GroupBy function in HubbleDotNet is done in low level implementation, it is much simpler and faster than that in Lucence.
Lucence implements the statistics based on filters after search.

Group by is a pseudo SQL, for example
select top 10 GroupId, count(*) as count from englishnews where title match 'abc^5000^0 news^5000^3 to^5000^7 cut^5000^9' 
group by GroupId order by count desc
 
Third example:
[GroupBy('Count', '*', 'GroupId,SiteId', 10)]
select top 10 Id, GroupId, SiteId, Title, Score from EnglishNews where title match 'abc^5000^0 news^5000^3 to^5000^7 cut^5000^9' 
order by score desc

 

Above SQL can do statistics on multiple fields. Please be aware that when we do multiple Group By, the total byte can not exceed 8 bytes.

That means at most 2 int type or 8 tinyint type.

For more example, please check Create Full Text Index for Existing Table(1) - Append Only Mode

Last edited Aug 23, 2011 at 3:08 AM by linkspeed, version 7

Comments

No comments yet.