Index Server
Authors: Beth Sheresh, Doug Sheresh, Robert Cowart
Published: April 1999
Copyright: 1999
Publisher: IDG Books
 


Abstract
This chapter describes how to deploy the Microsoft Index Server to index your Web server documents and provide your users with powerful query capabilities to locate document content and properties.

Internet and intranet Web and FTP sites are becoming a common means for companies to provide access to a wide range of enterprise-related information, ranging from products and services to policies and press releases. Consequently, the volume of information provided on Web and FTP sites in HTML and other document formats is increasing every day.

This ever-growing volume of information makes it increasingly difficult to locate specific information on these Web or FTP sites. To help users find relevant information, most Web sites now provide some kind of search mechanism, though in many cases only text or HTML file content can be searched.

The Microsoft Index Server is an add-on service to the Internet Information Server (IIS) that enables the indexing of document properties and contents on the IIS server. Once the IIS server documents are indexed, users can use their browsers to query the Index Server in order to search for selected document contents or properties. Responses to user queries are presented back to the client browser as a series of HTML links to the documents containing the query properties or contents.

The Microsoft Index Server makes information easier to locate by indexing both the contents and the properties of all specified documents managed by the IIS server. It indexes not only text and HTML files but also common document formats (such as .DOC and .XLS). Index Server uses the virtual directories assigned in the Internet Information Server to control indexing. The entire collection of directories and documents indexed by Index Server is referred to as the corpus.


INDEX SERVER FEATURES

The Microsoft Index Server provides a rich set of features not commonly found in Web site search engines. Its searching capabilities also suit corporate intranet use, giving in-house network clients a robust tool for locating information:
  • Full text searches. Web site visitors can search the Web site documents for words, sets of words, phrases, and entire sentences.
  • Multiple document format searches. Index Server uses content filters based on open standards, enabling users to search any document type for which a content filter is installed. Although most search engines process only HTML or text documents, Index Server can search Word .DOC and Excel .XLS files, and it supports the addition of content filters for other document types.
  • Document properties searches. Index Server supports querying of documents by properties including date, author, subject, and file size.
  • Advanced search operators. Index Server supports advanced logical search operations, using Boolean operators (AND, OR, NOT), numeric operators (>, >=, <=, <, =), and the NEAR proximity operator (to locate words near other words).
  • Wild-card searches. In addition to exact text matching, users can query indexed documents using wild-card arguments.
  • Familiar Web-based forms for queries. Client access to Index Server leverages the user’s familiarity with Web-based/HTML forms, and helps with specifying search criteria and displaying the query results.
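
The search operators listed above can be illustrated with a small sketch. This is not Index Server's query syntax or implementation; it is a toy model in Python, and the function names and the five-word NEAR window are assumptions made only for illustration.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into a list of words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def matches_and(words, a, b):
    """Boolean AND: both terms occur somewhere in the document."""
    return a in words and b in words

def matches_near(words, a, b, window=5):
    """NEAR: the two terms occur within `window` words of each other.
    The window size is an assumption, not a documented Index Server value."""
    pos_a = [i for i, w in enumerate(words) if w == a]
    pos_b = [i for i, w in enumerate(words) if w == b]
    return any(abs(i - j) <= window for i in pos_a for j in pos_b)

def matches_prefix(words, prefix):
    """Wild-card style match: any word starts with the given prefix."""
    return any(w.startswith(prefix) for w in words)

doc = tokenize("Index Server indexes the contents and properties of documents.")
print(matches_and(doc, "index", "documents"))
print(matches_near(doc, "index", "server"))
print(matches_prefix(doc, "prop"))
```

A real engine answers these questions from the index rather than by rescanning each document, which is why the indexing stages described later matter.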
Index Server automatically creates the index on the selected document directories. Index Server is notified when documents are added to (or deleted from) the selected directories or changed since the last indexing, and the index for these new or altered documents is automatically updated. The indexing process is designed to operate as a background task to minimize system resource use while running, and requires no input from the administrator to complete the indexing operation.

Index Server is completely integrated with the IIS and Windows NT Server security, providing substantial control over access to indexed documents. Web site clients searching through the indexed documents are only allowed to view documents that they have permission to access.


INSTALLING INDEX SERVER

The Microsoft Index Server 2.0 is installed as part of the Windows NT 4.0 Option Pack. Index Server is selected by default, so its base components are installed automatically. However, the Index Server online documentation is not selected by default. If you want the documentation copied to the server, you must select it specifically in a Custom installation.

The overall size of the documents to be indexed, and of the index itself, must be considered carefully when planning the Index Server installation. Depending on the number of languages you want Microsoft Index Server to support, the size of the installed base Index Server files can range from 3MB to 12MB.

Along with the base files for Microsoft Index Server, the index itself may require up to 40 percent of the combined size of all files to be indexed. Therefore, if you have 1GB of files to be indexed, the index created from these files can be up to 400MB.
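
The sizing arithmetic above can be expressed as a rough upper-bound estimate. The function name and parameters below are hypothetical; the 12MB base size and the 40 percent ratio are the worst-case figures quoted in the text, and 1GB is treated as 1000MB to match the example.

```python
def estimate_disk_mb(corpus_mb, base_files_mb=12, index_percent=40):
    """Worst-case disk estimate in MB: base Index Server files plus an
    index of up to `index_percent` percent of the corpus size."""
    return base_files_mb + corpus_mb * index_percent / 100

# 1GB of documents (treated as 1000MB) with all language files installed
print(estimate_disk_mb(1000))
```

This covers only Index Server's own footprint; the documents themselves still need their own space.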


UNDERSTANDING INDEX SERVER

Now let’s take a look at how the Index Server indexes documents. First, the Index Server operations are performed by the Content Index service, which operates as an NT Service configurable from the Services Control Panel. The Content Index service is installed during the Index Server setup, and starts up automatically when the server is booted. In the service properties, the Startup Type is automatic, and the Allow Service to Interact with Desktop checkbox is selected.

The Content Index service operates on documents contained within a set of directories (the scope), which are collectively grouped within a catalog. These documents are scanned for content and properties, and the results are written to an index in the catalog’s directory. The content of the documents is filtered, subdivided into words (word breaking), and then normalized. Clients can query this index by accessing Index Server across the network, using a variety of interfaces (ASP, SQL, IDX) and methods.

Index Server Catalogs
An Index Server catalog is the top-level unit containing the index and cached properties for all defined query scopes (all virtual directories specified to be indexed and searchable). A catalog is linked to a specific directory (set during installation) which contains the index data created by Index Server.

In the default installation of Index Server there is one defined catalog named Web, which contains the Directories and Properties folders. The Directories and Properties folders are created for every newly defined catalog. When you create a new catalog, these folders initially have no content. Catalogs are listed in the registry as subkeys of the following registry key:
  HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\ContentIndex\Catalogs
Thus the “Web” default catalog is listed as a subkey under the Catalogs key, and contains a Scopes subkey specifying the assigned directories to index, as well as a Location value specifying the drive and directory containing the catalog.
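
As a sketch of how an administrator might inspect these entries, the following Python uses the standard-library winreg module (available only on Windows) to list each catalog subkey and its Location value. The function names are hypothetical, and the code assumes the key layout described above; it reads the registry and changes nothing.

```python
CATALOGS_KEY = r"SYSTEM\CurrentControlSet\Control\ContentIndex\Catalogs"

def catalog_subkey(name):
    """Registry path (under HKEY_LOCAL_MACHINE) for one named catalog."""
    return CATALOGS_KEY + "\\" + name

def list_catalogs():
    """Return {catalog name: Location value}. Runs only on Windows."""
    import winreg  # standard library, Windows only
    result = {}
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, CATALOGS_KEY) as root:
        subkey_count = winreg.QueryInfoKey(root)[0]
        for i in range(subkey_count):
            name = winreg.EnumKey(root, i)
            with winreg.OpenKey(root, name) as catalog:
                location, _value_type = winreg.QueryValueEx(catalog, "Location")
            result[name] = location
    return result

print(catalog_subkey("Web"))
```

On a server with the default installation, list_catalogs() would be expected to include an entry for the Web catalog.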

Note: Files contained in the catalog’s directory should not be deleted or moved manually.

Scanning Documents for Index Server
The Index Server supports two different types of scanning, full and incremental.

A full scan identifies all of the documents in the selected directories and adds them to the list of documents that Index Server will filter. A full scan is automatically done when a directory is added to the index scope, as well as when recovering after a serious error.

An incremental scan adds only modified documents to the filter list. An incremental scan is conducted on all indexed directories at every startup of the Index Server. Incremental scans are also performed when file change notifications are lost due to a high rate of updates causing buffer overflows.
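
The difference between the two scan types can be sketched as follows. This is a simplified model, not Index Server's algorithm: it assumes, for illustration only, that each document is identified by its path and a last-modified timestamp, and that the index remembers the timestamp it last saw.

```python
def full_scan(files):
    """Full scan: queue every document in the scope for filtering."""
    return sorted(files)

def incremental_scan(files, indexed):
    """Incremental scan: queue only documents that are new or whose
    timestamp differs from the one recorded at the last indexing."""
    return sorted(path for path, mtime in files.items()
                  if indexed.get(path) != mtime)

# Hypothetical scope contents: path -> last-modified timestamp
files = {"a.htm": 1, "b.doc": 5, "c.xls": 3}
# Timestamps recorded when the documents were last indexed
indexed = {"a.htm": 1, "b.doc": 2}

print(full_scan(files))
print(incremental_scan(files, indexed))
```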

Although NTFS provides change notifications informing Index Server of file additions, changes, and deletions, some remote file systems (such as Windows 95 and Novell NetWare) do not support file change notifications. On such file systems, Index Server will default to performing periodic scans of the selected shares, at an interval set in the ForcedNetPathScanInterval registry setting.

Scans can also be forced from the Index Server Manager. A forced scan should be done after installing or removing a filter or word breaker, as well as after changing a filter’s registry information.

How Indexing Works
Upon installation of Index Server, the initial catalog is created and the contents of the selected directories are indexed.

Once the initial indexing of documents has been performed, Index Server does not rescan the selected directories unless instructed to do so by an administrator using the rescan option. Instead, the Index Server registers with the file system to be provided with change notifications so it knows when a file has been modified, added, or deleted, and then it updates the indexes as required by changes in files or content.

The indexing process itself is performed in three separate stages referred to as filtering, word breaking, and normalization. Content filters, word breakers, and normalizers are all modular components that can be created by independent software vendors to provide the needed linguistic operations for languages not currently supported by Index Server.

Filtering
In the first stage, the contents of all selected documents are filtered. At this stage, Index Server uses its content filters to analyze different file formats. The content filters enable Index Server to read not only HTML or text files but also Microsoft Word (.DOC) files, Microsoft Excel (.XLS) files, and any other file formats for which it has a content filter.

As part of the filtering process, the content filter extracts portions of text from the documents and provides them to Index Server in a recognizable format. The content filter recognizes shifts in the language used in a document, and marks them appropriately so that the correct word breaker and normalizer for that language are employed.

The content filter also handles embedded objects (such as embedded spreadsheets), identifies the type of embedded object, and then activates the required filter.
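
The idea of choosing a filter by file format can be sketched as a simple dispatch table. This is an illustration only: real content filters are installable components, whereas the two filter functions and the FILTERS table below are hypothetical stand-ins, and the HTML filter is deliberately crude.

```python
import re

def filter_text(raw):
    """Plain text needs no conversion."""
    return raw

def filter_html(raw):
    """Drop tags and keep the visible text (a deliberately crude filter)."""
    return re.sub(r"<[^>]+>", " ", raw)

# Hypothetical registry of filters, keyed by file extension
FILTERS = {".txt": filter_text, ".htm": filter_html, ".html": filter_html}

def extract_text(filename, raw):
    """Pick a filter by extension; return None when none is registered."""
    dot = filename.rfind(".")
    ext = filename[dot:].lower() if dot != -1 else ""
    handler = FILTERS.get(ext)
    return handler(raw) if handler else None

print(extract_text("default.htm", "<b>Index</b> Server"))
```

Supporting a new format in this model means adding one entry to the table, which mirrors the modular design described above.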

Word Breaking
The second stage of the indexing process is word breaking. The data provided by content filters is essentially a stream of characters. To identify distinct words in this stream, Index Server analyzes it and determines where the breaks between words occur.

This process is language dependent, because the places where word breaks occur are different for each language. For example, whereas English and some European languages use white space and punctuation to indicate word breaks, some languages (such as Japanese) do not.

Index Server includes seven language-specific word breakers, which enable the indexing process to identify and index words for these languages. The seven supported languages are English, French, German, Spanish, Italian, Dutch, and Swedish.

Because the word breakers use streams of characters and convert them into words, Index Server stores all index data in Unicode to avoid code page and double-byte character set issues.
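
For a language that separates words with white space and punctuation, word breaking can be approximated with a regular expression. The sketch below is an assumption-laden simplification, not one of the shipped word breakers, and it would not work for a language such as Japanese that does not mark word boundaries this way.

```python
import re

def break_words(stream):
    """Split a character stream into words on white space and punctuation.
    Python strings are Unicode, so accented letters stay inside words."""
    return re.findall(r"\w+", stream)

print(break_words("Index Server, version 2.0"))
```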

Normalization
The third and last stage of the indexing process is the normalization of the text. The normalizing process takes the words provided by the word breaker and removes the “noise words,” manages the capitalization and punctuation, and generally cleans up the output from the word breaker.

In the context of indexing, “noise words” are grammatically necessary connecting words like “of,” “the,” “it,” “is,” and “you,” which provide no informational content. To save space, noise words are not recorded in the index; instead, each language has its own customizable noise word list.
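
A minimal normalizer along these lines might look like the following. The NOISE_WORDS table is a hypothetical in-memory stand-in for the per-language noise word lists, seeded only with the example words from the text, and lowercasing stands in for the capitalization handling.

```python
# Hypothetical per-language noise word lists (English examples from the text)
NOISE_WORDS = {"en": {"of", "the", "it", "is", "you"}}

def normalize(words, language="en"):
    """Lowercase each word and drop the noise words for the language."""
    noise = NOISE_WORDS.get(language, set())
    lowered = (w.lower() for w in words)
    return [w for w in lowered if w not in noise]

print(normalize(["The", "Index", "of", "Documents"]))
```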


TYPES OF INDEXES

As part of its processing of document content and properties, Index Server employs three discrete types of indexes. Index Server uses both transitory, RAM-based word list indexes and persistent shadow and master indexes written to disk.

Word List Indexes
As the selected documents are filtered, all of the non-noise words from the documents are stored in a word list. Word lists are small indexes that are built and maintained in the server’s RAM. Once the number of RAM-based word lists exceeds the maximum number of word lists allowed, the word lists are merged and written to disk into a shadow index (a process referred to as a shadow merge). The maximum number of word lists is stored in the MaxWordLists registry entry, which defaults to 20. Word lists are re-created anytime the Content Index service is started.
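
The word list mechanism can be modeled in a few lines. This is a sketch under stated assumptions, not the actual data structures: it treats each word list as a mapping from word to document IDs, creates one word list per document, and keeps the "shadow indexes" in memory instead of writing them to disk. The Catalog class and its method names are hypothetical.

```python
from collections import defaultdict

MAX_WORD_LISTS = 20  # default of the MaxWordLists registry entry, per the text

class Catalog:
    def __init__(self, max_word_lists=MAX_WORD_LISTS):
        self.max_word_lists = max_word_lists
        self.word_lists = []      # small in-memory (RAM) indexes
        self.shadow_indexes = []  # stand-ins for on-disk shadow indexes

    def add_word_list(self, doc_id, words):
        """Record one document's words; shadow-merge once over the limit."""
        self.word_lists.append({w: {doc_id} for w in words})
        if len(self.word_lists) > self.max_word_lists:
            self.shadow_merge()

    def shadow_merge(self):
        """Merge all word lists into one shadow index and clear them."""
        merged = defaultdict(set)
        for word_list in self.word_lists:
            for word, docs in word_list.items():
                merged[word] |= docs
        self.shadow_indexes.append(dict(merged))
        self.word_lists = []

catalog = Catalog(max_word_lists=2)
catalog.add_word_list(1, ["index", "server"])
catalog.add_word_list(2, ["catalog"])
catalog.add_word_list(3, ["index"])  # third list exceeds the limit of 2
print(len(catalog.word_lists), len(catalog.shadow_indexes))
```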

Persistent Indexes
Persistent indexes are larger index information structures written to disk. You can have a maximum of 255 persistent indexes (shadow and master) in any given catalog. The two persistent index types are:
  • Shadow index. Created by the merging of distinct word lists and/or independent shadow indexes into a single index.
  • Master index. Formed by a master merge, which combines the current master index with all shadow indexes into a new master index. Once the merge is complete, all the shadow indexes and the original master index are deleted.
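
A master merge can be sketched in the same simplified model, again assuming that each index is a mapping from word to the set of document IDs containing it. The function name is hypothetical, and returning an empty list is only a stand-in for deleting the shadow indexes from disk.

```python
def master_merge(master, shadows):
    """Combine the old master index and all shadow indexes into a new
    master index. Returns the new master and the remaining shadows."""
    new_master = {word: set(docs) for word, docs in master.items()}
    for shadow in shadows:
        for word, docs in shadow.items():
            new_master.setdefault(word, set()).update(docs)
    return new_master, []  # the shadow indexes are discarded after the merge

old_master = {"index": {1}}
shadows = [{"index": {2}, "server": {2}}, {"catalog": {3}}]
new_master, remaining = master_merge(old_master, shadows)
print(sorted(new_master))
```

After the merge, queries need to consult a single persistent index instead of several.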

