Have terabytes of data at your disposal but no way to access it? This post contains hard-won recommendations gleaned from years of experience working for a search engine software company specialising in enterprise and developer search engines. While the recommendations make specific reference to the dtSearch® product range, they are universally applicable.
How to Create an Index for Terabytes of Data
The first recommendation is to leverage the search engine’s indexing capabilities rather than conducting an unindexed search. An unindexed search has to read through the data itself every time a query runs, which is slow on large collections. Indexed search is often instantaneous, even over terabytes of data. (Technically, concurrent indexed searches can run in a single thread or across multiple threads, in an online or network context, without interfering with one another.)
What exactly is an index?
An index is just an internal mechanism that enables the search engine to search through terabytes of data in a matter of seconds. How do you build such an index? Simply point the search engine at whatever you wish to index, and the search engine will take care of the rest. It’s fine if you have no idea what’s in your data.
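To make the idea concrete, here is a minimal sketch of an inverted index, the general data structure behind most indexed search. It is an illustration only, not how dtSearch or any particular product stores its index; the function names and sample documents are made up for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"\w+", text.lower())


def build_index(docs):
    """Map each word to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in tokenize(text):
            index[word].add(name)
    return index


def search(index, query):
    """Return the documents that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample data, standing in for real files.
docs = {
    "memo.txt": "Quarterly budget review for the purple team",
    "notes.txt": "Budget notes and blue sky ideas",
}
index = build_index(docs)
print(sorted(search(index, "budget")))
print(sorted(search(index, "purple budget")))
```

The point of the sketch is that a query only touches the small word-to-document table, never the documents themselves, which is why lookups stay fast as the collection grows.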
The search engine can recognise file formats such as Microsoft Word, Access, Excel, PowerPoint, and OneNote, as well as email attachments, PDFs, and web-based formats such as HTML or XML.
The search engine can index data automatically by sifting through compressed archives such as RAR and ZIP.
However, what if certain PDF files are saved with Microsoft Word file extensions such as .DOCX, while some Access files are saved with Excel file extensions, and so forth?
This is not a problem. The search engine’s document filters can look inside each file to establish the correct file type, without relying on the file extension.
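One common way to identify a file without trusting its extension is to check the first few bytes for a known signature. The sketch below shows that general technique for three well-known signatures; it is an assumption about how such detection can work, not a description of any product’s document filters, and `sniff_type` is a name invented for the example.

```python
# Well-known leading byte signatures ("magic numbers").
SIGNATURES = [
    (b"%PDF-", "pdf"),
    # ZIP archives, and ZIP-based formats such as DOCX and XLSX.
    (b"PK\x03\x04", "zip-container"),
    # OLE compound files: legacy DOC, XLS, PPT, and Outlook MSG.
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole-compound"),
]


def sniff_type(data):
    """Guess a file type from its leading bytes, ignoring any file name."""
    for magic, kind in SIGNATURES:
        if data.startswith(magic):
            return kind
    return "unknown"


# A PDF that was saved as "report.docx" is still recognised as a PDF,
# because only the content is inspected.
mislabeled = b"%PDF-1.7\n% rest of the file"
print(sniff_type(mislabeled))
```

A real filter would go further, for example opening a ZIP container to tell a DOCX from an XLSX, but the first step is the same: read the content, not the name.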
Additionally, the document filters can search for nested documents within files.
If a ZIP or RAR file contains an embedded Excel file, and the Excel file contains an Access database and a Word document, the document filters will also locate and parse the embedded documents. Note that while text that is black on black, white on white, or red on red may appear to be invisible when viewed in the file’s associated application, it is simply plain text for a search engine.
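The nesting described above is naturally handled by recursion: open a container, and for each item inside, check whether it is itself a container. Here is a minimal sketch using only Python’s standard `zipfile` module. It handles ZIP nesting only (the standard library has no RAR support), and `walk_container` and `make_zip` are hypothetical helper names for this example.

```python
import io
import zipfile


def walk_container(data, path):
    """Yield (path, bytes) for every leaf file, descending into nested ZIPs."""
    if zipfile.is_zipfile(io.BytesIO(data)):
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            for info in archive.infolist():
                if info.is_dir():
                    continue
                inner = archive.read(info)
                # Recurse: the inner item may be another archive.
                yield from walk_container(inner, f"{path}/{info.filename}")
    else:
        yield path, data


def make_zip(files):
    """Build an in-memory ZIP from a {name: content} mapping (demo helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, content in files.items():
            archive.writestr(name, content)
    return buffer.getvalue()


inner = make_zip({"report.txt": b"quarterly numbers"})
outer = make_zip({"inner.zip": inner, "readme.txt": b"top level"})

for item_path, content in walk_container(outer, "outer.zip"):
    print(item_path, len(content))
```

Because formats such as DOCX and XLSX are themselves ZIP containers, this sketch would descend into them and yield their internal XML parts; a real document filter would parse those formats properly instead.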
One final pointer within the broader “create an index” category: if possible, index email files directly as PST, OST, or MSG files rather than via Outlook.
Although the search engine can index Outlook emails via Outlook, going through Outlook / MAPI slows the indexer down compared with direct access to these file types.
Examine Index Logs
The second recommendation is to examine the index logs. The logs can be used to determine which files the search engine is unable to index for whatever reason. A key example is PDFs that are “image-only.”
A standard PDF file contains both text and graphics. If you can copy and paste a selection of text from a PDF into another file, you know it contains genuine text. However, “image-only” PDFs are distinct.
If you attempt to copy and paste what appear to be words from these, the operation fails. Because such files contain only images and no actual text, the search engine cannot index and search their contents. (The search engine will still be able to index the metadata, but not the main content.)
The problem is that “image-only” PDFs can sit alongside standard PDFs in data collections with nothing on the outside to indicate their presence.
However, the indexing log file will identify PDFs that contain only images. You can then convert these “image-only” PDFs to standard PDFs using an OCR application such as Adobe Acrobat and add them to your index.
Consider Caching Documents
The third recommendation is to consider caching documents that sit behind a remote or otherwise unreliable link, or that may become completely unavailable in their original location. A brief explanation of how search results are displayed helps to explain this suggestion.
A search engine answers search requests, whether single-threaded or multithreaded, using the data in the index. To show the complete text with highlighted hits, however, the search engine goes back to the original file or other data and retrieves a copy of the item. It then uses the index to determine the location of the hits within that copy and highlights them in the search results display.
The highlighted hits are the light that shines through your data.
This technique is uncomplicated if the original file is easily accessible and retrievable. However, if the original file is unavailable or is corrupted, the display process becomes inconvenient. The solution is to cache or store a complete copy of the file or other data in addition to the index. Even without access to the originals, the display process stays smooth and instantaneous while using that cache.
The disadvantage of caching is that it significantly increases the size of the index, as the index now stores the whole text of all files in addition to the basic index. However, caching is often worth it when the original is slow or unavailable.
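The trade-off can be sketched in a few lines: the index keeps hit locations, and the cache keeps a compressed copy of the text so hits can be highlighted without going back to the original. This is a simplified illustration under assumed names (`CachedIndex`, `add`, `highlight`), not any product’s storage format.

```python
import re
import zlib


class CachedIndex:
    """Toy index that stores hit offsets plus a compressed copy of each text."""

    def __init__(self):
        self.postings = {}  # word -> list of (doc name, start offset, end offset)
        self.cache = {}     # doc name -> zlib-compressed copy of the text

    def add(self, name, text):
        for match in re.finditer(r"\w+", text):
            word = match.group().lower()
            entry = (name, match.start(), match.end())
            self.postings.setdefault(word, []).append(entry)
        # The cached copy is what makes display independent of the original.
        self.cache[name] = zlib.compress(text.encode("utf-8"))

    def highlight(self, name, word):
        """Return the cached text with every hit for `word` wrapped in brackets."""
        text = zlib.decompress(self.cache[name]).decode("utf-8")
        spans = [(s, e) for doc, s, e in self.postings.get(word.lower(), [])
                 if doc == name]
        # Work from the end so earlier offsets stay valid as brackets are added.
        for start, end in sorted(spans, reverse=True):
            text = text[:start] + "[" + text[start:end] + "]" + text[end:]
        return text


index = CachedIndex()
index.add("song.txt", "Purple rain, purple haze")
print(index.highlight("song.txt", "purple"))
```

Note that `highlight` never touches the file system: everything it needs was stored at indexing time, which is exactly what makes the index larger.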
Maintain Your Indexes
The fourth recommendation is to keep your indexes current with newly added, deleted, or modified files. This procedure is much simpler than it sounds. It is not necessary to create an index from scratch to add something new. Instead, the search engine can examine each file automatically to determine whether it has been edited, removed, or added since the last index build and simply index “the difference.”
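One simple way to find “the difference” is to compare a snapshot of file modification times and sizes against the snapshot taken at the last index build. The sketch below assumes that approach; how a given search engine actually detects changes may differ, and `take_snapshot` and `diff_snapshots` are names invented for the example.

```python
import os


def take_snapshot(root):
    """Record (modification time, size) for every file under `root`."""
    snapshot = {}
    for folder, _dirs, files in os.walk(root):
        for filename in files:
            path = os.path.join(folder, filename)
            stat = os.stat(path)
            snapshot[path] = (stat.st_mtime, stat.st_size)
    return snapshot


def diff_snapshots(previous, current):
    """Classify paths as added, removed, or modified between two snapshots."""
    added = sorted(set(current) - set(previous))
    removed = sorted(set(previous) - set(current))
    modified = sorted(
        path for path in set(previous) & set(current)
        if previous[path] != current[path]
    )
    return added, removed, modified


# Hypothetical snapshots: {path: (mtime, size)}.
last_build = {"a.docx": (100.0, 10), "b.pdf": (100.0, 20), "c.msg": (100.0, 30)}
today = {"a.docx": (100.0, 10), "b.pdf": (250.0, 25), "d.xlsx": (260.0, 40)}

added, removed, modified = diff_snapshots(last_build, today)
print("index:", added + modified)
print("drop from index:", removed)
```

Only the added and modified files need to be read and indexed again; unchanged files are skipped entirely.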
A compress option minimises the additional baggage that can accumulate as a result of successive index updates.
Additionally, you can schedule automatic index updates via the Windows Task Scheduler. Notably, while an index is being updated, searches, including concurrent searching, can continue uninterrupted.
Refine Your Search Request
The fifth recommendation is to take care when framing a search request. For instance, natural language searching enables you to input a search request in “plain English” or even copy and paste an entire paragraph of text and receive relevance-ranked search results.
The term “plain English” is used here to capture the essence of natural language search. However, it is worth noting that a search engine can work automatically with any of the hundreds of languages that Unicode supports, including right-to-left languages such as Hebrew and Arabic, as well as double-byte languages such as Chinese, Japanese, and Korean.
Under the hood, relevance ranking operates in the following manner. If you search for purple or blue, and blue is prevalent across your indexed data while purple is significantly scarcer, then files containing purple will receive a higher relevance ranking. Additionally, files with a higher density of purple mentions obtain a higher relevance ranking.
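The purple-versus-blue behaviour can be sketched with a simple rarity-times-density score, in the spirit of classic TF-IDF weighting. This is an assumed, simplified formula chosen for illustration; it is not the actual ranking algorithm of dtSearch or any other product, and the sample documents are invented.

```python
import math
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"\w+", text.lower())


def rank(docs, query_terms):
    """Order document names by a rarity-weighted term density score."""
    tokens = {name: tokenize(text) for name, text in docs.items()}
    total = len(tokens)
    scores = {}
    for name, words in tokens.items():
        score = 0.0
        for term in query_terms:
            containing = sum(1 for other in tokens.values() if term in other)
            if containing == 0 or not words:
                continue
            # Rare terms (found in few documents) get a larger weight.
            rarity = math.log(1 + total / containing)
            # Dense mentions (many hits relative to length) score higher.
            density = words.count(term) / len(words)
            score += rarity * density
        scores[name] = score
    return sorted(scores, key=scores.get, reverse=True)


docs = {
    "a.txt": "blue sky and blue sea",
    "b.txt": "blue door with purple trim",
    "c.txt": "purple ink purple paper blue pen",
    "d.txt": "a blue car",
}
print(rank(docs, ["purple", "blue"]))
```

In this sample, blue appears in every document and purple in only two, so both purple documents outrank the blue-only ones, and the document with the denser purple mentions comes first.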
Natural language search requests are simple to construct, but it is frequently more fruitful to spend the time entering a precision search request instead.
A search engine can support phrase searching, Boolean and/or/not search requests, proximity searching in one direction (X before Y) or both directions (X before or after Y), concept searching, metadata-specific searching, number and numeric range searching, and date and date range searching, among other capabilities.
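Proximity searching, in either its one-direction or both-directions form, comes down to comparing word positions. Here is a minimal sketch under assumed names (`positions`, `near`); real engines use their own query syntax and read positions from the index rather than rescanning the text.

```python
import re


def positions(text):
    """Map each lowercase word to the list of word positions where it occurs."""
    table = {}
    for position, word in enumerate(re.findall(r"\w+", text.lower())):
        table.setdefault(word, []).append(position)
    return table


def near(text, first, second, distance, ordered=False):
    """True if `first` and `second` occur within `distance` words of each other.

    With ordered=True, `first` must come before `second` (X before Y).
    With ordered=False, either order counts (X before or after Y).
    """
    table = positions(text)
    for p in table.get(first.lower(), []):
        for q in table.get(second.lower(), []):
            gap = q - p
            if ordered and 0 < gap <= distance:
                return True
            if not ordered and 0 < abs(gap) <= distance:
                return True
    return False


text = "The invoice was sent before the payment arrived"
print(near(text, "invoice", "payment", 5, ordered=True))
print(near(text, "payment", "invoice", 5, ordered=True))
print(near(text, "payment", "invoice", 5))
```

The same position table also supports phrase searching, which is just the special case of an ordered search with a distance of one word between consecutive terms.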
Use these options to narrow down your search requests and get exactly what you’re looking for. Don’t overlook the more advanced search capabilities either, such as the ability to recognise credit card numbers in data, to generate and search for file hash values, to include positive and negative variable term weighting in specific metadata, and so on.
Fuzzy searching is a special type of search that you may wish to employ in conjunction with both natural language and structured search requests. Fuzzy searching is used to catch small typographical errors of the kind that occur in emails and OCR text. For example, you might add a low level of fuzziness to a search for purple so that you still find what you’re looking for, even with minor misspellings.
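A standard way to measure “minor misspellings” is edit distance: the number of single-character insertions, deletions, or substitutions needed to turn one word into another. The sketch below uses the classic Levenshtein distance as an assumed stand-in; the fuzzy matching inside a given search engine may work differently, and `fuzzy_matches` is a name invented for the example.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]


def fuzzy_matches(term, vocabulary, max_edits=1):
    """Return the indexed words within `max_edits` edits of the search term."""
    return [word for word in vocabulary if edit_distance(term, word) <= max_edits]


# "purp1e" is a typical OCR error: the letter l read as the digit 1.
vocabulary = ["purple", "purp1e", "purpel", "people"]
print(fuzzy_matches("purple", vocabulary))
print(fuzzy_matches("purple", vocabulary, max_edits=2))
```

Keeping the allowed number of edits low matters: at one edit the search picks up the OCR error, while at two edits it also starts matching the unrelated word “people”.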
Finally, in terms of search requests, you are not restricted to the default sorting option.
If relevance ranking is the default sorting option, you can click to quickly re-sort by ascending or descending file date, ascending or descending file size, or the occurrence of keywords in specific metadata. Each of these options offers a different view of the retrieved items.
Files That Are Relevant
The sixth recommendation is that once you’ve located what you’re looking for, you can tag and copy the important files you require.
You can also copy files from within a larger email archive or a compressed ZIP or RAR archive (no additional “un-ZIP” required), and you can instruct the search engine to generate a search report containing all results with as much context as you desire.
Search reports can be run on all retrieved files or on selected files.
These tips will assist you in navigating terabytes of data, whether it is your own or data from a third-party source that you have never seen before.
Image Credit: thirdman; pexels; thank you!