Drive has 700+ articles for digital transformation leaders written by StarCIO Digital Trailblazer, Isaac Sacolick. Learn more.

I should have written this post 5-10 years ago, when I was an expert working with search engines and text data. Better late than never, and hopefully this will help those of you who are starting out looking at solutions to better store, search, and data mine text documents.

So here are some things not to do (or avoid doing) with unstructured text Big Data:

1. Develop home-grown scripts to parse out keywords – This seems tempting at first, especially if you are working with simple file types, a small number of files, and/or only need to extract some simple keywords. But if you are working with a large number of documents, and especially if you need to infer context (where in the document was this found), entity relationships (how are these names connected), or other semantics, this can become an increasingly complex and expensive task. If you go down this path, consider using tools and libraries to get you started. Here is a good list.
2. Leverage your RDBMS to search CLOBs – Another tempting thing to do is to dump unstructured text into a relational database and use its built-in capabilities to search or mine it. The major RDBMS platforms (Oracle, Microsoft, and MySQL) all offer full text search capabilities. It’s been a number of years since I’ve been hands-on with these technologies, but I don’t think much has changed. As InfoWorld put it in an article on things never to do with an RDBMS, “You see a lot of people using complicated queries that are heavy on like and or operators. The results for these are ugly and the capabilities are weak.” Another good read is why don’t databases have good full text indexes.
3. Integrate a full text search engine – Full text search engines became popular in the late ’90s because they offered a lot more capabilities, scale, and speed in searching full text than relational databases. These technologies turned the internet from a collection of linked web pages into web directories that were keyword searchable. If your job is to provide users a simple keyword search against a repository of small documents, this approach is often sufficient. But there are some fundamental problems with this approach. First, most search engines don’t support read/write transactions or ACID transactions, so data with these requirements is often stored in both a traditional database and the search engine. Syncing the two data stores can be complicated if the volume or velocity of the data is high, and it often forces developers to batch index the search engine on a delayed schedule. Second, search engines are relatively good at searching keywords and phrases but are not particularly strong at searching by context. As with the RDBMS, developers are often forced to search using “like” clauses, and these engines fail to work well when the documents are large (say, books) or if there are inherent relationships in the content (say, searching for programming jobs at SaaS companies in the Pacific Northwest).
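
To make the difference between a "like"-style scan and keyword search concrete, here is a minimal sketch in Python. It is not how any particular database or search engine is implemented; the sample documents and the function names (`like_scan`, `build_index`, `keyword_search`) are made up for illustration. It only shows why substring matching returns noisy results and what a basic inverted index does differently.

```python
import re
from collections import defaultdict

# Hypothetical sample documents, invented for this illustration.
DOCS = {
    1: "The catalog lists every category of search engine.",
    2: "Our cat sat on the keyboard during the search demo.",
    3: "Relational databases store rows; search engines index terms.",
}

def like_scan(docs, term):
    """Mimic SQL LIKE '%term%': a case-insensitive substring scan of every document."""
    needle = term.lower()
    return sorted(doc_id for doc_id, text in docs.items() if needle in text.lower())

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Build an inverted index: each token maps to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def keyword_search(index, query):
    """Return the documents that contain every token in the query (AND semantics)."""
    tokens = tokenize(query)
    if not tokens:
        return []
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return sorted(result)

INDEX = build_index(DOCS)

print(like_scan(DOCS, "cat"))        # substring scan also hits "catalog" and "category"
print(keyword_search(INDEX, "cat"))  # token lookup matches only the whole word
```

The substring scan reads every document on every query and matches fragments of unrelated words, while the index answers from a precomputed lookup. Neither one knows anything about context or relationships in the content, which is the limitation described in the third item above.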

NoSQL databases have generated a lot of debate and media discussion and, more importantly, a lot of success over the last ten years as an alternative to the storage, query facilities, and data capabilities of the traditional RDBMS. My advice to architects and developers working with any form of complex, unstructured Big Data is to look past the simple approaches discussed above and prototype the capabilities in a NoSQL database designed for managing text and documents.
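
As a rough illustration of what searching by context can look like when content is kept as structured documents, here is a small Python sketch built around the job search example mentioned earlier. It uses plain dictionaries rather than any real NoSQL product's API, and the job postings, field names, and helper functions (`get_path`, `find`) are all assumptions made up for this example.

```python
# Hypothetical job postings stored as nested documents; all values are invented.
JOBS = [
    {"title": "Backend Programmer", "skills": ["python", "sql"],
     "company": {"name": "ExampleSoft", "model": "SaaS", "region": "Pacific Northwest"}},
    {"title": "Programmer Analyst", "skills": ["java"],
     "company": {"name": "SampleBank", "model": "Finance", "region": "Pacific Northwest"}},
    {"title": "SaaS Sales Lead", "skills": ["negotiation"],
     "company": {"name": "DemoCloud", "model": "SaaS", "region": "Northeast"}},
]

def get_path(doc, path):
    """Follow a dotted path such as 'company.region' into a nested dict; None if missing."""
    current = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current

def find(docs, title_keyword, criteria):
    """Return titles whose text contains the keyword and whose fields match the criteria."""
    keyword = title_keyword.lower()
    return [
        doc["title"]
        for doc in docs
        if keyword in doc["title"].lower()
        and all(get_path(doc, path) == value for path, value in criteria.items())
    ]

matches = find(
    JOBS,
    "programmer",
    {"company.model": "SaaS", "company.region": "Pacific Northwest"},
)
print(matches)
```

Because the company type and region are fields of the document rather than words buried in a block of text, the query can combine a keyword with the relationships around it. A real document database would add indexing, persistence, and a query language on top of this idea, which is why a prototype is worth the effort.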
