
Starfish Helps Tame the Wild West of Massive Unstructured Data



“What data do you have? And can I access it?” Those may seem like simple questions for any data-driven enterprise. But when you have billions of files spread across petabytes of storage on a parallel file system, they become very difficult questions to answer. It’s also an area where Starfish Storage shines, thanks to its unique data discovery tool, which is already used by many of the country’s top HPC sites and, increasingly, by GenAI shops too.

There are some paradoxes at play in the world of high-end unstructured data management. The bigger the file system gets, the less insight you have into it. The more bytes you have, the less useful the bytes become. The closer we get to using unstructured data to achieve brilliant, amazing things, the bigger the file-access challenges become.

It’s a situation that Starfish Storage CEO and founder Jacob Farmer has run into time and time again since he started the company 10 years ago.

“Everybody wants to mine their files, but they’re going to come up against the harsh truth that they don’t know what they have, most of what they have is crap, and they don’t even have access to it to be able to do anything,” he told Datanami in an interview.

Many big data challenges have been solved over the years. Physical limits to data storage have mostly been eliminated, enabling organizations to stockpile petabytes or even exabytes of data across distributed file systems and object stores. Huge amounts of processing power and network bandwidth are available. Advances in machine learning and artificial intelligence have lowered barriers to entry for HPC workloads. The generative AI revolution is in full swing, and respectable AI researchers are talking about artificial general intelligence (AGI) being created within the decade.

So we’re benefiting from all of those advances, but we still don’t know what’s in the data and who can access it? How can that be?

Unstructured data management is no match for metadata-driven cowboys

“The hard part for me is explaining that these aren’t solved problems,” Farmer continued. “The people who are suffering with this consider it a fact of life, so they don’t even try to do anything about it. [Other vendors] don’t go into your unstructured data, because it’s kind of accepted that it’s uncharted territory. It’s the Wild West.”

A Few Good Cowboys

Farmer elaborated on the nature of the unstructured data problem, and Starfish’s solution to it.

“The problem that we solve is ‘What the hell are all these files?’” he said. “There just comes a point in file management where, unless you have power tools, you just can’t operate with multiple billions of files. You can’t do anything.”

Run a search on a desktop file system, and it will take a few minutes to find a specific file. Try to do that on a parallel file system composed of billions of individual files that occupy petabytes of storage, and you had better have a cot ready, because you’ll likely be waiting quite a while.
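To make the scale problem concrete, here is a minimal sketch (in Python, and emphatically not Starfish code) of the brute-force traversal a desktop search effectively performs. Every directory entry costs at least one metadata round trip to the storage system, which is why the same loop that finishes in seconds on a laptop can run for days against billions of files.

```python
import os

def count_files(root: str) -> int:
    """Brute-force enumeration: touch every directory entry once.
    Fine at desktop scale; hopeless at billions of files, where each
    entry costs a metadata round trip to the parallel file system."""
    total = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        total += len(filenames)
    return total

print(count_files("/data"))  # "/data" is a placeholder mount point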

Most of Starfish’s customers are actively using large amounts of data stored in parallel file systems, such as Lustre, GPFS/Spectrum Scale, HDFS, XFS, and ZFS, as well as the file systems used by storage vendors like VAST Data, Weka, Hammerspace, and others.

Many Starfish customers are doing HPC or AI research work, including customers at national labs like Lawrence Livermore and Sandia; research universities like Harvard, Yale, and Brown; government agencies like the CDC and NIH; research hospitals like Cedars-Sinai and Duke Health; animation companies like Disney and DreamWorks; and most of the top pharmaceutical research firms. Ten years into the game, Starfish customers have more than an exabyte of data under management.

These outfits need access to data for HPC and AI workloads, but in many cases, the data is spread across billions of individual files. The file systems themselves generally do not provide tools that tell you what’s in the file, when it was created, and who controls access to it. Files may have timestamps, but they can easily be changed.
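The point about mutable timestamps is easy to demonstrate. The snippet below is a toy illustration with a placeholder filename: POSIX stat() exposes size, owner, and times but says nothing about content, and a single os.utime() call can backdate a file by a year.

```python
import os

FILE = "results.csv"  # placeholder path, purely for illustration

st = os.stat(FILE)
print(st.st_size, st.st_uid, st.st_mtime)  # size, owner uid, mtime

# Backdate the modification time by one year with one call --
# which is why timestamps alone can't anchor a retention policy.
YEAR = 365 * 24 * 3600
os.utime(FILE, (st.st_atime, st.st_mtime - YEAR))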

The problem is, this metadata is critical for determining whether the file should be retained, moved to an archive running on lower-cost storage, or deleted entirely. That’s where Starfish comes in.

The Starfish Approach

Starfish employs a metadata-driven approach to tracking the origin date of each file, the type of data contained in the file, and who the owner is. The product uses a Postgres database to maintain an index of all the files in the file systems and how they have changed over time. When it comes time to take action on a group of files–say, deleting all files that are older than one year–Starfish’s tagging system makes it easy for an administrator with the proper credentials to do so.
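Starfish hasn’t published its schema, but the shape of the approach can be sketched. In the hypothetical Python example below (the database, table, and tag names are invented for illustration), the Postgres index–not a live crawl–answers the policy question, so selecting every stale, purge-tagged file takes a single query.

```python
import psycopg2

conn = psycopg2.connect("dbname=file_index")  # hypothetical database
cur = conn.cursor()

# Invented schema: one row per file, kept current by the crawler.
cur.execute("""
    CREATE TABLE IF NOT EXISTS files (
        path       text PRIMARY KEY,
        size_bytes bigint,
        owner      text,
        mtime      timestamptz,
        tags       text[]
    )
""")
conn.commit()

# Policy question answered from the index, not the file system:
# which files are over a year old and tagged as safe to purge?
cur.execute("""
    SELECT path FROM files
    WHERE mtime < now() - interval '1 year'
      AND 'purge-ok' = ANY (tags)
""")
for (path,) in cur.fetchall():
    print("deletion candidate:", path)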


There’s another paradox that crops up around tracking unstructured data. “You have to know what the files are in order to know what the files are,” Farmer said. “Often you have to open the file and look, or you need user input or you need some other APIs to tell you what the files are. So our whole metadata system allows us to understand, at a much deeper level, what’s what.”

Starfish isn’t the only crawler occupying this pond. There are competing unstructured data management companies, as well as data catalog vendors that focus mainly on structured data. The biggest competitors, though, are the HPC sites that think they can build a file catalog based on scripts. Some of those script-based approaches work for a while, but when they hit the upper reaches of file management, they fold like tissue.

“A customer that has 20 ZFS servers might have homegrown ways of doing what we do. No single file system is that big, and they might have an idea of where to go looking, so they might be able to get it done with conventional tools,” he said. “But when file systems become big enough, the environment becomes diverse enough, or when people start to spread files over a wide enough area, then we become the global map to where the heck the files are, as well as the tools for doing whatever it is you need to do.”

There are also lots of edge cases that throw sand into the gears. For instance, data can be moved by researchers, and directories can be renamed, leaving broken links behind. Some applications may generate 10,000 empty directories, or create more directories than there are actual files.
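Those edge cases are simple to spot in miniature, even if they are brutal at scale. A short sketch (again illustrative, not Starfish code) that flags both of the cases just mentioned:

```python
import os

def audit(root: str):
    """Flag two edge cases that trip up conventional tools:
    symlinks whose targets have moved, and empty directories."""
    broken_links, empty_dirs = [], []
    for dirpath, dirnames, filenames in os.walk(root):
        if not dirnames and not filenames:
            empty_dirs.append(dirpath)          # nothing inside at all
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) and not os.path.exists(path):
                broken_links.append(path)       # link exists, target gone
    return broken_links, empty_dirs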

“You hit that with a conventional product built for the enterprise, and it breaks,” Farmer said. “We represent kind of this API to get to your files that, at a certain scale, there’s no other way to do it.”

Engineering Unstructured File Management

Farmer approached the challenge as an engineering problem, and he and his team engineered a solution for it.

“We engineered it to work really, really well in big, complicated environments,” he said. “I have the index to navigate big file systems, and the reason that the index is so elusive, the reason this is special, is because these file systems are so freaking big that, if it’s not your full-time job to manage giant file systems like that, there’s no way that you can do it.”

The Postgres-powered index allows Starfish to maintain a full history of the file system over time, so a customer can see exactly how the file system changed. The only way to do that, Farmer said, is to repeatedly scan the file system and compare the results to the previous state. At the Lawrence Livermore National Lab, the Starfish catalog is about 30 seconds behind the production file system. “So we’re doing a really, really tight synchronization there,” he said.
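The article doesn’t detail Starfish’s synchronization internals, but the scan-and-compare idea reduces to a diff between two snapshots. A minimal sketch, treating each snapshot as a mapping from path to modification time:

```python
def diff_scans(previous: dict, current: dict):
    """Compare two {path: mtime} snapshots and report the deltas.
    Re-running this against fresh scans, and applying the deltas to
    the index, is the essence of keeping a catalog nearly in sync."""
    created  = current.keys() - previous.keys()
    deleted  = previous.keys() - current.keys()
    modified = {p for p in current.keys() & previous.keys()
                if current[p] != previous[p]}
    return created, deleted, modified

prev = {"/data/a.txt": 100.0, "/data/b.txt": 200.0}
curr = {"/data/b.txt": 250.0, "/data/c.txt": 300.0}
print(diff_scans(prev, curr))  # c.txt created, a.txt deleted, b.txt modified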

Some file systems are harder to deal with than others. For instance, Starfish taps into the internal policy engine exposed by IBM’s GPFS/Spectrum Scale file system to get insight to feed the Starfish crawler. Getting that data out of Lustre, however, proved difficult.
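On the GPFS/Spectrum Scale side, the policy engine can enumerate files far faster than a POSIX walk. The sketch below shows the general shape of driving it from Python; the LIST rule and mmapplypolicy flags follow Spectrum Scale conventions, but the specific rule, file system name, and paths are illustrative assumptions, not Starfish’s actual integration.

```python
import subprocess
import tempfile

# Illustrative Spectrum Scale LIST rule: have the policy engine
# enumerate every file along with its size and modification time.
RULE = """
RULE 'listall' LIST 'allfiles'
    SHOW(VARCHAR(FILE_SIZE) || ' ' || VARCHAR(MODIFICATION_TIME))
"""

with tempfile.NamedTemporaryFile("w", suffix=".rules",
                                 delete=False) as f:
    f.write(RULE)
    rules_path = f.name

# 'gpfs0' is a placeholder file system name; -I defer writes the
# matched file list to /tmp/scan* instead of acting on the files.
subprocess.run(
    ["mmapplypolicy", "gpfs0", "-P", rules_path,
     "-I", "defer", "-f", "/tmp/scan"],
    check=True,
)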

“Lustre does not give up its metadata very easily. It’s not a high metadata performance system,” Farmer said. “Lustre is the hardest file system to crawl among everything, and we get the best result on it because we were able to use some other Lustre mechanisms to make a super powerful crawler.”

Some commercial products make it easy to track the data. Weka, for instance, exposes metadata more easily, and VAST has its own data catalog that, in some ways, duplicates the work that Starfish does. In that case, Starfish partakes of what VAST offers to help its customers get what they need. “We work with everything, but in many cases we’ve done specific engineering to take advantage of the nuances of the specific file system,” Farmer said.

Getting Access to Data

Getting access to structured data–i.e. data that’s sitting in a database–is usually pretty straightforward. Somebody from the line of business typically owns the data on Snowflake or Teradata, and they grant or deny access to the data according to their company’s policy. Simple as that.

Better ask your storage admin nicely (Alexandru Chiriac/Shutterstock)

That’s not how it typically works in the world of unstructured data–i.e. data sitting in a file system. File systems are considered part of the IT infrastructure, and so the person who controls access to the files is the storage or system administrator. That creates issues for the researchers and data scientists who want to access that data, Farmer said.

“The only way to get to all the files, or to help yourself to analyzing files that aren’t yours, is to have root privileges on the file system, and that’s a non-starter in most organizations,” Farmer said. “I have to sell to the people who operate the infrastructure, because they’re the ones who own the root privileges, and thus they’re the ones who decide who has access to what files.”

It’s baffling at some level why organizations are relying on archaic, 50-year-old processes to get access to what could be the most important data in an organization, but that’s just the way it is, Farmer said. “It’s kind of funny where just everybody’s settled into an antiquated model,” he said. “It’s both what’s good and bad about them.”

Starfish is ostensibly a data discovery tool and catalog for unstructured data, but it also functions as an interface between the data scientists who want access to the data and the administrators with root access who can grant it. Without something like Starfish functioning as the intermediary, requests for access, moves, archives, and deletes would likely be handled much less efficiently.

“POSIX file systems are severely limited tools. They’re 50-plus years old,” he said. “We’ve come up with ways of working within those constraints to enable people to easily do things that would otherwise require making a list and emailing it or getting on the phone or whatever. We make it seamless to be able to use metadata associated with the file system to drive processes.”

We may be on the cusp of developing AGI with super-human cognitive abilities, thereby accelerating the already rapid pace of IT evolution and forever changing the fate of the world. Just don’t forget to be nice when you ask the storage administrator for access to the data, please.

“Starfish has been quietly solving a problem that everybody has,” Farmer said. “Data scientists don’t appreciate why they would need it. They see this as, ‘There must be tools that exist.’ It’s not like, ‘Ohhh, you have the ability to do this?’ It’s more like, ‘What, that’s not already a thing we can do?’

“The world hasn’t discovered yet that you can’t get to the files.”

Related Items:

Getting the Upper Hand on the Unstructured Data Problem

Data Management Implications for Generative AI

Big Data Is Still Hard. Here’s Why