The effects of climate change and inequality are threatening societies across the world, but there is still an annual funding gap of US$2.5 trillion to achieve the UN Sustainable Development Goals by 2030. A substantial amount of that money is expected to come from private sources like pension funds, but institutional investors often struggle to efficiently incorporate sustainability into their investment decisions.
Matter is a Danish fintech on a mission to make capital work for people and the planet. The company helps investors understand how corporations and governments align with sustainable practices, across climate, environmental, social and governance-related themes. Matter has partnered with leading financial companies, such as Nasdaq and Nordea, on providing sustainability data to investors.
Matter collects data from hundreds of independent sources in order to connect investors to insights from experts in NGOs and academia, as well as to signals from trusted media. We utilize state-of-the-art machine learning algorithms to analyze complex data and extract valuable key points relevant to the evaluation of the sustainability of investments. Matter sets itself apart by relying on a wisdom-of-the-crowd approach, and by allowing our clients to access all insights via a customized reporting system, APIs or integrated web elements that empower professional managers, as well as retail investors, to invest more sustainably.
NoSQL Data Makes Analytics Challenging
Matter’s services range from end-user-facing dashboards and portfolio summarization to sophisticated data pipelines and APIs that track sustainability metrics on investable companies and organizations all over the world.
In several of these scenarios, both NoSQL databases and data lakes have been very useful because of their schemaless nature, variable cost profiles and scalability characteristics. However, such solutions also make arbitrary analytical queries hard to build for and, once implemented, typically quite slow, negating some of their original upsides. While we examined and implemented different rematerialization strategies for different parts of our pipelines, such solutions typically take a substantial amount of time and effort to build and maintain.
Decoupling Queries from Schema and Index Design
We use Rockset in multiple parts of our data pipeline because of how easy it is to set up and interact with. It provides us with a simple “freebie” on top of our existing data stores, allowing us to query them without frontloading decisions on indexes and schema designs. This is highly desirable for a small company with an expanding product and concept portfolio.
Our initial use case for Rockset, however, was not just as a nice addition to an existing pipeline, but as an integral part of our NLP (Natural Language Processing)/AI product architecture that enables short development cycles as well as a dependable service.
Implementing Our NLP Architecture with Rockset
Large parts of what make up responsible investments are not possible to achieve using traditional numerical analysis, as there are many qualitative intricacies in corporate responsibility and sustainability. To measure and gauge some of these details, Matter has built an NLP pipeline that actively ingests and analyzes news feeds to report on sustainability- and responsibility-oriented news for more than 20,000 companies. Bringing in data from our vendors, we continuously process millions of news articles from thousands of sources with sentence splitting, named entity recognition, sentiment scoring and topic extraction using a mix of proprietary and open-source neural networks. This quickly yields many millions of rows of data with multiple metrics that are useful on both an individual and aggregate level.
To retain as much data as possible and ensure the transparency needed in our line of business, we store all our data after each step in our terabyte-scale, S3-backed data lake. While Amazon Athena provides immense value for several parts of our flow, it falls short of delivering useful analytical queries at the speed, scale and complexity we need. To solve this issue, we simply connect Rockset to our S3 lake and auto-ingest that data, letting us run ad-hoc queries that are much more performant and cost-effective than those offered by Athena.
With our NLP-processed news data at hand, we can dive in to uncover many interesting insights:
- How are news sources reporting on a given company’s carbon emissions, labor treatment, lobbying behavior, etc.?
- How has this evolved over time?
- Are there any ongoing scandals?
Exactly which pulls are interesting is uncovered in tight collaboration with our early partners, meaning that we need the querying flexibility provided by SQL solutions, while also benefiting from an easily expandable data model.
User requests typically consist of queries for several thousand asset positions in their portfolios, along with complex analyses such as trend forecasting and lower- and upper-bound estimates for sentence metric predictions. We send this high volume of queries to Rockset and use the query results to pre-materialize all the different pulls in a DynamoDB database with simple indices. This architecture yields a fast, scalable, flexible and easily maintainable end-user experience. We are capable of delivering ~10,000 years of daily sentiment data every second with sub-second latencies.
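To illustrate the pre-materialization idea described above, here is a minimal sketch in Python. It is not Matter's implementation: the row fields (`company`, `date`, `sentiment`), the `company#date` key format and the function name are all hypothetical, the aggregation is done in plain Python as a stand-in for the analytical query, and an in-memory dict stands in for the key-value table.

```python
from collections import defaultdict
from statistics import mean


def materialize_daily_sentiment(rows):
    """Aggregate sentence-level rows into one value per (company, date).

    The result is keyed by a simple composite string so that serving a
    request becomes a plain key lookup instead of an analytical query.
    """
    grouped = defaultdict(list)
    for row in rows:
        grouped[(row["company"], row["date"])].append(row["sentiment"])
    return {
        f"{company}#{date}": mean(scores)
        for (company, date), scores in grouped.items()
    }


# Hypothetical sentence-level rows, as the analytical layer might return them.
rows = [
    {"company": "ACME", "date": "2021-03-01", "sentiment": 0.5},
    {"company": "ACME", "date": "2021-03-01", "sentiment": 0.0},
    {"company": "ACME", "date": "2021-03-02", "sentiment": -1.0},
]

# The dict plays the role of the key-value store with simple indices.
store = materialize_daily_sentiment(rows)
daily_value = store["ACME#2021-03-01"]
```

The point of the design is that the expensive, flexible query runs once per refresh, while end-user reads only touch precomputed keys.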
We are happy to have Rockset as part of our stack because of how easy it now is for us to expand our data model, auto-ingest many data sources and introduce completely new query logic without having to rethink major parts of our architecture.
Flexibility to Add New Data and Analyses with Minimal Effort
We originally looked at implementing a delta architecture for our NLP pipeline, meaning that we would calculate changes to relevant data views given a new row of data and update the state of these views. This would yield very high performance at a relatively low infrastructure and service cost. Such a solution would, however, limit us to queries that are possible to formulate in such a way up front, and would incur significant build cost and time for every delta operation we would be interested in. This would have been a premature optimization that was overly narrow in scope.
An alternative delta architecture that requires queries to be formulated up front
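As a rough illustration of what a single delta operation involves, the sketch below maintains a running mean sentiment per company and folds each new row into it without rescanning history. The view shape, field names and function name are assumptions made for this example only.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class SentimentView:
    """Running aggregate for one company (hypothetical view shape)."""
    count: int = 0
    total: float = 0.0

    @property
    def mean(self) -> float:
        return self.total / self.count if self.count else 0.0


def apply_delta(views: Dict[str, SentimentView], row: dict) -> None:
    """Update the affected view in place from one newly arrived row."""
    view = views.setdefault(row["company"], SentimentView())
    view.count += 1
    view.total += row["sentiment"]


views: Dict[str, SentimentView] = {}
for row in [
    {"company": "ACME", "sentiment": 0.5},
    {"company": "ACME", "sentiment": -0.25},
    {"company": "Globex", "sentiment": 1.0},
]:
    apply_delta(views, row)
```

Each update is cheap, but the view only answers the one question it was designed for. A new question, such as a median or a per-topic breakdown, would need its own state and its own update rule, which is the up-front cost described above.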
Because of this, we really saw the need for an addition to our pipeline that would allow us to quickly test and add complex queries to support ever-evolving data and insight requirements. While we could have implemented an ETL trigger on top of our S3 data lake ourselves to feed into our own managed database, we would have had to handle suboptimal indexing, denormalization and errors in ingestion, and resolve them ourselves. We estimate that it would have taken us 3 months to get to a rudimentary implementation, whereas we were up and running using Rockset in our stack within a couple of days.
The schemaless, easy-to-manage, pay-as-you-go nature of Rockset makes it a no-brainer to us. We can introduce new AI models and data fields without having to rebuild the surrounding infrastructure. We can simply expand the existing model and query our data whichever way we like with minimal engineering, infrastructure and maintenance.
Because Rockset allows us to ingest from many different sources in our cloud, we also find query synergies between different collections in Rockset. “Show me the average environmental sentiment for companies in the oil extraction industry with revenue above $100 billion” is one type of query that would have been hard to perform prior to the introduction of Rockset, because the data points in the query originate from separate data pipelines.
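The shape of such a cross-collection query can be sketched as a SQL join. The example below runs against an in-memory SQLite database purely as a stand-in so that it is self-contained; it does not use Rockset's client or SQL dialect, and the table names, column names and sample rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE company_profiles (company TEXT, industry TEXT, revenue_usd REAL);
    CREATE TABLE news_sentiment (company TEXT, topic TEXT, sentiment REAL);
    """
)
# Two tables standing in for collections fed by separate pipelines.
conn.executemany(
    "INSERT INTO company_profiles VALUES (?, ?, ?)",
    [
        ("OilCo", "oil extraction", 150e9),
        ("SmallOil", "oil extraction", 20e9),
        ("TechCo", "software", 300e9),
    ],
)
conn.executemany(
    "INSERT INTO news_sentiment VALUES (?, ?, ?)",
    [
        ("OilCo", "environment", -0.6),
        ("OilCo", "environment", -0.2),
        ("OilCo", "labor", 0.4),
        ("SmallOil", "environment", 0.9),
        ("TechCo", "environment", 0.5),
    ],
)

# Average environmental sentiment over all matching rows for large oil companies.
QUERY = """
    SELECT AVG(s.sentiment)
    FROM news_sentiment AS s
    JOIN company_profiles AS p ON p.company = s.company
    WHERE s.topic = ?
      AND p.industry = ?
      AND p.revenue_usd > ?
"""
avg_sentiment = conn.execute(
    QUERY, ("environment", "oil extraction", 100e9)
).fetchone()[0]
```

Note that this version averages over individual sentiment rows; averaging per company first and then across companies would be an equally reasonable reading of the question and would need a nested aggregation.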
Another synergy comes from the ability to write to Rockset collections via the Rockset Write API. This allows us to correct bad predictions made by the AI via our custom tagging app, tapping into the latest data ingested in our pipeline. In an alternative architecture, we would have to set up another synchronization job between our tagging application and NLP database which would, again, incur build cost and time.
Using Rockset in the architecture results in greater flexibility and shorter build time
High-Performance Analytics on NoSQL Data When Time to Market Matters
If you are anything like Matter and have data stores that would be useful to query, but you are struggling to make NoSQL and/or Presto-based solutions such as Amazon Athena fully support the queries you need, I recommend Rockset as a highly valuable service. You can build or buy solutions to each of the problems I have outlined in this post individually, and these might provide more ingest options, better absolute performance, lower marginal costs or higher scalability potential. However, I have yet to find anything that comes remotely close to Rockset across all of these areas at the same time, in a setting where time to market is a highly valuable metric.
Alexander Harrington is CTO at Matter, coming from a business-engineering background with a particular emphasis on utilizing emerging technologies in existing areas of business.
Dines Selvig is Lead on the AI development at Matter, building an end-to-end AI system to help investors understand the sustainability profile of the companies they invest in.