Fueling Your AI with Data Lakes

Data lakes are becoming the heartbeat of organizational operations. They hold the key to optimizing your workflow and to making accurate, data-driven strategic decisions. A data lake comprises sensor feeds, system logs, real-time data streams, analytics, and many other datasets, stored in various formats and in large quantities. Because the data is often kept in its native format, it has not been trimmed down to a subset of fields for a specific purpose, so any number of operations can still be performed on it. While it is cost-prohibitive to assign a team to transform anything but the absolutely critical datasets, AI is well suited to the task and can perform the analyses that drive business transformation.

Using Data Lakes

Data lakes can contain data in both structured and unstructured formats. Typically, the subset of data necessary for standard operations is distilled into user-navigable formats such as CSV files, spreadsheets, or databases. While this pattern maintains the status quo, the rest of the data is left untouched, and little insight is gained from it. Allocating human resources to transforming the remaining data into usable formats is costly, and the rewards are uncertain. It is therefore important to structure data lakes so that an AI algorithm can consume them efficiently.

Using Metadata

The first step in optimizing your data lakes for AI consumption is to add relevant metadata that describes the file contents. While the files themselves contain useful data, knowing where and when the data originated is just as useful. Depending on the file type, metadata can be applied to the file itself or added to the data contained within. The more descriptive the metadata is, the more connections AI can make. For example, if a file contains logs from a weather sensor, it would be useful to know the sensor function (wind speed), the parent system the sensor belongs to, its geographic location, the date range to which the log entries pertain, and the software version. With this additional data, better correlations can be drawn to support strategic decisions. The more references the metadata contains to other related data (as described in the next section), the easier it will be for AI to analyze.
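As a minimal sketch of the weather sensor example, the metadata could be written to a JSON "sidecar" file stored next to the log. The class name, field names, and the sidecar approach are assumptions chosen for illustration, not a prescribed schema.

```python
import json
from dataclasses import dataclass, asdict, field
from datetime import date
from typing import List


@dataclass
class SensorLogMetadata:
    """Hypothetical metadata record for one sensor log file."""
    sensor_function: str          # e.g. "wind_speed"
    parent_system: str            # system the sensor belongs to
    latitude: float               # geographic location of the sensor
    longitude: float
    log_start: date               # first day covered by the log entries
    log_end: date                 # last day covered by the log entries
    software_version: str
    related_files: List[str] = field(default_factory=list)  # references to other data


def to_sidecar_json(meta: SensorLogMetadata) -> str:
    """Serialize the metadata to JSON, writing dates in ISO 8601 form."""
    record = asdict(meta)
    record["log_start"] = meta.log_start.isoformat()
    record["log_end"] = meta.log_end.isoformat()
    return json.dumps(record, indent=2, sort_keys=True)


# Illustrative values only.
example = SensorLogMetadata(
    sensor_function="wind_speed",
    parent_system="tower-a",
    latitude=32.78,
    longitude=-96.80,
    log_start=date(2024, 3, 1),
    log_end=date(2024, 3, 31),
    software_version="1.4.2",
    related_files=["signal-quality/tower-a/2024-03.csv"],
)
sidecar_text = to_sidecar_json(example)
```

Keeping the field names identical across every file type is what lets an automated process join records from different sources.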

Organizing the Data

Once the files accurately describe their contents, they need to be organized into an easily traversable convention. Grouping by file type, parent system, and overall system goal allows the algorithm to be tuned to quickly identify patterns and access related information. The following pattern would work well for organizing the data lake into a format that AI could easily traverse:

- Source

- Data State (raw, parsed, integrated)

- Subject Area (weather reporting, user interaction, site analytics, etc.)

- Date

- Parent System

- Reporting Node

The top level is the high-level source of the data. This could represent a standalone data lake or a third-party integration. The data is then organized into one of three states: “raw”, “parsed”, or “integrated”. All incoming data is routed to “raw” and moved to “parsed” as the algorithm compiles it into user-friendly files. Data to be integrated into established data warehouses is further transformed into relevant reports and compatible data structures. Within these state folders, the data is grouped by subject area and then by date. Finally, the data is grouped by parent system and by reporting node. If the system is complex enough to warrant sub-system grouping, that level would be useful as well.
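The hierarchy described above can be sketched as a small path builder. The function name, the argument names, and the choice of an ISO date for the date folder are assumptions made for this example.

```python
from datetime import date
from pathlib import PurePosixPath

# The three data states named in the text.
VALID_STATES = ("raw", "parsed", "integrated")


def lake_path(source: str, state: str, subject_area: str, day: date,
              parent_system: str, reporting_node: str,
              filename: str) -> PurePosixPath:
    """Build source/state/subject/date/parent-system/node/filename."""
    if state not in VALID_STATES:
        raise ValueError(f"unknown data state: {state!r}")
    return PurePosixPath(
        source,
        state,
        subject_area,
        day.isoformat(),      # assumed date folder format: YYYY-MM-DD
        parent_system,
        reporting_node,
        filename,
    )


# Illustrative values only.
example_path = lake_path(
    "internal", "raw", "weather-reporting", date(2024, 3, 15),
    "tower-a", "anemometer-01", "wind.log",
)
```

Routing every write through one function like this is a simple way to keep the folder order from drifting between teams or ingestion jobs.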

When applying the metadata and organization, it is vital to be consistent with naming conventions and references. While an AI algorithm can be tuned to tolerate a known amount of inconsistency, conventions that are not strictly followed will be a constant source of frustration and potential inaccuracy.
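One way to enforce consistency is to check names before data lands in the lake. The sketch below assumes a lowercase, hyphen-separated convention; the pattern and the function name are hypothetical, and any convention works as long as it is applied everywhere.

```python
import re

# Assumed convention: lowercase letters and digits, separated by single hyphens.
NAME_PATTERN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def find_nonconforming(segments):
    """Return the folder or file name segments that break the convention."""
    return [s for s in segments if NAME_PATTERN.fullmatch(s) is None]


# Illustrative values only: "Tower_A" uses uppercase and an underscore.
violations = find_nonconforming(["weather-reporting", "Tower_A", "anemometer-01"])
```

A check like this can run at ingestion time so that violations are rejected or renamed before they spread.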

Leveraging the Optimized Data Lake

Returning to the weather example, an algorithm could identify exceptionally windy days and correlate them with degraded or interrupted signals as antennas sway. This may lead to a business decision to add reinforced hardware and reduce the degradation in service quality. While a human or a team could identify this hardware deficiency from the raw data, it would be akin to finding a needle in a haystack, with little promise of an impressive ROI. In contrast, using AI to identify correlations such as this is extremely cost-effective.
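To make the wind and signal example concrete, here is a minimal sketch that computes a Pearson correlation coefficient between daily wind speed and signal quality. The numbers are invented for illustration, and a production system would likely use a far richer model than a single coefficient.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / sqrt(var_x * var_y)


# Made-up daily readings: wind speed and fraction of signal delivered.
wind_speed = [5, 12, 30, 45, 8, 38]
signal_quality = [0.98, 0.95, 0.80, 0.62, 0.97, 0.70]

r = pearson(wind_speed, signal_quality)
# A value of r close to -1 suggests signal quality falls as wind speed rises.
```

The join between the two series is only possible because both datasets carry matching metadata, such as the date and the parent system.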

AI can identify outlying cases and trends in the mounds of data in your data lakes, but only if the data can be consumed effectively. If you currently have a data lake or are interested in adding one to your organization’s toolset, consider carefully how the data is organized and stored, as this first step will prove to be the most important for any AI project. While AI algorithms are a powerful tool for analyzing data at this scale, the data must be organized in a manner well suited to an automated process.

Liquid Analytics works with clients to deliver AI decisions that provide high ROI for business initiatives. Contact us to get started today.