Much of the CluedIn team works with data on a day-to-day basis. Whether we are cleaning data or integrating it, our days are consumed by working with raw files. We also spend a lot of time onsite with our customers, and we see a practice in the data space that we consider counterproductive, which is why we don’t support it within our platform.
The practice I speak of is pulling in data directly from Excel, CSV, TSV or other files. Unfortunately, it is normal practice today that, when cleaning and preparing data for a use case, the data engineer asks for an “export of the data to work with”. At CluedIn we are on a mission to remove this phrase from people’s vocabulary completely. It doesn’t help that the majority of available platforms support this, and many even make it the primary choice.
Let’s break it down.
1: The first problem is that as soon as the export is generated, it is static data. Data is not static by nature, and the benefits of working this way do not outweigh the downsides.
2: If a data engineer cleans this data, the cleaned version never makes it back to the source, so the rest of the business gains no value from the cleaning that was done.
3: Data lineage cannot practically be tracked with this practice. Technically it can be, but what it would take to do so is not realistic today. From a governance perspective, this means a business has no idea what potentially sensitive data its employees have on their own machines. No contract in the world that a data engineer has signed can protect a business from a breach if this data is leaked. It would be no excuse to say “But our engineers are told they can’t do this”.
4: The data engineer gets what the data owner gives them, not the other way around. The data engineer should have easy access to the data and be able to project out what they need to do their job. As long as proper access control is in place and access has been granted to that data engineer, we remove a lot of the middle work.
5: If we are to push this data back to the source, it needs to fit a purpose-built model for each source. This does not scale. Things that don’t scale don’t get done.
6: Why does the data engineer even need to ask for the data? They should only need to ask for permission, or be given it proactively, e.g. “Hi Sarah, I need you to build a model on Customer Churn so I have given you permission to the data”.
How do we help?
The value of CluedIn starts right at the ingestion of data. We have over 150 prebuilt integrations to databases, SaaS tools, data lakes and streaming services, which allow a developer to simply authenticate with a source to start accessing its data through CluedIn. Most importantly, the source is now in sync with CluedIn, so new changes in the source will propagate to CluedIn.
Data cleaning and preparation tools are provided directly within the CluedIn web application. If you want to pull this data locally into Excel, R or Python, we expose GraphQL for developers to query the data. Most importantly, though, we provide streaming of this data so that lineage can be tracked. This simply means that instead of querying the data through CluedIn, you set up a stream from CluedIn to your target based on a GraphQL query. Because of this, real-time lineage is available between CluedIn and your target tool. Streaming is not supported by all platforms, but it definitely is by popular data tools such as Azure ML, RapidMiner, Qlik and Power BI, and by programming languages like Python, R, C# and Java.
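To make the querying option a little more concrete, here is a minimal Python sketch of building a GraphQL request over HTTP. It is only an illustration under stated assumptions: the endpoint URL, the bearer token and the query's field names are hypothetical placeholders, not CluedIn's documented API or schema. The only thing it relies on is the common convention of POSTing a JSON body with "query" and "variables" keys.

```python
import json
import urllib.request

# Hypothetical values: the endpoint URL and token are placeholders for
# illustration, not CluedIn's documented API.
GRAPHQL_ENDPOINT = "https://cluedin.example.com/graphql"
API_TOKEN = "replace-with-your-token"

# Hypothetical query: the field names are assumptions, not a real schema.
CUSTOMER_QUERY = """
query Customers($limit: Int!) {
  customers(limit: $limit) {
    id
    name
  }
}
"""


def build_graphql_request(query, variables, endpoint=GRAPHQL_ENDPOINT, token=API_TOKEN):
    """Build (but do not send) an HTTP POST carrying a GraphQL query."""
    payload = json.dumps({"query": query, "variables": variables}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=payload,
        headers={
            "Content-Type": "application/json",
            # Bearer authentication is an assumption for this sketch.
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def run_query(request):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


request = build_graphql_request(CUSTOMER_QUERY, {"limit": 10})
# rows = run_query(request)["data"]["customers"]  # requires a real endpoint
```

Note that a one-off query like this still returns a point-in-time result. The streaming approach described above is what keeps the target in sync and the lineage trackable.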
This solves the problem of offline data cleansing only adding value for the users of that offline copy. If we clean in a central place, then anyone else who uses that data benefits from the cleaning that others have done before them.
This is why, when you go to add an integration from the many sources available within CluedIn, you will never find “Upload CSV”, “Upload Excel” and so on. Instead, we would rather you upload that file to a provider such as Google Drive, Dropbox or OneNote (or even your local file network) and integrate it through there. You still get the same value, but this way CluedIn can help track the lineage of where certain data is located. From a GDPR perspective, you can see why we are pushing this ethos.
At CluedIn we are about connecting the enterprise, and our goal is to help companies get rid of the disconnected nature of their business practices so that they can start gaining value from their data.
One step at a time.