Video: Architecting a Secure & Scalable Cloud Data Lake: Lessons from TriState Capital Bank
April 11, 2025 | HIKE2

Building a modern data lake is no longer just an IT project; it's a strategic foundation for scaling analytics, AI, and business decision-making. At Innovation Summit 2025, TriState Capital Bank's data leaders shared how they architected a secure, scalable cloud data platform built for both operational needs and future AI capabilities. Their firsthand lessons show how to design for flexibility, governance, and trust from the ground up, essentials for any organization navigating today's data-driven future. Watch the full session below, and start with these key insights.

Key Takeaways:

A Secure, Scalable Data Lake Enables True Business Transformation. By consolidating all structured and unstructured data into a centralized platform, TriState created a "single source of truth" that improves reporting, analytics, and operational efficiency, critical steps for supporting AI initiatives and federal regulatory requirements.

Tokenization Is Essential for Safe AI Integration. Before layering AI over their data, TriState implemented in-memory tokenization to protect sensitive information. This approach allowed them to experiment with AI models safely, ensuring that no private or regulated data could inadvertently leak into external systems.

Governance and Data Normalization Must Be Prioritized Early. Cross-functional collaboration helped the team create a normalized data model and shared business definitions, reducing confusion and improving decision-making across departments. Strong governance frameworks were key to managing schema drift, controlling AI bias, and ensuring long-term data quality.

AI Readiness Requires More Than Technology; It Requires New Processes. TriState's AI proof of concept emphasized the need for internal AI review boards, model monitoring, and retrieval-augmented generation (RAG) techniques to ensure safe, effective use of AI on enterprise data. AI governance isn't optional; it's foundational to responsible innovation.

Hello everyone, my name is Carlos McIltrot. I'm the data technology platform manager at TriState Capital Bank. I'm joined by two of my colleagues: Tommy, who's a solutions architect, and Benji, who's a data architect.
Over the past year we were tasked by the bank with building a data lake to support analytics, reporting, and AI. We were having a lot of issues at the bank with access to data: not having a centralized location, not knowing how to get into our data. And we have a huge project coming up for the $100 billion threshold initiative, which requires deeper reporting for the Federal Reserve. That's what drove this whole discussion.

First I'm going to go into what a data lake is and give people a little background. Then I'll talk about what we did at TriState for our data lake, what we call the common data platform. Tommy's going to take it over and talk a little about best practices and why you might want to implement a data lake or something similar in your organization. And Benji's going to talk about how you can enable AI on top of your data lake, some of the things we've worked on internally to give that AI capability to our people and really empower them with data.

So to kick it off: first, what is a data lake? I feel like a lot of people throw around a lot of terms about data, and for people who aren't down in the weeds working on these things, it's not very clear. First of all, a data lake is a single centralized repository of your data. You're no longer going out to disparate applications and different areas to pull in the things you need; you go to the one-stop shop, and it has everything you need from all of your applications and sources. It supports all types of data: structured, unstructured, and semi-structured. That's a bit of tech speak for things that are organized or unorganized. Say you have something living in an Excel spreadsheet; that would be your unstructured data. A data lake can support all of that, from there all the way up to a highly structured, normalized database. Your data lake will also capture historical data and keep old data for you, so if you ever need to look back at something that happened a long time ago, you'll have that capability.

And the reason you might want to do some of these things is analytics. Data is driving everything we do in today's day and age, and to power those analytics properly you need the data to support them. On top of that, I'm sure everyone's hearing about AI and machine learning, and to really power those models and get the value you can draw from them, you need somewhere to feed them from. That place can be your data lake.

A quick comparison, just to provide a little context: what's the difference between a database, a data warehouse, and a data lake? I'm not going to read this whole slide to you; that would be pretty boring. The high-level, important thing is that a data lake gives you that unstructured data support, and it really scales with your organization and your level of maturity. A database supports a single application; it's highly structured, with the format set up front. A data lake supports what we call schema on read, which means you don't have to define what the data needs to look like when you're pulling it into the lake; the structure is applied when you read it. We typically talk in terms of ETL and ELT, which stand for extract-transform-load and extract-load-transform, and our data lake supports the latter: you pull the data out of something, you dump it in here, you don't really worry about it, and then whatever you need that data for, you transform it to support that.
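To make the ELT and schema-on-read ideas concrete, here is a minimal Python sketch of the pattern Carlos describes. It is illustrative only, not TriState's pipeline code; the folder path, file names, and columns are made up.

```python
import json
from pathlib import Path

import pandas as pd

RAW_ZONE = Path("datalake/bronze/crm")  # hypothetical landing folder

def land_raw(records: list[dict], batch_id: str) -> Path:
    """Extract-Load: dump source records into the lake exactly as they arrived."""
    RAW_ZONE.mkdir(parents=True, exist_ok=True)
    out = RAW_ZONE / f"customers_{batch_id}.json"
    out.write_text(json.dumps(records))
    return out

def read_with_schema(columns: list[str]) -> pd.DataFrame:
    """Schema-on-read: the shape is imposed only when a consumer reads the data."""
    frames = [pd.read_json(path, orient="records") for path in RAW_ZONE.glob("*.json")]
    # Keep just the columns this consumer cares about; missing ones become nulls.
    return pd.concat(frames, ignore_index=True).reindex(columns=columns)

# Two batches with slightly different shapes land without any upfront schema work...
land_raw([{"id": 1, "name": "Acme", "city": "Pittsburgh"}], "001")
land_raw([{"id": 2, "name": "Globex", "segment": "Commercial"}], "002")
# ...and the "transform" happens at read time, per use case.
print(read_with_schema(["id", "name", "segment"]))
```

The design point is that the landing step never fails because of shape; each consumer decides what the data should look like when it reads it.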
So it's very flexible, and as your organization grows it grows with you. It can support a lot of different initiatives where you don't get the same flexibility with the other two types of data stores.

Moving on to the common data platform: this is the data lake we've implemented at TriState. I'll try to keep this a little light on the tech speak. I emailed it to my wife and she called me a nerd, so I'll keep it as light as I can. We built this all in Azure cloud. The three main cloud providers, Azure, Google, and AWS, all have essentially similar products, but we went with Azure just because that's what we use at TriState: our other teams are already heavily integrated into it, and to provide them with the easiest possible access we went along with the same cloud solution.

A few key parts of that tech stack. Data Lake Gen2 storage is Azure's flavor of data lake storage. It's essentially a fancy term for something that looks like a Windows file system, so people are really familiar with it. Even non-technical people can go in, see a folder named after their application, and know exactly what's in there. It gives you a lot of flexibility by supporting any file type and any structure, allowing you to just dump in everything you want.

We also use function apps, which are essentially small bits of code written to do specific things. The primary driver of those for us is change data capture. A lot of data systems are able to detect their own changes, and sometimes those changes are really important to an organization. Maybe you want to know whenever a customer gets updated or changes their address; these are all things that can help you make better business decisions. Being able to track those changes is something we thought would add a lot of value across the bank, so we've enabled that through function apps.

Another thing we've enabled is a process called tokenization. Tokenization is essentially swapping the value of a sensitive data point with a randomized one. If you have, for example, a Social Security number, you swap it for a randomized Social Security number. What that does is protect your data so people don't have access to the true values. This becomes really important when you start layering AI over top of your data lake, because AI harvests your data: it trains on your data, it pulls your data out. To protect your data from that, we use tokenization, and Benji is going to talk more later about the work we've done related to that.

We use Data Factory as an orchestrator; essentially, we use it to kick off processes, schedule things, and let people know when things are happening. We took a highly parameterized approach to this. We set the goal that someone should be able to onboard a new source into the data lake in one day, and we've accomplished that goal. We actually had a new hire just a few months ago with no previous experience with any of this, fresh out of college. We dumped this on him and said, "Hey, can you figure this out in a day?" and he was able to do it. We thought that was a really great win. It allows you to be highly flexible, constantly moving with the business and doing what they need.
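Carlos described tokenization as swapping a sensitive value for a randomized one so the true value never lands in the lake. The sketch below is a generic, in-memory illustration of that idea; a production setup would use a vault service (TriState uses Skyflow), and the record fields here are hypothetical.

```python
import secrets

class InMemoryTokenizer:
    """Toy illustration only: a real deployment relies on a managed vault service.
    The point is that the lake, and any AI layered on it, only sees the surrogate."""

    def __init__(self):
        self._vault = {}    # token -> real value (lives outside the lake)
        self._reverse = {}  # real value -> token, so repeats tokenize consistently

    def tokenize(self, value: str) -> str:
        if value in self._reverse:
            return self._reverse[value]
        # Random surrogate that loosely preserves the original format.
        token = "".join(secrets.choice("0123456789") if ch.isdigit() else ch for ch in value)
        self._vault[token] = value
        self._reverse[value] = token
        return token

    def detokenize(self, token: str) -> str:
        return self._vault[token]

tok = InMemoryTokenizer()
record = {"customer": "Jane Doe", "ssn": "123-45-6789"}
safe_record = {**record, "ssn": tok.tokenize(record["ssn"])}
print(safe_record)                         # only the randomized SSN lands in the lake
print(tok.detokenize(safe_record["ssn"]))  # only the vault can map it back
```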
Lastly, the last key technology we adopted is Synapse. It essentially gives you SQL, if anyone's familiar with that, and allows your more tech-savvy business partners to go in and query out the data they need. If they need something really quick, they don't need to work with the tech team; they can go in and get the data out on their own. For report writers and people providing analytics, those types of business users, they're empowered to do things they couldn't do before.

We also adopted what we call a medallion data architecture. If you look at the diagram on the right, you can see we have bronze, silver, and gold layers within the data lake; we also call these raw, refined, and curated. Essentially, at each level you continually refine and enhance the data so that it provides better business capabilities. Your raw data looks like something developers would use in a database, and your gold data is something an executive leader can look at at a glance, quickly understand, and gain value from to make quick and effective business decisions. And on top of that, again in the diagram on the right, you can see we layer analytics, reporting, and AI over all of this. That's really the end goal, the outcome of all this. The platform itself is an intermediate layer a lot of people don't really get a view into, but the outcomes are tangible and visible to the business leaders of the organization. And with that, I'll hand it over to Tommy; he's going to talk about why you might want to do this.

Sure. Keeping it simple here: number one is the consolidation of data sources, as Carlos alluded to, just a single source of truth. In many organizations you have data sprawl; you have data in many different locations and many different formats. This is just a way to consolidate it into one singular place. So again, to Carlos's point, an executive leader can look at it and quickly understand what it is, where it came from, and what the implications are of using it. "Single source of truth" is a universal buzzword you hear a lot when people talk about these types of platforms, but it has been a guiding principle of ours.

Moving on down, something I'm more passionate about is the process: the curation process between bronze, silver, and gold, or for us tech people, raw, refined, and curated. What happens between these layers? What do the layers mean? How do you engage with the teams doing the data engineering and making these business decisions? Something as simple as "what does a customer mean?" These are things we define in these layers, and they're a byproduct of these structured environments as well.

Along with this, it removes the business logic from the analysis tools. If you're in Power BI, Tableau, whatever you may be familiar with, instead of building logic into a dashboard and then sending it to someone else, they don't have to do that in the tool. Anyone that pulls in a customer or an account knows they're dealing with the same thing, in the same scope, context, and level of granularity, which is a problem we've observed in our bank. So it solves a problem there and makes our analytics users more effective.
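To make the bronze/silver/gold (raw/refined/curated) curation that Carlos and Tommy describe more concrete, here is a small, purely illustrative Python sketch; the column names and business rules are invented for the example, not TriState's actual model.

```python
import pandas as pd

# Bronze / raw: data exactly as it landed from the source system.
bronze = pd.DataFrame({
    "acct_id": ["A1", "A2", "A2"],
    "bal": ["1000.50", "250.00", "250.00"],      # still strings, duplicates included
    "open_dt": ["2024-01-05", "2024-02-10", "2024-02-10"],
})

# Silver / refined: typed, de-duplicated, conformed to shared definitions.
silver = (
    bronze.drop_duplicates()
          .assign(bal=lambda d: d["bal"].astype(float),
                  open_dt=lambda d: pd.to_datetime(d["open_dt"]))
          .rename(columns={"acct_id": "account_id", "bal": "balance"})
)

# Gold / curated: a business-level view an executive can read at a glance.
gold = (
    silver.assign(open_month=silver["open_dt"].dt.to_period("M"))
          .groupby("open_month", as_index=False)["balance"].sum()
          .rename(columns={"balance": "total_balance_opened"})
)
print(gold)
```

Each layer is derived from the one below it, which is what keeps the business logic in the platform rather than scattered across dashboards.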
Data modeling: it gives transparency into the source data and how systems interact. It's an accurate reflection of the way these systems exist in reality. If it's a customer relationship management system like we have, or lending, how are these accounts populated in the databases our applications use? You're at least able to track that lineage and understand what the business process is through the data. So it's intended to be an accurate reflection of those items as well.

Moving on down, it's representative of business practices. I alluded to that just now, but in the same context, you want your business processes to align with the data that's output. Whether it's a source system feeding a destination, or a curated data set that captures one specific line of business or portion of the business, it's something you can easily build here, easily replicate, and share out. Again, you find an efficiency there.

Maintenance is something that, unfortunately, we're all probably familiar with. These things take work to build, but they're even harder to maintain over time. There are knowledge gaps and tribal knowledge, so along with all the best practices and documentation we can aspire to have, the platform is built to accommodate schema drift as well. If a source system changes, the platform is natively built so it can accept that change and use it moving forward, without impact to downstream systems. It doesn't break everything if you add a field to a database, is one way to say it.

Data governance, which Benji and Carlos have been more involved in than I have, is a very important step here. It helps govern the quality of the data, the data lineage, the data ownership, all the things that are perhaps not the most exciting in the world but very important when you really drill into them. If I need to backtrack a data quality issue, I can get ahead of it by having an effective data governance strategy like the one we've been implementing.

Finally, patching and other upgrades. I put this in as a contrast with the ways of the past. My early career was riddled with server and database upgrades, and something always broke; it was remarkable. But here, everything we're doing is serverless for the most part: the function apps, Azure Data Factory. We're not residing on particular operating systems and versions. It's largely serverless technology that requires minimal maintenance in that regard, unless we update something like a Python version or something to that effect, which fortunately we haven't run into yet. It minimizes those items, which is taking advantage of the cloud.
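Tommy notes that the platform is built to accommodate schema drift, so adding a field to a source database doesn't break downstream loads. The sketch below shows one generic way to absorb a drifted batch; it is not TriState's implementation, and the table and column names are made up.

```python
import pandas as pd

def load_with_drift(target: pd.DataFrame, incoming: pd.DataFrame) -> pd.DataFrame:
    """Append an incoming batch even if its columns have drifted from the target.

    New columns are added (older rows get nulls), and columns the batch dropped
    simply arrive as nulls, so downstream consumers keep working.
    """
    all_cols = list(dict.fromkeys([*target.columns, *incoming.columns]))
    return pd.concat(
        [target.reindex(columns=all_cols), incoming.reindex(columns=all_cols)],
        ignore_index=True,
    )

existing = pd.DataFrame({"customer_id": [1, 2], "name": ["Acme", "Globex"]})
# The source team added an 'email' field; the load still succeeds.
drifted_batch = pd.DataFrame({"customer_id": [3], "name": ["Initech"], "email": ["a@b.com"]})
print(load_with_drift(existing, drifted_batch))
```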
Thanks, guys. What would a cool presentation about data be without AI? It's the new thing; everybody wants to talk about it. It's cool, it's great. But there are some challenges people don't really think about when they build AI capabilities on top of your typical data warehouse, or in this case a data lake. So, no pun intended, let's dive into the lake with AI.

The main focus here is what we did once we decided as a bank to build a data lake. Our leadership looked at it, there were conversations around AI, and you start talking about what you can bring in house: LLMs, GPT engines, all these cool conversations. So we decided to take a stab at building a quick proof of concept (POC) to understand what we have, the level of effort, the key outcomes, and what our journey would look like if we decided to embark on it. This slide is really about our journey so far: what we did, what we built, what our focus was, and the final outcome of the POC we created. I won't jump into all the cool AI buzzwords, because you've probably heard them over and over between yesterday and today.

We already had our Azure environment in place; like Carlos said, that's where our data lake was built. One of the things we also had to provision in the process was AI Studio, which is essentially the engine that facilitates building LLM solutions, whether you want to go the GPT route or the LLaMA route, which I believe is Meta's. So we had to provision that. The other thing we had to focus on was AI search; we'll put a pin in that, because it's key to some of the conversation that keeps coming up around AI. At the end of the day, the things we were super focused on were security (is there security?), the level of effort, and where we go after the POC is done. It may not sound great, but we ended up shelving the project; we got the chance to understand the key things we needed, but we decided not to go on to the next steps of operationalizing it.

The key outcomes were pretty much baked into the concept of tokenization. One of the things we realized, as a data organization and most importantly as a highly regulated bank, is the possibility of your data getting to bad actors, or getting fed into an LLM that is hosted somewhere you have no visibility into. The best part, the thing we did not have to overcome, was our process of tokenizing in memory. We tokenize our data in memory using Skyflow, so by the time we land it, it's already tokenized. No one gets the chance to see any sensitive data; no one gets the chance to know what my name is, because it comes out as something random. We already had that part figured out, so it was easy for us to just feed our data to these LLMs and see what happens, because we were not overly concerned about who gets their hands on it, whose name is out there, or which clients we're referencing.

One reason I mentioned it's worth putting a pin in the whole AI search piece is RAG. I believe most people have heard the concept of retrieval-augmented generation. It's about how you keep the model from hallucinating and how you make your data an effective part of the LLM so it picks up all the important things. Not to go through the whole high-level architecture, but our focus was pretty much building vector databases: there's vector indexing, there's embedding, and what we call the retrieval layer. Prompt engineering kind of explains itself; the better the questions you ask, the better the feedback the AI gives you. But the retrieval layer was something we focused on, because that's where Azure AI Search comes into play. You get the chance to actually index your data, you get the chance to build a semantic layer onto your embedding models, and then finally you have Azure AI Search as your retrieval layer.
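Benji's description of RAG boils down to: embed your content, index the vectors, retrieve the chunks closest to the question, and hand only those to the model. The sketch below is a deliberately tiny, self-contained illustration using a toy hashed bag-of-words embedding and an in-memory index; a real build would use a proper embedding model and a retrieval service such as Azure AI Search, and the documents here are invented.

```python
import math
from collections import Counter

def embed(text: str, dims: int = 64) -> list[float]:
    """Toy embedding: hashed bag-of-words. Real systems use a trained embedding model."""
    vec = [0.0] * dims
    for word, count in Counter(text.lower().split()).items():
        vec[hash(word) % dims] += count
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

# "Vector index": documents stored alongside their embeddings.
documents = [
    "Loan L-102 is a commercial real estate loan originated in 2023.",
    "Account A-77 belongs to the treasury management line of business.",
    "The recruitment team tracks open positions in the HR system.",
]
index = [(doc, embed(doc)) for doc in documents]

def retrieve(question: str, k: int = 2) -> list[str]:
    """Retrieval layer: return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

question = "Which loans are commercial real estate?"
context = retrieve(question)
# Only the retrieved chunks (not the whole lake) are sent to the LLM,
# which is how grounding reduces hallucinations.
prompt = f"Answer using only this context:\n{chr(10).join(context)}\n\nQuestion: {question}"
print(prompt)
```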
That retrieval layer handles your queries and improves retrieval times, and it was perfect for our use case: we have loans across multiple lines of business, we have recruitment, finance, our data literally all over the place. Building that layer makes it easier to create indexed chunks of information, and then you improve your retrieval process, which again helps reduce hallucinations, improves accuracy, and all of that.

The final thing we picked up was AI governance. It actually put me on a path where I ended up buying a book to read about it. I probably read, I don't know, 20 pages out of 500 and decided it was boring; I was done. But the key thing with AI governance is understanding that there is bias, data bias; it's right out there. There is the possibility of data drift: your sources of data might change over time, schema drift happens, and you have to pay attention to all of that. The whole concept of AI governance is something I think the entire AI world is still familiarizing itself with; we're still building tools, and there is Microsoft's fairness tooling, for example. But for a bank, one of the things we decided is that if in future we take on these next steps, we have to focus on what we came to term IRBs, internal review boards: a group of people, a group of brains, who sit in a room and build internal governance for our AI, so we don't end up with data drift, hallucinations, and all those things we don't want, and so we don't build any bias into our data. Again, it's banking data, so that makes it a little tricky.

The last thing was model monitoring with explainability. There are multiple tools out there; one that's really gaining traction is SHAP, based on Shapley values. These are typically third-party tools, and Microsoft is also trying to get into the game, that come in and help ensure your data doesn't get into a position where, again, data bias is the biggest thing: ZIP codes, account numbers, and the like can make your model behave in ways you don't want it to. That was one of the things we picked up.

I believe that's all. To wrap it up, it was a great project, and we had a lot of fun with it. I didn't have much AI experience going into the process, but I had a blast. And guess what, the screen is gone. How about that; how do you end an amazing presentation? But if you have any questions out there, go ahead.

Audience question: Any challenges with data normalization? What did you do?

Yeah, that's actually a huge problem you run into, and from two perspectives. One, from the tech perspective: if data isn't properly normalized, you can't relate it. And two, from the business perspective: two people are talking about the same thing but don't realize it, because they're looking at different views of that data. So we kicked off a cross-functional project where we worked with all these different data teams. We brought in representatives from the finance department, from operations, and from all these different groups, and we really sat down and asked, "What is important about the data to you?" We identified all of those, then took them back and built a normalized model that supports all of it, with common definitions and essentially a whole book of definitions for the data lake, so that everyone is looking at the same normalized view of the data.
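To illustrate the kind of shared-definition normalization Carlos describes, where "loan" and "tranche" turn out to mean the same thing, here is a small, invented sketch of mapping source-system vocabulary onto a common business glossary before data sets are combined; the terms and departments are made up for the example.

```python
import pandas as pd

# Shared business glossary: every source-system term maps to one canonical definition.
GLOSSARY = {
    "loan": "loan",
    "tranche": "loan",    # one system calls the same thing a tranche
    "facility": "loan",
    "cust": "customer",
    "client": "customer",
}

def normalize_terms(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Replace source-specific vocabulary with the agreed canonical term."""
    out = df.copy()
    out[column] = out[column].str.lower().map(GLOSSARY).fillna(out[column])
    return out

finance = pd.DataFrame({"record_type": ["Loan", "Facility"], "amount": [1_000_000, 250_000]})
operations = pd.DataFrame({"record_type": ["Tranche", "Client"], "amount": [1_000_000, 0]})

# After normalization, both departments are talking about the same "loan".
combined = pd.concat([normalize_terms(finance, "record_type"),
                      normalize_terms(operations, "record_type")])
print(combined)
```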
And we're finding that normalization is really driving better conversations with the business, because when somebody says "loan" and someone else says "tranche," those can really mean the same thing even though they're different words. There's a level of confusion we've started to eliminate through that normalization.

Audience question: Is your company acquisitive; do you make any acquisitions?

No, we do not. Thank you.

Audience question: How does adding AI to the lake affect it, as far as training data or any AI-related data being stored in the lake, or the AI changing the structure of data in the lake? Its presence in the lake itself, the footprint it takes up, and whether you're allowing it to make modifications to the lake. How does that work; is the AI in a separate space?

Yes, that's a very good question, which I actually failed to pinpoint. Microsoft's OpenAI Studio supports CSV, text, JSON, and a handful of other formats, and we store our data in what we call Delta Parquet, which it doesn't support. So, thanks to Carlos and team, they had to create a location for us, still in our data lake but in a different folder (like Carlos mentioned, the Azure data lake is like your typical file system), labeled something like PIP, I don't recall exactly. The process was that we had to change all the files we used for the POC to JSON, and when you want to use Azure AI Studio, you have to chunk up the data. So if you have 500 records and you write that to JSON, you probably have to chunk it up into 10 different files of 50 records each. So yes, we did have to make some transformations and move away from how we had built our data lake to support this. We did this about a year ago, and I don't know whether Azure has since expanded the types of data you can feed in, but JSON was what we went to, and we ended up making those changes and modifications.

As far as the AI's capability to actually modify that data, though: we purely did it on retrieval, to give people more power to find data. We weren't allowing it to make modifications to data, just because it's in its infancy and we wanted to make sure it wasn't changing customer account information and improperly reflecting it. But we did want people to have the ability to say, "Hey, what's the history of all this customer's accounts?" and have it pull that data back.
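Benji mentioned that, for the POC, the team had to convert lake files to JSON and chunk them up, for example 500 records into ten files of 50, before Azure AI Studio could ingest them. The sketch below is a generic illustration of that chunking step; the paths, field names, and sizes are just examples.

```python
import json
from pathlib import Path

def chunk_to_json(records: list[dict], out_dir: Path, chunk_size: int = 50) -> list[Path]:
    """Split a large record set into small JSON files an ingestion service can accept."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(0, len(records), chunk_size):
        path = out_dir / f"chunk_{i // chunk_size:04d}.json"
        path.write_text(json.dumps(records[i:i + chunk_size], indent=2))
        paths.append(path)
    return paths

# e.g. 500 records become 10 files of 50 records each, as described in the session.
records = [{"account_id": n, "status": "open"} for n in range(500)]
files = chunk_to_json(records, Path("datalake/poc_json"))
print(len(files))  # -> 10
```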
Audience question: You were talking about the use of AI. Did you use it to search the internet and learn how to do the things you were doing, and/or did you use it to look at your data at TriState and identify trends or answer questions?

We're primarily looking at the latter. I think when a lot of people think about AI, they think, "I'm going to go ask ChatGPT a question," and it pulls from the internet or from some publicly available database and brings data back. What that doesn't give you is the ability to have it analyze your own data, which is what this provided. It's actually looking at our data specifically. It's an isolated LLM running on top of the data lake and able to pull our data back. Before this, whenever someone had a question about an account, they might copy some account information and drop it into ChatGPT, and obviously, being a bank in a highly regulated industry with very sensitive information, leaking that data out and not knowing where it goes is a problem. When you have your own isolated data source to put the LLM on top of, you can have it give you that level of granularity with your data, where it's specific to you and not to whatever it found in a Google search, while keeping the data protected at the same time.

So thank you guys for listening. We very much appreciate you.