On-Demand Webinar

How a Fortune 50 Healthcare Services company realized data transformation in the cloud with Prophecy and Databricks

A Fortune 50 Healthcare Services company is leading the way in data transformation in the cloud. As a health services innovator, they strive to provide better experiences through technology and data. With a team of hundreds of data practitioners, including engineers and analysts with diverse skill sets, and tens of thousands of jobs run daily, their recent move to a cloud-based data architecture presented them with significant challenges, particularly in terms of standardization, user productivity, and automation.

To address these challenges, they worked closely with Prophecy to modernize their frameworks, empower users on the Databricks Lakehouse, and migrate thousands of jobs to the cloud. This collaboration has improved the efficiency and efficacy of their data engineering processes, enabling them to deliver superior healthcare outcomes. Watch this webinar for a discussion that delves into the specifics of their collaboration with Prophecy and how its low-code data transformation platform has helped them overcome the challenges of transitioning to the cloud at such a large scale.

  • Learn how a Fortune 50 Healthcare Services company overcame challenges in transitioning to a cloud-based data architecture with Prophecy's low-code data transformation platform
  • Discover how modernizing data frameworks and empowering users on Databricks Lakehouse can improve efficiency and efficacy of data engineering processes
  • See real-world examples of how the collaboration between a Fortune 50 Healthcare Services company, Databricks and Prophecy has led to superior healthcare outcomes, and learn how you can apply these insights to your own organization

Transcripts

Maciej

Hey, everyone, and welcome to our healthcare webinar. Even though one of our primary speakers has not been able to make it, don't worry, we are going to make it up to you. I have been deeply involved in deploying the solution to our healthcare customer and transforming their data ecosystem, so I'll walk through the why, the challenges, and the results they've encountered. Franco is going to speak at length about some of the technical integration details with Databricks. But definitely make sure to stay till the end, as we're going to show you a complete demo of the platform where, following the current trend of generative AI models and connecting it with the healthcare theme, we are going to build a complete medical assistant for healthcare professionals in just a few minutes on Prophecy and Databricks. So definitely make sure to stay tuned for that. Okay, but let's dive in. As you all realize, building an enterprise-grade data platform is an undertaking of tremendous scale. You always have to think from the perspective of your user. And in our case, the users of the platform are the engineering teams, the business teams that make decisions for the company to succeed, but also the medical specialists who seek to improve patient outcomes.

Maciej

Therefore, the platform team not only has to design, maintain, and build a robust data platform, but also has to ensure that it's incredibly easy to use, even for non-technical consumers. The platform team is tasked with this almost impossible challenge, where on the one hand they want to build a system that is at the very edge of the state of the art, and on the other hand, they have to build something that is accessible to a wide medical audience. Speaking about state-of-the-art data solutions, as you can imagine, that's not where any of these large enterprises start from. They have built tremendous systems that very often have been hardened through decades of hard work and experience. Just a few years ago, all of those systems were running on premise and were using proprietary and very expensive execution engines. They've served them very well. However, they also lock them into a limited and very expensive ecosystem, an ecosystem that is very difficult to hire for, as the skills to maintain it are no longer taught in universities and certainly not in any online course. New engineers are not excited to maintain them either.

Maciej

Additionally, those systems fail at the ever-growing scale of data and are really not flexible enough for changing demand. Finally, there are the data privacy and compliance demands. Our nations are making a tremendous amount of progress toward a more secure and considerate environment for data privacy. But at the same time, that means more and more compliance that our teams have to satisfy. The legacy architecture is just too rigid for that. Here's a quick, very high-level snapshot of the on-premise architecture. The details of it are, of course, something that we can't really deep dive into, but the general graphic represents the system very well. This is the traditional ETL setup that might look familiar to what some of you might have seen. However, let's take a slightly closer look at it. On the left-hand side, we have a tremendous amount of various data sources. They vary not only in type, as we have live streams, legacy COBOL-based files, and traditional operational databases as sources, but they also vary in quantity and purpose. Thousands upon thousands of sources are ingested daily. The main ETL engine of choice here is Ab Initio, which powers the core transformations, but it's also using a proprietary, purpose-built system.

Maciej

There is also an in-house built framework that is able to handle data of any complexity, clean it, process it, and merge it into the master data. All of that main logic is stored within those business-rules-driven configurations. You see, even though the underlying framework and source code are very complex, even the business users can define their configurations on top of it, though it's only done, of course, in a very limited and sometimes hard-to-scale way. All of this has been built upon hundreds of thousands of lines of code. At the end of the day, when the engine is done, the analytical facts and dimensions tables are written back to the analytical database of choice, Netezza, from which further business units can create their specialized reports. This whole machinery that this Fortune 50 enterprise runs on operates on a very tight daily schedule. And if there are any missed files, unprocessed records, or failed jobs, medical insurance claims might not be resolved on time. So with all of those challenges in mind, the data platform team set out to re-architect and completely move the existing system to a new set of technologies.

Maciej

Before diving into the new stack that they've settled on, let's look at some of the high-level goals. First of all, they wanted to realize the cloud strategy. Cloud allows us to manage the load in a very flexible fashion and seamlessly respond to demand. Instead of purchasing expensive hardware and diligently planning out its usage, we can just keep adjusting the requested load at a very fast pace. Second of all, they wanted to enable their team on the best open source technologies. More and more engineers straight out of college are familiar with Spark and the modern data stack. Even though the learning curve is sometimes very steep, it's becoming further standardized. And additionally, there is a tremendous amount of tooling, especially on the low-code side, that can help you further increase productivity. Finally, they wanted to meet their SLAs, no matter what the load is for a given day, all to deliver high-quality data products to their users. Okay, so how did they plan to get there? Of course, they've done their due diligence and significant stress testing of various platforms. They've done a tremendous amount of POCs. And the solution they finally decided to choose is Databricks and the Lakehouse architecture.

Maciej

Because Databricks is based on Spark, it gives them the breadth of the ecosystem of a well-maintained, end-to-end, enterprise-backed open source technology. Some of the innovations in particular that they've decided to leverage are, for instance, Delta for efficient file merges and Photon for high-performance compute. Unfortunately, one of the tremendous challenges is educating the existing teams and engineers on the new data stack. Those are the folks that have been really productive on their old on-premise, very often visual, technologies. And Databricks with Azure is very powerful; however, it requires a pretty steep learning curve. Therefore, they've also decided to use Prophecy as the low-code layer on top of the low-level data platform. Lastly and most importantly, they needed to figure out how to take all the existing workload from their existing on-premise environment and move that over to the new Databricks Spark world. They decided to embark on this journey to migrate all the existing data jobs and assets directly to it. One of the choices that the Fortune 50 healthcare company has made is, of course, Databricks, and Franco is going to give you a lot more context on that.

Maciej

But just looking quickly at why they decided to use Prophecy: well, Prophecy, of course, when you're going with Databricks, is a natural choice. We have a very strong integration with Python, Unity Catalog, and Delta, all the technologies that the Fortune 50 healthcare company has decided to use. And we also have a very easy low-code visual editor. The whole idea is that your team can just start building pipelines seamlessly by dragging and dropping gems that turn into high-performance code on Git. Really, the only knowledge that is required for you to be productive on top of the platform itself is knowledge of basic Excel functions. Additionally, the platform itself is super extensible. We're going to deep dive a little bit more into the framework part, but it's very easy to build on top of it and create new gems that encompass some of your existing functionality. And also, since the platform gives you high-quality code on the other side, it integrates really well with your existing CI/CD. For the CI part, you can still follow all the best software engineering practices like unit testing, building your data pipelines, and scheduling. Finally, on the continuous deployment side, you can take and deploy your data pipelines directly to whatever environment you'd like.

Maciej

That was a little bit about the low-code solution for the development itself. But as we mentioned, the healthcare company also decided to migrate all of their existing workload directly to the cloud. That is very difficult. As we mentioned, they have thousands and thousands of different configurations, hundreds of thousands of lines of code, and this whole purpose-built, in-house, proprietary framework that runs for them. You cannot just do it manually. You can't take and rewrite all of this code; it doesn't matter how many people you would try to throw at that particular problem. It is really impossible. Well, this is where Prophecy introduces the Prophecy transpiler, which is able to take your ETL code from solutions like Alteryx, Informatica, SSIS, and Ab Initio and convert it directly into high-performance Spark on the other side, which you can then further modify and edit through the low-code visual tool. Our team has spent years building it and used a tremendous amount of experience from previous companies building low-level compiler technologies at enterprises like NVIDIA or MathWorks. What's also really important about this transpiler is that it not only converts your code one to one, which sometimes might lead to duplicate code, but it is also capable of converting frameworks.

Maciej

You can start by converting your framework, turning it into a Spark-based framework that will support all of your requirements around data quality, security, monitoring, auditing, etc. Then you start converting the configurations that run on top of it. Okay, awesome. Speaking about the framework, here is a little bit more depth on why the majority of enterprises decide to build frameworks for their data platform teams and how they really enable them to be more productive. Any pipeline developed by business users should always meet the tight cost and time SLAs. This is tremendously important, but very often what happens is those business users don't exactly know how to fine-tune and optimize their code. Therefore, you want the framework to be there for them to serve as the foundation on which they can start writing their business logic, while the framework takes care of maintaining the cost and time SLAs. The framework also has to be composed of small reusable services. We've mentioned those gems, and you're going to see how they work in the demo directly. Those gems function essentially as components from which you can build your pipelines of varying functionality.

Maciej

They can handle the ingestion, protection of the data, transformation, and many other use cases that you're going to see very soon. Since we are running a transpilation project, functional integrity and incremental updates are very important, so the framework has to support that as well. At the end of the day, after you're done with the migration and your pipelines are running in production, you do want to make sure that they are running really well. A health dashboard, status notifications, and complete auditing and telemetry are incredibly important. Another aspect that the company was considering and putting a lot of weight on is the portability of the produced code. At the end of the day, Databricks gives you a tremendous Spark environment, but even they upgrade their versions very, very often. You want your code to be able to work on native Spark, and you don't want to have to maintain it over extended periods of time. It has to be as portable as possible. And all of that code should live within the best development practices possible: all the best DevOps with the best software engineering. So that means that it should live just as a standard repository on Git, from which other people can fork it, use it as a template, and follow all the prebuilt, standardized CI/CD rules, while still making it accessible to folks who might not know that much about Git in the first place.

Maciej

Okay, so those were some of the requirements for the framework. Let's look at what that healthcare company has built. They've leveraged some of the best capabilities of Prophecy to create this framework. On the left-hand side, you can see that there is a data platform team that has built a project, a complete repository for the framework itself, that contains the generic pipelines. It contains the ingestion pipeline, the transformation pipeline, and some of the basic templates for reporting. It also contains basic raw data sets from which other teams can start building further downstream reporting. Speaking about the Databricks architecture, they often call this Bronze, Silver, Gold; some of you might be familiar with it. This is also setting up the Bronze layer itself. Additionally, that framework features a set of reusable gems and transformations that other teams can start leveraging so that they don't have to rewrite the core business transformations every single time they are trying to build a new report. Now, once the platform team has built this, the business teams can start taking this framework, enhancing it, and producing their own specific application code.

Maciej

This is also where the transpiler comes in, where it's actually spitting out code compatible with the framework itself. What's the result of that? Well, you end up with very standardized code, and your business teams are able to deliver significantly faster. Let's keep going. Perfect. Now, we looked a little bit at the framework, which is a very deep dive into the code structure itself. Let's look a little bit at the architecture. Where do Prophecy, Databricks, Git, and all the other systems fit together? And how do they play with each other? Well, it all starts with the transpiler itself. As we mentioned, they underwent the journey of converting the proprietary Ab Initio business logic directly to Prophecy-compatible, Spark-native, open source code. So the very first step had to be to use the transpiler to convert the code onto Git, test it, optimize it, and make sure that the underlying frameworks work with it. All of that happened directly in the Prophecy IDE, where you can see the interactive debugging as you run through a pipeline, you can see the interactive data, and you can build all of the gems and reusable templates. That code has been stored directly on their GitHub and then, using CI/CD, propagated to their development, QA, and production environments.

Maciej

The JARs have been stored natively in their artifact repository, and that's something, of course, that both Databricks and Prophecy integrate really well with. As the jobs start running and processing the data from the on-prem systems, which was sent to the cloud by copying it over, they run on top of the Databricks runtime with Prophecy-developed jobs doing the majority of the processing there. The new warehouse of choice has been Azure Synapse, where some of the data is then consumed for various purposes by the business teams. Perfect. So that was the architecture. Now, what value has all of that provided to that huge enterprise company? Well, first of all, the first project is already behind them. They've been able to convert it. It's been tremendously successful. They've done it significantly faster than was initially planned. Now, they have many teams working in parallel, many business units working in parallel, to convert their own new projects. The first team is already running in production. All the other teams are starting to get there very quickly. They've taken this approach where, after the very first team has been productive and enabled, that means that the framework and the underlying platform work really well.

Maciej

Now you can start parallelizing and going fast with many other teams. This would have never been possible without the architecture that, of course, involved both Prophecy and Databricks. And that has provided them a significantly faster data transformation time. A tremendous number of jobs are already in production, as we've mentioned, around 500 of them, and all of that is sitting on this unified data and analytics platform. Okay, so just summarizing that really quickly: low-code tooling enables all the data team members and boosts their productivity significantly. They were able to maintain a really high-quality code base thanks to it, because at the end of the day, everything has been built out of the standards-based framework. Using the Databricks Lakehouse, they've unified the low-code approach with their data analytics and all the other future AI use cases. And they've been able to successfully accelerate this whole cloud strategy that has been tremendously important to the success of their business. The ability to now deliver data faster to their stakeholders has had a direct impact on their decision making and, at the end of the day, on customer satisfaction as well. Franco, over to you.

Franco

Thanks, Maciej. That was a great introduction. Absolutely, the partnership between Prophecy and Databricks is amazing. One of the basic things about data engineering, especially low-code data engineering, that most customers really appreciate is the ability to visualize it, or to be able to interact with the system via a GUI. This is something that Databricks doesn't really have, and it's great that Prophecy does. And it integrates very well with the Lakehouse platform. Essentially, one of the things that Maciej talked about earlier was the transpiler and how it transpiles code from all of these old on-prem systems, these legacy systems that you might have been using in the last generation, and then refactors it for the cloud using Databricks and Spark code. This is what I call, or what some of us in the industry have been calling, minimum viable refactoring, because you can't just take on-prem code and shove it into the cloud, or do what some vendors call lift and shift, because you're metered on everything in the cloud. It's different from on premises, where you owned the hardware and the network and you licensed the software, so you could use it however you saw fit.

Franco

In the cloud, you're metered on every bit switched, on every compute cycle, and on how much data moves around. So typically, these on-premises architectures don't mix well when you go into the cloud. With Prophecy and the transpiler, it's excellent because it actually can automate a lot of this minimum viable refactoring for you upon ingestion. And that's where Prophecy fits. It basically is what I call the pane of glass that sits in front of the code for your users. Essentially, it has Git built right in, and so does Databricks with Repos. So it's great that we all use Git to be able to practice proper CI/CD patterns in the cloud. And then everything traverses through what we call Bronze, Silver, Gold. But back in the warehouse days, or in the early ETL days from the last generation, you might have called this the raw, staging, and presentation layers. It's all the same thing. We just call it a little bit differently. Basically, you drop your raw data as it comes from your source, whether it's from a file-based ingestion, a streaming-based ingestion, or a CDC-based source. Then you clean that data up for an integration layer.

Franco

This is what we call Silver. This is where you have all of your data, and it's clean and ready to be consumed, but it's not aggregated and built specifically for certain purposes. This is where all our clean data is. And then you have the Gold layer. This is where you have your solutions. Typically, this is your presentation layer, but it could also be feature tables or feature stores. It can be anything that you would need to do with the Lakehouse. And then once you've engineered all your data into the Delta Lake architecture pattern, you can start servicing all your use cases. The Databricks SQL Warehouse is a serverless warehouse that boots instantaneously and is able to dynamically scale to the consumption that your BI users need to service their enterprise reporting and dashboards. And also, your data is now readily available for data science and machine learning. In Databricks, we have the Databricks Runtime with our notebooks in the cloud. It provides a great way to get started delivering value to your business with ML and AI. But we also have other capabilities as well. For your citizen data scientists, we have products like bamboolib, which is like a recipe view that helps generate code for you for data science, and other ways that you can integrate to help you get your job done.
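
To make the Bronze, Silver, Gold flow concrete, here is a minimal PySpark sketch of the pattern Franco describes. The table names, paths, and columns (claims_raw, claim_id, and so on) are illustrative assumptions, not code from the webinar or from Databricks itself.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Bronze: land the raw source data as-is (path and table names are hypothetical).
raw = spark.read.json("/mnt/landing/claims/")
raw.write.format("delta").mode("append").saveAsTable("bronze.claims_raw")

# Silver: clean and conform the data for general consumption.
silver = (
    spark.table("bronze.claims_raw")
    .dropDuplicates(["claim_id"])
    .withColumn("claim_date", F.to_date("claim_date"))
    .filter(F.col("claim_id").isNotNull())
)
silver.write.format("delta").mode("overwrite").saveAsTable("silver.claims")

# Gold: a purpose-built aggregate for a specific report or feature table.
gold = (
    spark.table("silver.claims")
    .groupBy("provider_id", "claim_date")
    .agg(F.sum("claim_amount").alias("total_claim_amount"))
)
gold.write.format("delta").mode("overwrite").saveAsTable("gold.claims_by_provider")
```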

Franco

The integration between Prophecy and Databricks is actually quite amazing. When I first saw the transpiler in action, my jaw dropped because I thought it was an amazing piece of technology. What Maciej and team have done since then has blown my mind. Because essentially, as an engineer, sometimes you have certain engineers who only think in code, and you have some engineers who primarily think in visual nodes on a plane, right? Gems in this scenario. And then there's people in between, right? There's a spectrum of people. We're all different. We're all different types of learners. Some of us are visual learners. Some of us are textual learners. And to have something that can speak both languages at the same time and go back and forth is phenomenal. To be able to take your code and represent it visually, super simply, is a great way to bridge the gap between users on the Lakehouse with Databricks. And, I don't want to steal a whole lot of Maciej's thunder for later, but these now connect to our SQL warehouses. And that's amazing because the serverless SQL warehouse provides amazing value for our customers.

Franco

And now, to be able to visually create, through gems on a graphical interface, a SQL ELT or ETL paradigm that then persists down to a SQL warehouse, that's awesome. We don't have anything like that at Databricks, and I'm sure glad that the folks at Prophecy have built this. So if you need more than just code-first, if you need to be able to support users who think visually or need to interact visually, but you want to be able to go back and forth, Prophecy is great for that. And migrations. I talked about this a little bit before with the transpiler, but to say rapid migration of ETL workloads is an understatement, because what's not captured in that sentiment is the transpiler's ability to refactor the code for the cloud, and that in and of itself is worth its weight in gold. Because a lot of times what I see in these legacy ETL paradigms that are lifted and shifted into the cloud is that about three to six months after go-live, the costs skyrocket. It looks like a hockey stick. And it leaves business executives in this state where they're asking,

Franco

Was cloud the right decision? Did we make the right decision here? Our costs are out of control. What happened? And that's because you took on-premise code and you shoved it into the cloud, where you're metered on everything, and it's not working. So the way that Prophecy actually takes the code from on-prem and transpiles it into Spark code that is refactored for the cloud is amazing. It will save you so much heartache on migrating the code, and then you won't run into this cost spike that most other customers get when they do a quick lift and shift into the cloud. So I believe if you have an ETL migration coming from any of those legacy products that Maciej talked about earlier, you should definitely do a quick POC with Prophecy and Databricks to see if you can get some quick wins, because I think it will scale significantly more efficiently than some of the lift and shift patterns that exist out there in the wild. Before we go, I saw that there was a quick question from my friend Ravit in the chat.

Franco

The question was: how does Databricks actually provide efficiency over just Spark? Because Prophecy is awesome. It does produce Spark code, and the new SQL product does connect to our SQL warehouse. But what if you're using another type of managed Spark? What does Databricks offer? Databricks is probably the most efficient Spark that exists in the cloud, and we have more than enough benchmarks to prove it. But essentially it's the underlying tech, with our Photon execution engine, and just everything works together. So you don't have to worry about all the configurations. And with Prophecy, you don't have to worry about all the code. So that's why Prophecy and Databricks together, when you're thinking about your migrations, is the super simple button for cloud migrations, because you don't have to manage the complexity of Spark with Databricks, and you don't have to manage the complexity of code with Prophecy. And so I think that Prophecy and Databricks is a winning combination for your next ETL migration. And with that, I think we're going back to Maciej.

Maciej

Awesome. This was amazing stuff, Franco. Thank you so much. Let's deep dive now into a little bit more practical part of the actual webinar, which is going to be the demo itself. As we've mentioned, even though one of our panelists was not able to join today, we're going to completely make it up to you with a really cool demo of some of the capabilities of Databricks and Prophecy. Let's just get straight into it. I have here a template that, after this webinar, anyone will be able to start with directly in Prophecy, continuing and building further upon the demo itself. What we're going to do today is actually very cool. Following this theme of generative AI and LLMs, but also grounding it in the healthcare industry, we're going to build a really quick medical advisor that will ingest all the medical claims, research papers, and clinical trials, process that information, and then enable the chatbot itself to answer questions based on it. The process of building it might look very technical once we start getting into it, but it's actually very straightforward. The only thing that we have to do is vectorize all of our text, vectorize all of our medical research content, and pipe it into one of the vector databases.

Maciej

In this case, we're going to be using Pinecone. And then we can actually start asking questions of the chatbot itself. Sorry, I hope everyone can still see my screen. Yes, okay, perfect. And then we'll be able to start vectorizing the questions that are coming to our medical assistant, comparing them with the existing corpus of knowledge, and answering questions based on the most relevant documents. This sounds like something that would normally take weeks or even months to build, whereas we're talking about building it in 10 to 15 minutes. Please bear with us as we go through the process. But let's embark on it. I have this Prophecy template. I can just go ahead and fork it as my own repository. This is going to be my own medical advisor, and I'm going to create the template by myself. Perfect. Now my GitHub is cloning all of the code for me, and I can just go back to Prophecy and start building this whole repository up. Let's create a new project. This is going to be my generative AI medical assistant. I'm going to be using Python with Spark natively on top of Databricks, and I'm going to pick that existing repository that I just forked.

Maciej

This is my advisor. There you go, perfect. Just like this, I now have this complete project already prebuilt for me that I can start enhancing and building further upon. It ingests the data, vectorizes it, and then answers the questions. This is something that is, of course, prebuilt already, but where is the fun in looking at stuff that was prebuilt? Let's just build all of this up now from scratch. We're going to start by creating a brand new pipeline visually on top of the Prophecy canvas. This is going to be my ingestion pipeline itself. Now, where are we going to start getting our medical data?

Maciej

Well, one of the best sources is the PubMed repository. I don't know how many of you might be familiar with it, but PubMed is this huge repository of medical research papers, a tremendous amount of data, including titles and abstracts, all open and accessible to everyone. There is a whole dump of around 35 million research papers that anyone can just start reading from. Let's go ahead and start ingesting the data. Well, first we need to read the index. Let's go ahead and create a new data set directly in Prophecy that will load our article URLs. In Prophecy, you can just pick what data type you'd like to start with. Here I'm just using a web page directly, and I'm going to read all of the article URLs from it. We can infer the schema, which should in this case be fairly straightforward; we're just loading up the whole web page itself. Perfect. So we have our URL and the content. That's all we needed right here. The only thing that we have to do is, of course, extract that content and write it directly onto our Databricks DBFS storage. So I'll just pick my content column, rename it to text, and make sure all of my gems have nice, reasonable names.

Maciej

So let's write it out. We're just going to write this down as a standard text file. Prophecy provides you a file browser, so I can just quickly pick my project and where I'm going to be writing my URLs. Awesome. Just like that, I have all of my URLs loaded. Let's go ahead and start downloading this corpus of research papers. I can pick up where I left off and just start from my URLs. This time I would like to process this file as XML, since my URLs are stored within an XML-compatible format, and pick the same location that I started with. Perfect. We're going to be processing all of the links right here. The only thing that we have to do here is make sure that our text is actually, of course, written on the other side. Let me start running this top-level pipeline. We can see the full text, the full HTML, sorry, HTML-compatible content, completely written out. Perfect. Now we can load up all of the links directly from it. We will see all the URLs. Perfect.

Maciej

We can now start working with them directly. Let's just look at a quick preview of our actual data. There we go. We have around two and a half thousand files that we can start loading from. We see all of the file names right here. Let's just focus on the valid files only. As you notice, there are a bunch of READMEs and other files that we would like to filter out. Let's clean this up a little bit. Perfect. I'm just going to use a slightly more complex regex expression here that's going to focus on my XML .gz files only. We can just run this filter to make sure that we have all of those files. Perfect. Now I'd like to start downloading them one by one. To be able to do that, I need to first create a URL based on the file name. As we only have the file names themselves, we need complete URLs. Let's form a new column. It's going to be a URL column that concatenates the base URL with the actual link, the actual file name itself. Then let's also store our file name.
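
Outside the visual editor, the filter-and-build-URL step described here might look roughly like the following PySpark sketch. The column name, the sample file names, and the base URL are assumptions for illustration only.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical DataFrame of link names scraped from the PubMed index page.
links = spark.createDataFrame(
    [("pubmed23n0001.xml.gz",), ("README.txt",)], ["file_name"]
)

# Keep only the .xml.gz article dumps; drop READMEs and other files.
valid = links.filter(F.col("file_name").rlike(r".*\.xml\.gz$"))

# Build the full download URL by concatenating an assumed base URL with the file name.
base_url = "https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/"  # assumed location
with_urls = valid.withColumn("url", F.concat(F.lit(base_url), F.col("file_name")))

with_urls.show(truncate=False)
```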

Maciej

Okay, perfect. Now with that, we can actually start downloading and saving our files. Let's go ahead and drag one more component that's going to do just that. We already have a bunch of functions that are useful and going to be very helpful. We can just download from that URL and file name directly. Perfect. For the purposes of this demo, just so that we don't have to wait for hours to download all of the files, I'm going to focus on one file only. Let's just limit this to one file and go ahead and run the demo itself. Okay, so now we have the files downloaded. I can actually show you very quickly what they look like, but let's start loading them up directly. Those files are structured as XMLs. They're fairly complex under the hood, but let's go ahead and start reading the articles directly. I'm going to create my data set that points to the articles themselves, reads the XML, and then extracts the citations from the file. We can go ahead and start inferring the schema. Let me just very quickly show you what the actual content of the file looks like. Again, we're working with very complex data here.

Maciej

This is an example of the file itself. It contains our articles' IDs, titles, and the actual abstract content itself. Let's go back to Prophecy. Awesome. Schema inference already finished as well. We can see the schema right here and a preview of the data that we can load up very quickly as well. From there, let's start writing those articles directly to our Delta tables. I can just drag and drop another target and build up my articles table. This is going to be a catalog table existing right here. I'm going to place it in the Bronze staging layer and store all of the articles directly. Awesome. Now that we went through this process of ingesting all of the data, we can start vectorizing all of it as well. Let's go ahead and create a new pipeline that's going to vectorize all of the data for us. Okay, perfect. This time, instead of just focusing on this one file that we downloaded, let me work on the full corpus of all the available articles. I can just drag and drop this existing data set, load it up, and see how we loaded and ingested our data.
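
Before moving on to vectorization, here is a rough standalone sketch of the ingestion step just shown: reading a downloaded PubMed XML dump and landing it as a bronze Delta table. This assumes the spark-xml package is attached to the cluster; the row tag, field paths, and table name are assumptions based on the public PubMed layout, not the code Prophecy generates.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Requires the spark-xml library (com.databricks:spark-xml) to be installed on the cluster.
articles = (
    spark.read.format("xml")
    .option("rowTag", "PubmedArticle")          # one row per article
    .load("dbfs:/tmp/medical_advisor/pubmed/*.xml.gz")
)

# Pull out the fields we care about; paths follow the public PubMed XML layout (assumed).
bronze = articles.select(
    F.col("MedlineCitation.PMID._VALUE").alias("pmid"),
    F.col("MedlineCitation.Article.ArticleTitle").alias("title"),
    F.col("MedlineCitation.Article.Abstract.AbstractText").alias("abstract_fragments"),
)

# Land it as the bronze staging table (table name is hypothetical).
bronze.write.format("delta").mode("overwrite").saveAsTable("bronze.pubmed_articles")
```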

Maciej

Here we're working with the full set of many millions of actual research papers. This might just take a second to load up, but for the purposes of development, we can of course limit the amount of data that we're working with as well. There we go. This is how fast processing all the 35 million research papers is, and we can see the complex, very nested underlying data structure that represents them. We have what's most important to us, which is the actual abstracts of all the articles, but we can also see their titles and all the IDs. Let's go ahead and start by cleaning up the values a little bit. Again, we are working with a deeply nested data structure here. What I would like to do is start by forming the abstract as a single column. I shouldn't have to work with this very complex data structure, so I can just say, hey, I would like to create an abstract and join all of the fragments of the abstract together. At the same time, let's also add the title so I can keep working with this schema super easily and see the actual article title.

Maciej

One click away, I'm adding the title as well. Now I can also add the ID of my article. I have my PMID here ready as well. Now, instead of waiting every single time while debugging this pipeline, which takes a little bit of computational processing, why would I do all that for my dev purposes? Let's focus only on the very first, let's say, 100,000 articles. Perfect. Let's now look at what our clean values look like. You can really see how performant Databricks is here with Prophecy and the final values that we are getting. Sorry, I used the wrong column here. This is supposed to be value. There we go. We can just rerun it and make sure that this is correct. Perfect. Some of our articles don't have abstracts, so we can just filter them out further in the process. Let's drag and drop an additional component that's going to focus on the non-empty values only. Here, let's build a medical assistant that's focused on a particular domain so that, instead of working on the whole corpus of research papers, which would take a little while, we're going to focus only on migraine-specific research.
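
A rough PySpark equivalent of this cleanup step (joining the abstract fragments, carrying the title and PMID, dropping empty abstracts, and limiting to a dev sample) could look like this. It reuses the assumed bronze table from the earlier sketch and assumes abstract_fragments is an array of strings.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

articles = spark.table("bronze.pubmed_articles")  # table name assumed from the earlier sketch

clean = (
    articles
    # Join the array of abstract fragments into a single text column.
    .withColumn("abstract", F.concat_ws(" ", F.col("abstract_fragments")))
    .select(
        F.col("pmid").cast("long").alias("pmid"),
        F.col("title"),
        F.col("abstract"),
    )
    # Drop articles that have no abstract at all.
    .filter(F.col("abstract").isNotNull() & (F.length("abstract") > 0))
    # Keep a small slice for interactive development.
    .limit(100000)
)
```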

Maciej

I can drag and drop an additional filter component, and this time filter for migraine-related papers only. Okay. Perfect. Now we're starting to get our migraine research, just a few papers. Now the most important step is to actually start vectorizing this data. We need to vectorize it so that further on we can pass it to our models and find similar articles that we can then feed into our chatbot itself. To do that, I'm just going to use one of the machine learning components available out of the box in Prophecy, OpenAI itself. I can just drag and drop that, and I don't have to write any advanced Python logic. I don't have to download any additional dependencies. Drag and drop, and I can just pick to start creating the embeddings themselves. In this case, I'm going to order it by my ID, and the text that I'm going to create the embedding on is my abstract. Let's look at the schema of my component on the output. Perfect. Now we are actually starting to get our embeddings, and any potential errors OpenAI produced. Of course, if you want to use any other model for creating those embeddings, that's really easy to do as well.
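
Outside the drag-and-drop gem, the same embedding step can be approximated with a pandas UDF that calls the OpenAI embeddings API. The model name, the keyword filter on "migraine", and the reuse of the `clean` DataFrame from the earlier sketch are assumptions, not the exact logic the gem generates.

```python
import pandas as pd
from openai import OpenAI                       # assumes the v1-style OpenAI Python client
from pyspark.sql import functions as F, types as T

@F.pandas_udf(T.ArrayType(T.FloatType()))
def embed(texts: pd.Series) -> pd.Series:
    client = OpenAI()                           # reads OPENAI_API_KEY from the environment
    resp = client.embeddings.create(
        model="text-embedding-ada-002",         # model choice is an assumption
        input=texts.tolist(),
    )
    return pd.Series([d.embedding for d in resp.data])

# `clean` is the cleaned-articles DataFrame from the previous sketch.
migraine = clean.filter(F.lower("abstract").contains("migraine"))
embedded = migraine.withColumn("embedding", embed(F.col("abstract")))
```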

Maciej

Let's just run it quickly to make sure that we are getting the right output. Perfect. Now for every single abstract and for every single title, we have this really complex set of numbers that describe it in a vectorized space. Awesome. Now we just have to do the one last thing, which is to write out all of those vectors directly. Instead of writing them out to databricks in this case, I'm actually going to use another vector database called Fincom that's going to store all of our vector embeddings. Again, writing to any particular target or ingesting data from any particular source, super easy from prophecy. Just a few clicks away, we can set that up. We can just choose our index name here. And there we go. Now, let's specify our ID for each vector and the embed itself. Perfect. Now we can start running the pipeline itself. In Python directly, we can now start seeing also some of the vectors appearing. I already preloaded it with some of the vectors so that we don't have to wait for that. Now, let's go ahead and finally get to the really exciting part, which is building the actual chatbot itself.
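
For reference, writing the embeddings to Pinecone by hand might look like the sketch below, assuming the current Pinecone Python client and Databricks secrets. The secret scope, index name, and the `embedded` DataFrame from the previous sketch are all assumptions.

```python
from pinecone import Pinecone   # assumes the current pinecone client package

pc = Pinecone(api_key=dbutils.secrets.get("demo", "pinecone_api_key"))  # scope/key names assumed
index = pc.Index("medical-articles")                                    # index name assumed

# Collect a small batch to the driver for the sketch; a real pipeline would upsert per partition.
rows = embedded.select("pmid", "embedding").limit(1000).collect()
index.upsert(vectors=[(str(r["pmid"]), r["embedding"]) for r in rows])
```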

Maciej

Let's build a new pipeline that's going to start answering questions from the user based on the medical knowledge that has been accumulated through many, many years of research. Let's start with the question itself. I already have a data set for it, so I can just drag and drop it. Here's my question: what are the common diseases associated with migraines? This is going to be a very simple batch pipeline, so every question I can just hard code directly here. But it's really not that difficult to then take it into a live streaming pipeline and build it, for instance, into your Slack chatbot or your Microsoft Teams chatbot. That being said, let's get started. I have my question here. The very first thing that I have to do, of course, is vectorize it again. Let's create another vectorization component, and we're going to be vectorizing our question column. Let's make sure this works. Let's load up the schema. Awesome. We have our embedding and the question itself.

Franco

Perfect.

Maciej

Now, the most important step: let's actually start looking up similar content that corresponds to our question. For that, we are going to leverage the native vector database functionality for vector lookup, in this case Pinecone. Let's type our index name directly here and choose our embedding. In this case, we are going to fetch the first three articles corresponding to our question. Let's load it up. Perfect. Now we have our embedded question and all the matches that correspond to it. Let's just run it really quickly to make sure our data is correct. Awesome. So we have... sorry, I used the wrong token. Of course, Prophecy integrates with all the best standard practices for storing your credentials, so here we are using Databricks secrets. We just need to ensure that the token names are correct. One final try. This has to work now. There we go. So, starting with our question: the question gets vectorized, and now we are starting to get our matches, articles by ID that correspond to our question, along with the similarity score. We fetched three articles right here. Let's just make sure to flatten this data structure.

Maciej

We're working here with arrays, so I'm going to just say, hey, let's get out all of those IDs, call this match PMID, and extract our final question as well. Now that we have our articles, let's just join them with the corpus of all of the articles and extract the abstracts and titles. We are working here with IDs only. Let's drag and drop our vectorized data set again and join it together. Perfect. We just have to join by the IDs of the articles that were found. You can notice that one of them is a string and one of them is a long, so we'll just have to do a quick, simple cast. Let's also make sure to use slightly easier input names.
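
The lookup-and-join steps can be approximated in plain Python and PySpark like this: query Pinecone for the top three matches, then join the match IDs back to the article table with a cast. The `index`, `clean`, and `spark` objects are assumed to come from the earlier sketches, and the embedding model name is again an assumption.

```python
from openai import OpenAI

client = OpenAI()
question = "What are the common diseases associated with migraines?"
q_emb = client.embeddings.create(
    model="text-embedding-ada-002", input=[question]
).data[0].embedding

# Top-3 nearest articles from the Pinecone index created earlier.
matches = index.query(vector=q_emb, top_k=3).matches

match_df = spark.createDataFrame(
    [(m.id, float(m.score)) for m in matches], ["match_pmid", "score"]
)

# IDs come back from Pinecone as strings while pmid is a long, hence the cast.
joined = match_df.join(clean, match_df["match_pmid"].cast("long") == clean["pmid"])
```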

Franco

Perfect.

Maciej

As you can see, the product gives you feedback in real time whenever I'm making any mistakes. It just automatically tells me, hey, here's the error, you might want to do better. Perfect. Now we have joined our articles. For every single article, for every single vector, we now have our abstracts and titles. Let's go ahead and just accumulate all of that together. I'll use a simple aggregate component so that, instead of the three different rows, we have one single row on the other side. This is going to form the context that we are going to feed into the chatbot itself. I'm seeing that we are slowly starting to run out of time, so I'm just copying and pasting the expression itself. It's very straightforward, though: it concatenates the ID of the article, the title, and the abstract together. Then, of course, let's get the question itself as well. Perfect. Now we're going to have our single source of truth, the context that's going to be provided to our chatbot. It's just a wealth of medical knowledge that is closest to answering the question that we posed in the first place. One final step: let's just drag and drop our OpenAI gem to now formulate the answer to our question.
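
The aggregate step, collapsing the three matched articles into a single context row, can be sketched like this, continuing from the `joined` DataFrame in the previous sketch (the separator and column layout are assumptions).

```python
from pyspark.sql import functions as F

# Collapse the matched articles into one context string, one row total.
context_df = (
    joined.agg(
        F.concat_ws(
            "\n\n",
            F.collect_list(
                F.concat_ws(
                    " | ",
                    F.col("pmid").cast("string"),
                    F.col("title"),
                    F.col("abstract"),
                )
            ),
        ).alias("context")
    )
    .withColumn(
        "question",
        F.lit("What are the common diseases associated with migraines?"),
    )
)
```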

Maciej

We are going to use the 'answer questions for the given context' type of query, pass the context itself, the question, and then a specific template for our prompt. The message that we'll be passing to our chatbot essentially asks OpenAI to answer the question to the best of its ability using its own knowledge, but also the knowledge from the research papers that we are providing in the context, and to make sure that it provides, on the other end, the references along with the actual answer. Let's go ahead and make sure that this works. Perfect. Now we have the schema and we have the actual answer itself. Let's now try to run it and see how the actual answer is formulated. Here we go. After spending 15 minutes building some of those fairly complex pipelines in a very easy way in a low-code environment, the question that we posed, what are the common diseases associated with migraines, now has an answer completely formulated for us by the chatbot. It tells us migraines have common comorbidities, here are some of the diseases, and it also references a research paper that directly answers this question.
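
A final sketch of the answering step, using the OpenAI chat completions API with the context row produced above. The prompt wording and model name are assumptions, not the template the Prophecy gem uses.

```python
from openai import OpenAI

client = OpenAI()
row = context_df.first()

prompt = (
    "Answer the question to the best of your ability, using the research abstracts "
    "provided as context, and cite the PMIDs of the papers you relied on.\n\n"
    f"Context:\n{row['context']}\n\nQuestion: {row['question']}"
)

resp = client.chat.completions.create(
    model="gpt-3.5-turbo",                     # model choice is an assumption
    messages=[{"role": "user", "content": prompt}],
)
print(resp.choices[0].message.content)
```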

Maciej

It's very easy now for any company to start building some of those really complex generative AI applications directly on their data, just leveraging Prophecy and Databricks. All of the stuff that we've built, of course, you get as high-performance code on Git on the other side. Every single gem turned into a function. All of that is available to you on GitHub to get started. And the development part, of course, is just the beginning. You get a tremendous amount of other functionality with the product. You can build more of those gems by yourself, and there's a lot more as well. Perfect. I think I ran over a little bit of time, but I'm going to hand it back to you.

Emily Lewis

That was awesome, Maciej. Thank you so much. Very lively chat. We've got a number of questions. We won't be able to get through all of them, but we'll take a few. First question for Maciej is, is it easy to hand over data analysis from person to person?

Maciej

Yeah. Data analysis can be handed over fairly easily directly through the low-code product itself. Let me figure out what the actual question is asking. But the whole idea of using Prophecy is that any single persona, it doesn't matter if you're a data engineer who knows how to write those regex expressions, to one of your earlier points, or a business analyst who might just want to leverage the business expressions already pre-created for you, can just start building out those pipelines in a very easy, visual manner.

Emily Lewis

Got it. Next question is, what is the extent of data quality checks that can be put in the pipeline? Can we build intelligence to take care of variations in data, and put it in Prophecy?

Maciej

So Prophecy does natively support simple data quality checks within itself. We also allow you to integrate with other data quality providers; there are a tremendous number of companies out there that let you run data quality on your data. But Databricks also features some of those functionalities directly within their platform. And Franco, maybe you want to touch upon that a little bit. So you have a breadth of different options there.

Franco

Yeah. With Databricks, inside of our new pipelining service called Delta Live Tables, and soon coming to Databricks SQL with streaming tables and materialized views, you have the ability to express what we call expectations. An expectation is basically just an expression that evaluates to true or false. If it's true, the record passes fine. If it's false, it does not meet the expectation, and then you can do different things with it. The benefit of this versus other systems is that other systems have to rescan the data to do data quality. We do data quality on ingestion, so you're not paying again to check the quality of the same data. I want to call out one thing, Maciej. What you just demoed, ingesting all of PubMed and then asking it a question: I actually attempted to do this with a client when I first joined Databricks about four and a half years ago. It took me two days and the coding help of three engineers to get something working that was not as nice as this. I just want to put this in perspective. In the chat, people were saying how impressed they were.
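
As a minimal sketch of what Delta Live Tables expectations look like in Python (reusing the assumed table names from the earlier sketches; this only runs inside a DLT pipeline):

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="PubMed articles with basic quality expectations")
@dlt.expect_or_drop("has_pmid", "pmid IS NOT NULL")    # drop rows that violate the expectation
@dlt.expect("has_abstract", "length(abstract) > 0")    # record violations but keep the rows
def clean_articles():
    return (
        spark.table("bronze.pubmed_articles")          # table name assumed from earlier sketches
        .withColumn("abstract", F.concat_ws(" ", "abstract_fragments"))
    )
```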

Franco

I didn't know this was happening, by the way, everybody. I was flabbergasted because I actually had to build something like this four and a half years ago. Maciej just did it in, I don't know, was it 10, 15 minutes? I was impressed. I just wanted to call that out.

Maciej

Yeah, I really appreciate that feedback, Franco. I really loved some of the feedback that was also coming in on the chat. Actually, some of those functionalities that we were showing you are in beta at the moment and are slowly rolling into public preview. So we'll take all the feedback for them and let you try this out as soon as possible. Thanks, Franco.

Emily Lewis

Yes, we are actually at the top of the hour. I know there were a few other questions that came in, so we'll circle back with everybody directly. Apologies that we didn't have enough time, but I think the demo was worth it. We hope everyone enjoyed today's session. We do want to invite you to keep learning about our low-code approach with a personalized one-on-one demo, or we have a 14-day free trial available. Links to both of those can be found in the chat. Additionally, we'll be at the Databricks Data and AI Summit coming up in a few weeks. We are hosting a happy hour and would love to have the opportunity to meet you there. Or if you want to swing by our booth on the trade show floor, you can learn more via the links in our chat as well. That is all we have time for today. Thank you again for joining this session. We hope to see you in one of our next webinars or at the Data and AI Summit. Thanks, everyone.

Maciej

Thank you so much. Take care, folks.

Speakers

Franco Patano
Lead Product Specialist
Maciej Szpakowski
Co-Founder

Ready to start a free trial?

Visually built pipelines turn into 100% open-source Spark code (Python or Scala) → NO vendor lock-in
Seamless integration with Databricks
Git integration, testing and CI/CD
Available on AWS, Azure, and GCP
Try it Free