How HealthVerity Automates Data Pipelines with Low Code
A modern data stack, designed and implemented well, gives organizations the flexibility they need to change data platforms, add applications, and migrate to the cloud. But data engineering and line of business teams struggle to modernize their data stack with traditional ETL methods. Organizations can meet these requirements by automating data transformation with low code.
Data leadership at HealthVerity knew they needed to transform their diverse healthcare data to make it easier for pharmaceutical companies to deliver data-driven innovations. This was no easy task, considering the various sources, formats, and incredibly large volumes coming in on a regular basis.
In this webinar, HealthVerity’s VP of Architecture and VP of Data Engineering join Eckerson Group VP of Research Kevin Petrie and Prophecy Co-Founder Maciej Szpakowski to explain:
- Why future-forward companies choose to build automated, low-code data pipelines now instead of using traditional ETL methods
- How pairing Prophecy's fully managed, low-code data transformation platform with the Databricks Lakehouse Platform shrank the time HealthVerity's data engineering team spends building pipelines from 2-3 months to just ONE WEEK
- The best ways to easily turn visual data pipelines into high-quality code
Ashley Blalock: Hi, everyone. Welcome again to our webinar today, How HealthVerity Simplifies and Automates Data Pipelines with Low Code. My name is Ashley Blalock, I am the Director of Demand Generation here at Prophecy, and I will be your moderator for today's session. Just a little bit of housekeeping before we begin. If you have any questions during the presentation, go to the Q&A panel of your screen. You'll see it located there at the bottom; just type your questions in, and we'll do our best to answer all of them live on the session. You'll also see a chat box there at the bottom of your screen. Use that as well for any comments you'd like to make during the session. But again, if we can centralize questions in the Q&A box, that would be great. Moving right along, I'd like to introduce our speakers for today's session. First, we have Anfisa Kaydak, who is the VP of Data Warehousing at HealthVerity. At HealthVerity, Anfisa is working actively with their data operations team to streamline data pipelines and optimize operations.
Ashley Blalock: Anfisa has many years of experience working with healthcare data, running custom analytics, developing big data solutions, and leading innovative healthcare data transformations. Prior to HealthVerity, Anfisa was leading the International Data Engineering team at IQVIA. Also at HealthVerity, we have Ilia Fishbein, who is the VP of Architecture, where he has a dual focus of actualizing innovative SaaS solutions for the health care industry and championing low code no code tools for data operations and analytics. Having spent over a decade practicing data engineering, product engineering, and automation, Ilia is an advocate for creative, intuitive solutions that maximize the impact of users' expertise. Moving right along, we're really excited to also have Kevin Petrie, who is VP of Research at the Eckerson Group, where he manages the research agenda and writes about topics such as data integration, data observability, machine learning, and cloud data platforms. For 25 years, Kevin has deciphered what technology means to practitioners as an industry analyst, instructor, marketer, services leader, and tech journalist. And lastly, we have Maciej Szpakowski, who is the co-founder of Prophecy, where he's building the first low-code data engineering platform. Having previously built a startup making medical research accessible through ML and Spark, he dreams of a world where data is not feared but leveraged.
Ashley Blalock: And without further ado, I'd like to go ahead and hand it over to Kevin to kick us off.
Kevin Petrie: Okay. Thank you, Ashley. Really pleased to have the opportunity today to speak with folks about a market segment near and dear to my heart. I spent five years in the data pipeline industry on the vendor side with Attunity, which is now part of Qlik. Since joining Eckerson Group to run the research division about three and a half years ago, this has been a frequent area of research. It's something that we work with practitioners quite a bit on the consulting side as well. So I thought what I could do to tee things up for the HealthVerity team and Prophecy as well is talk about how we define the data pipeline market segment, what some of the forces are that are driving a pretty dynamic industry, and give some context about some of the practitioner approaches that we see bearing fruit. So to set the table, I think we all would agree that the supply of data and the demand for data are booming. On the supply side, those three Vs, volume, variety, and velocity, continue to rise within many organizations, especially in the wake of the hyper-digitization that occurred during COVID. And since then, more and more organizations are digitizing more and more of their operations and their interactions with customers.
Kevin Petrie: And that means they're throwing off a whole lot more data. That's an opportunity in terms of analyzing what that means from a business perspective. On the demand side, users proliferate. They're using new devices, they're using new applications. Use cases are abounding, especially as more and more organizations seek to democratize access to data and drive more data-driven decisions. So in between, you've got these pipelines that can have bottlenecks, data quality issues, budgets that go awry. There's a lot of challenge right now on data pipelines, and on the data engineers in particular, as you've got multiplying sources and targets, business requirements changing, sometimes on a dime, and complexity mounting. So what we see is that data engineering teams and IT teams are struggling to build and manage the data pipelines that feed modern business. So modern data pipelines need pretty careful management. And I'll talk about the different market segments that we see here. There are a lot of different ways to slice it. But if we build up from the bottom on this chart, the first is that most organizations, if they've been around since before, say, 2010 or 2015, have some on-premises infrastructure.
Kevin Petrie: They're moving to the cloud, taking more of a cloud-first approach. So they have hybrid environments. Oftentimes they're selecting multiple cloud providers. This means you've got a pretty complex polyglot architecture in which you're trying to manage your data, your schema, other types of metadata. So you need to ingest the data, extract from the source, load it to a target, often to support analytics. You need to transform it, perhaps in flight or once it arrives on the target. I'll talk more about what that means. Transformation is critical. It's something that Prophecy spends a fair amount of time on. DataOps means that you're looking at ways to optimize your pipelines through CI/CD, continuous integration and continuous deployment or delivery of pipeline code, branching and merging different pipelines in order to optimize how they deliver data in a timely and accurate fashion. And then there's the need to orchestrate these pipelines. This overlaps very much with DataOps, but you'd want to be able to schedule and monitor and optimize the workflows across pipelines and then between pipelines and the applications that consume the data. So there's a lot going on in this segment, data pipeline management.
Kevin Petrie: It comprises these four segments: data ingestion, data transformation, DataOps, and pipeline orchestration. I'll just double-click quickly on what we mean when we say data pipeline. The ingestion can involve extraction or capturing. It could come from a stream, it could come from a batch. And then transformation involves a lot of different types of data manipulation: filtering data to identify what really matters for a given use case, merging data sets, merging tables, potentially changing the format, restructuring it, cleansing it, validating that it's accurate. And then the delivery, which may take place before transformation or after. An ETL or ELT approach is going to involve loading the data, appending it, merging it, a lot of different ways in which you're putting the data on a target in order to support maybe operations, maybe analytics, maybe both. A lot of analytics use cases and analytics functions are getting embedded into applications. Okay, so I think that a critical piece of the data pipeline segment is data operations. And as I mentioned before, there are really four segments here. I've got orchestration here because I think it overlaps a fair amount with DataOps. The first is continuously integrating and delivering data pipeline code, so frequently updating and improving the pipeline, the code that's going to accurately and reliably transform and deliver data for consumption.
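The transformation steps Kevin lists (filter, merge, cleanse, validate) can be sketched in plain Python. This is a minimal illustration with invented record fields, not HealthVerity's or Prophecy's actual code; a production pipeline would do the same work over Spark DataFrames at scale.

```python
# A minimal sketch of a transformation stage: filter, merge, cleanse,
# and validate a batch of claim records. All field names are invented
# for illustration; real pipelines would operate on Spark DataFrames.

def transform(claims, providers):
    """Filter, merge, cleanse, and validate a batch of claim records."""
    # Filter: keep only the records that matter for this use case.
    relevant = [c for c in claims if c["status"] == "approved"]

    # Merge: join claims against a provider lookup table.
    by_id = {p["provider_id"]: p for p in providers}
    merged = [
        {**c, "provider_name": by_id[c["provider_id"]]["name"]}
        for c in relevant if c["provider_id"] in by_id
    ]

    # Cleanse: normalize formats (e.g. trim and uppercase state codes).
    for row in merged:
        row["state"] = row["state"].strip().upper()

    # Validate: required fields must be present and non-null.
    for row in merged:
        assert row["claim_id"] is not None, "claim_id must not be null"
    return merged
```

The same shape applies whether transformation happens in flight (ETL) or after loading on the target (ELT); only where the function runs changes.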
Kevin Petrie: The more you can automate this, the more you can really streamline what's happening. You can improve agility and you can try to reduce the burden on data engineers so that other types of stakeholders in your organization can start to handle basic pipelines. So that's a major trend, and there are a lot of commercial tools that are starting to automate different aspects of data integration, looking in particular at DataOps and the handling of pipelines. Orchestration is connecting all the data workflow elements and tasks and automating that data journey end to end. Testing is critical: understanding whether the data is accurate, looking at samples, looking at checksums, looking at schema. If the schema changed, you need to make sure you're accommodating that on the target side, and also looking at pipeline functionality, testing it both in development.
Kevin Petrie: And in production. Monitoring is critical. You need to understand where you are in terms of delivering the data, in terms of pipeline performance, and in terms of checking that data to make sure that you're getting the right notifications if something goes awry. So DataOps, this discipline, can help make data pipelines more efficient and effective. I'll talk a little bit about the benefits here. There's a lot of pressure right now, as I said, on IT and data engineers. If organizations can reduce that pressure through automation, by enabling stakeholders beyond just data engineers to handle data transformation in particular, they can start to democratize access and they can enable enterprises to extract more analytical value from their data. They can gain agility, so that a business unit can more quickly spin up something new to address trends in the market that they're seeing. Or you could start to operationalize real-time analytical functions to support, for example, machine learning code that's going to automatically generate a customer recommendation. Data uptime, critical here. More organizations view data as the circulatory system of their organism, of their enterprise. And so you've got to have the uptime to drive smart decisions and smart actions.
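The tests and monitors Kevin describes, checksums, schema checks, and data quality signals, can be sketched as small functions. This is a hedged, stand-alone illustration; names and thresholds are invented, and a real deployment would wire these into pipeline monitoring and alerting.

```python
# Illustrative pipeline checks: schema validation, a content checksum
# for source/target comparison, and a null-rate quality signal.
import hashlib

def check_schema(rows, expected_columns):
    """Verify every record carries exactly the expected columns."""
    return all(set(r) == set(expected_columns) for r in rows)

def checksum(rows, key):
    """Order-independent content checksum over one column, so the same
    data on source and target produces the same digest."""
    h = hashlib.sha256()
    for value in sorted(str(r[key]) for r in rows):
        h.update(value.encode())
    return h.hexdigest()

def null_rate(rows, column):
    """Fraction of null values in a column -- a basic quality signal
    that monitoring could alert on if it crosses a threshold."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r[column] is None) / len(rows)
```

Running `checksum` on both ends of a load and comparing digests is one simple way to "attest to validity", as Kevin puts it later.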
Kevin Petrie: And efficiency and productivity, obviously, are part and parcel of everything we're talking about here, because if you can enable your data pipelines to deliver data in a more reliable and accurate way, and if you can improve the output per hour of the data engineers or other stakeholders who are handling data pipelines, you can really start to generate new business benefits, on both the top line and the bottom line. So effective data pipelines are helping capture the upside of data and reducing the downside. I mentioned before that organizations continue to digitize very quickly and create a whole lot of data, and you don't want, as an enterprise, to drown in that. You want to find the gems that are there so that you can gain competitive advantage, because as an enterprise, if you don't, your competitors will. So I'll conclude with a couple of thoughts here. There are a lot of data pipeline tools on the market. Prophecy is doing some great innovative work. And I think it's important to determine what you're really looking for. There were four segments we identified within the market, ingestion, transformation, DataOps, and orchestration, and there are suite approaches.
Kevin Petrie: There are also specialized tool approaches. So I think it's important for your team to understand what breadth of functionality you're really looking for. Do you want to go deep on transformation? Do you want to go broad because you have pretty simple needs? Performance and scale are important, though I think they are becoming less of a challenge for a lot of organizations. Spark, on which Prophecy is based in a lot of ways, really helps accelerate performance and, using elastic cloud infrastructure, helps deliver the scale and the performance that's required for a lot of workloads. Automation is very important. The more you can delight data engineers and other pipeline handlers, and reduce the time that's required, especially for repetitive tasks, the more you're going to realize the real benefits of automation in terms of productivity. And then extensibility. This is critical. There are a lot of tools and elements, data sets, and workloads in modern environments. And so it's important that you can add a component, extend a pipeline, or adapt a pipeline to handle a new source or a new target, or to integrate in a new way with Apache Airflow, which helps with workflow automation and orchestration.
Kevin Petrie: So I think extensibility is critical, and you want to make sure that the pipeline tool you're evaluating is able to properly integrate with these different elements and ensure interoperability and open access to data. Governance, we could talk all day about governance. That remains a very high priority among folks that we work with. We have a research partner at BARC that does a survey every year of global enterprises about top trends in data management. I think for seven years running, data quality has been the top one. So there's a lot going on with governance, and what you want to make sure from a data pipeline perspective is that your data pipeline tool can produce the right visibility into data quality. It doesn't need to solve all the problems, but you want to have at least the right visibility into data quality so that you can help stakeholders attest to its validity. So what we recommend is that organizations prioritize their criteria by business need, and define not just the current requirements, but the future requirements. Endpoints are going to change, processes are going to change, environments and infrastructure are going to change. If you can find yourself a data pipeline platform that's going to accommodate those future changes, you're really going to reduce downstream friction.
Kevin Petrie: And I very much encourage folks to keep in mind what 2024 and 2025 requirements might be appearing. Final point here, critical. We're talking a lot about tools and a lot about technology, but people and process matter hugely. And I'm sure the HealthVerity team will get into that as well. How well are these tools enabling your people to execute process effectively? So select the data pipeline tool that's going to empower and streamline your business for the future. So now I think at this point I'm going to hand over to Anfisa. Is that right?
Anfisa Kaydak: Thank you, Kevin. Thank you, Ashley.
Kevin Petrie: Are the slides sharing? Yes.
Anfisa Kaydak: I'm delighted by this opportunity to talk about HealthVerity and our journey with Prophecy. Next slide, please. It's a couple of slides of overview of what HealthVerity is doing. HealthVerity's headquarters is located in Philadelphia. We currently have about 250 employees, and probably half of the engineers work remotely across the US. We practice a mixed approach: people working remotely, people working on site, trying to accommodate the new reality as we're moving into it. HealthVerity provides an infrastructure for identity, privacy, governance, and data exchange for pharma clients, insurance, and government. Our data is cross-linkable, and our transformations support multiple healthcare industry standards. We are managing about 200 billion transactions from over 340 million Americans. It's a lot, a lot of data. So we're talking about petabytes. Next slide. Yes. So we are a trusted real-world data platform for life sciences companies and federal agencies. And we are partnered with top pharma manufacturers to enable and manage patients in chronic situations. We are working with brand teams, with product teams. We have at this point over 250 leading health care organizations who are leveraging HealthVerity data.
Anfisa Kaydak: Next. Our vast healthcare and consumer data ecosystem consists of over 75 unique data sources from different healthcare data providers, representing, again, as I said, almost the entire US population. We provide clients a comprehensive UI marketplace where they can select their area of interest and build a custom cohort that instantly tells them how many patients are in each group and where the groups overlap. As they go through this exploration of our data, they build the data set that will be required to perform their analysis. Then this data set will be delivered to them in a form that is compliant with their internal data systems. This is my short introduction of HealthVerity as a company. Next will be Ilia, who is VP of Architecture at HealthVerity; he started the Prophecy journey.
Ilia Fishbein: Thank you, Anfisa. Yeah. Our Prophecy journey, which I'll describe in how we ended up with Prophecy, and then obviously we'll take over with where we're going from here, really starts at those challenges that Kevin mentioned earlier: an explosion of demand for data in the healthcare industry, which has led to an explosion of data supply in the healthcare industry. Companies that never before considered monetizing their data, because that was not their main source of revenue, have now seen the opportunity of making data available, some of them because they have tons of data, and some of them because they have extremely unique data. And HealthVerity sits in the middle of that. So we are the ones charged with handling terabyte-scale data for these 75-plus different data sources. And we're constantly bringing in new sources as we identify them. And again, we already work with most of the major players in the industry, but some of the demand we're seeing is for small specialty pharmacies that, again, are very unique. No one has ever asked them for their data, or no one has ever considered digitizing or monetizing it. And in doing so, we are constantly onboarding new data sources.
Ilia Fishbein: And when we built the company, we were really focused on scaling our transformation performance, our extraction, our loading. So we are very much streamlined there. And what it leaves is a hurdle... Or sorry, before we get to that, incentives. If we can do this, we can obviously get more revenue. The more data sources we have, the more suppliers and data deals and opportunities we can satisfy. And the faster we can bring them on board, the better we can satisfy opportunities not just broadly out there, but ones that are arriving just in time. Someone has a clinical trial that is closing in the next two months. They really want to augment that clinical trial data with some real-world data, but two months from now it's going to be irrelevant to them. So they have found the partner, or they want HealthVerity to work with someone, but they need to bring that data in just in time. So for us, it's critical not just to have a really wide ecosystem of data, but to be able to bring in those partners just in time. And the hurdle we're encountering is not the processing. We can do the processing. You give us a terabyte of data, you give us multiple terabytes of data.
Ilia Fishbein: We have solved that with Databricks and other tools to get it done in hours. Now our challenge and our hurdle is really in the planning. How do we design and develop these data mappings? Because health care is a highly specialized space. It's really difficult to train data engineers to understand health care data at its origin and how to massage it into an analytical form that customers like pharmaceutical companies can then use for clinical trial purposes, for going into new drug research opportunities, and so forth. So we ended up organically with this complex process that involved a data architect who really is a subject matter expert in health care data and has SQL skills but doesn't really have big data skills. You have your data engineer who has big data skills but doesn't have the healthcare expertise. And you have these handoffs between design, development, integration, source control. And then finally, we also have data QA engineers in the process because of the importance of data quality. And this process, along with several iterations in the loop, just takes multiple months to come up with the transformation logic that will be applied in order to get it from whatever proprietary source we have into one of HealthVerity's common models so it becomes actionable and sellable.
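The "data mapping" artifact Ilia describes, a subject matter expert declaring how proprietary source columns translate into a common model, can be sketched as a declarative spec applied by a generic function. The column names and mapping here are invented for illustration and are not HealthVerity's actual common model.

```python
# Hypothetical mapping from a proprietary claims feed into a
# common-model shape. A SQL-skilled architect could author the spec;
# a generic engine applies it, so no big-data coding is needed per feed.

COMMON_MODEL_MAPPING = {
    # target column      how to derive it from a source record
    "patient_id":   lambda r: r["MBR_NUM"],
    "service_date": lambda r: r["SVC_DT"],
    "diagnosis":    lambda r: r["DX_CD"].upper(),  # normalize code casing
}

def to_common_model(source_rows, mapping):
    """Apply a declarative column mapping to every source record."""
    return [{target: fn(row) for target, fn in mapping.items()}
            for row in source_rows]
```

Separating the mapping (what the expert knows) from the engine (what the engineer builds) is the handoff problem the low-code approach is meant to collapse.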
Ilia Fishbein: Additional challenges come up from this organic process. Difficulty peer reviewing the work: even when data architects are sitting next to each other as we are onboarding new sources, they have a hard time understanding what the work looks like. There's a lot of artifacts being passed around. It's not easy to just pull one of these things up and understand, oh, okay, I see what we did here and I can apply it to my use case. Similarly, knowledge sharing outside of the data organization: we have other subject matter experts in healthcare data tasked with other roles within the company, sales specifically, but also data solutions, data product packaging, and so on and so forth. They sometimes have a hard time understanding, how does this particular source that I'm interested in, which I think is valuable for my data package, fit in? What logic did we apply in order to make it analytically available, and how do we apply it more broadly? Finally, the organic process is just expensive and difficult to scale. We can hire people to do more of these sources in parallel, but one, that's expensive in itself, and two, it doesn't really solve the just-in-time data capability that we need.
Ilia Fishbein: If it takes 12 weeks for one person and you hire a second person, sure, you can do two, but the lag time is still 12 weeks. We need to shorten those 12 weeks to something that is more palatable to us and our partners. So we started on a journey to figure out how we're going to do this. The selection process took about eight months. We started in January of 2022 and ended roughly in late August of the same year. We reviewed many different vendors, everyone from your blue-chip enterprises out there like Informatica to small startups that only recently got funded in the last year or two and have some really innovative ideas but potentially haven't executed on them yet. We really did our homework. I personally spent probably more than 100 hours in discussions with vendors: requirements gathering, understanding their capabilities (which changed our requirements), seeing their demos, both generic and more tailor-fit for us, doing proof-of-value exercises, at which point hundreds of hours were spent across the team validating whether these were the right tools for us, and listening to customer testimonials. Prophecy and others were kind enough to share those.
Ilia Fishbein: Then finally negotiating and deciding on who we were going to go with. In short, we did our homework. Our goal was really to move from this disparate process involving multiple teams, roles, people, and a variety of artifacts into something a little more integrated that really empowered the data architects. Something that Kevin talked about: this is what we really want to do. We want to put the subject matter expert in the driver's seat. They go from understanding the source data model to designing what the mapping looks like between that model and HealthVerity's, developing the code or the business logic, integrating that into our system, source controlling it for auditability and trackability, testing it themselves, and finally bringing it to production themselves. And that we envisioned as the huge savings that would enable the optimization that we wanted. Critical requirements that we had: we wanted a modern stack. We've been using Spark and Airflow and other modern open source tools that have come out of places like Airbnb and Uber and Facebook and others for the entire history of the company. We were not looking to move away into something worse, basically. Low code, no code: what we're finding is that our subject matter experts in health care data are not coders.
Ilia Fishbein: A high-code system doesn't speak to them, and training them on high-code systems is not going to give us the right results. Custom transformation capabilities: healthcare has unique challenges in terms of how data needs to be transformed, and HealthVerity in particular. Sometimes the capability doesn't exist out of the box, and we can't afford to wait for vendors and partners to develop it. We need to be able to leverage our own engineering capabilities to do it ourselves. Data preview: really about shortening the feedback loop to the data architect as they're doing development. They make a change, they want to see the results of that change. That's how they validate whether or not it's accurate. Version control: critical. I come from an engineering background; it's critical when you do code, and it's critical when you do data logic. So it has to have some integration there. Finally, a low learning curve. What we didn't want to do is take an existing process that everybody knows, uproot it, and replace it with a new process that would then take us months or maybe even years to get up to speed on. The answer to all of that ended up being Prophecy.
Ilia Fishbein: These slides are actually from the deck I shared internally with our leadership team to explain to them why this is the right tool, and Prophecy's UI is so simple that even they understood it, and some of them have nothing to do with data at all. One is this pipeline design interface. It's nice and clean and gives you a very good overview of what your transformation looks like at a high level. You can change your Spark configurations, you can do scheduling, you can actually view the code itself that Prophecy has generated on the back end, you can attach it to your Databricks clusters, and then you bring in all the transformation elements visually before you start getting into the details of the logic. Additionally, you have here at the bottom left those source control elements, using Git with integrations in GitHub. Then finally, on the right, you have a play button to run the entire transformation through or get previews of what different logic pieces look like. Then as you dive into the actual transformations, again, very clean, very easy to understand what is going on. You have your inputs of source columns on the left.
Ilia Fishbein: In the middle here, you have the target columns and expressions that you are applying, easy UI elements that allow you to bring everything in or take everything out, with no boilerplate code. You can very easily see: what exactly are you doing with this data? It makes it easy both for the developer as well as other subject matter experts and peers to review what is going on. Then on the right here, you have a little bit of a code snippet showing that all of this is converted into efficient Spark code that can run on any Spark system. This is an example of the preview page, or the preview screen, you can see. Now you've written your transformation, you want to get quick feedback: did I do it correctly? This gives you a very high-level view of what the result looks like. Here you can immediately tell: did I expect this to happen? Did I not expect this to happen? Do I need to change my transformations? In our current setup, it usually requires several loops, often handoffs, potentially going into another system. The integration of the preview ability and the development ability was great.
Ilia Fishbein: Then you can dive even deeper into each one of these things and get some more profiling information. Maybe the sample of the output data wasn't enough; you really want to understand, did I create too many nulls? What's the minimum, the maximum, the standard deviation? And other high-level summary information that will let you know if you did this correctly. Data lineage has been a big conversation within HealthVerity. Once again, we are bringing in lots of sources, but often they're coming from the same domains: medical claims, pharmacy claims, electronic health care records, or electronic medical records. One of the things that we often try to understand is, are we consistently applying the same logic to all of these things? Because they're complex and different people work on them, but we need to ensure that there is consistency. The lineage view really helps us do that. We can track things like, how did this source column make its way to our target in this particular source or feed as opposed to another one? Or, the other way around, what sources are mapped to this particular target, in the event that we find an issue or we want to reconsider the logic.
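The two lineage questions Ilia raises, "where did this source column end up?" and "which source columns feed this target?", reduce to forward and backward queries over a set of column-level edges. A minimal sketch, with hypothetical column names that are not HealthVerity's actual schema:

```python
# Column-level lineage as (source_column, target_column) edges.
# All names are invented; a real catalog (e.g. Unity Catalog) tracks
# these edges automatically as transformations run.

LINEAGE_EDGES = [
    ("pharmacy_feed.ndc_code", "common.drug_code"),
    ("claims_feed.ndc",        "common.drug_code"),
    ("claims_feed.member_id",  "common.patient_id"),
]

def targets_of(source):
    """Forward lineage: where did this source column end up?"""
    return sorted(t for s, t in LINEAGE_EDGES if s == source)

def sources_of(target):
    """Backward lineage: which source columns feed this target?"""
    return sorted(s for s, t in LINEAGE_EDGES if t == target)
```

Backward lineage is what supports the consistency check Ilia mentions: if two feeds map into the same target column, you can pull both mappings up and compare the logic.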
Ilia Fishbein: Okay. Then the icing on the cake. On top of our core requirements that we absolutely had to have, these things made it really easy for us to use Prophecy, adopt it, and then sell it to our executive team. First, it's deployed in HealthVerity's environment. We work in a regulated space. Privacy is important. HIPAA, GDPR, and various other things come into play there, and we have government clients who really care about where the data sits. So the fact that we can say it sits in a HealthVerity AWS account is very, very critical. SSO, again, both from a privacy and security perspective for our business as well as ease of use for our employees, SSO is just a critical feature. Open source code generation: this was a really big selling point for our executive team. The fact that it's using Spark behind the scenes and we can build our own CI/CD pipelines independent of Prophecy really spoke to them. They loved the idea that, should it ever come to that, there is really no vendor lock-in, and we're using something open source, and it is the same open source system that we use today.
Ilia Fishbein: Simple pricing model. The complexities that I've seen elsewhere, I will not bore you with the details, but they were very unappealing. Here we knew exactly what we were paying for and what we were getting, and we weren't worried about what happens if we go over a particular tier of volume and suddenly the costs explode. Then finally, a great customer testimonial. Prophecy connected us with an existing customer who just raved about the system, and we loved hearing that. I'm going to turn this back to Anfisa to talk about our journey in implementation.
Anfisa Kaydak: Thank you, I was just typing an answer. We started the journey by creating a cross-functional team. Why a cross-functional team? Yes, this tool, Prophecy, is oriented toward data architects who are not that technical, but we wanted to augment the team with a data engineer at the beginning so the adoption would be easier. The team itself had four people: a data engineer, a data architect, an operational engineer, or data ops, and a QA person. The idea was that the data engineer would start working with Prophecy. He would build templates, and he would train the data ops people and data architects on how to use these templates. In the future, they can just open a template and create another pipeline by saving the existing template and making modifications in the copy. This is the model we agreed on from the beginning, and we started our journey. The data engineer started to explore the interface. The learning curve was very quick, probably under one day, just going through the quite comprehensive Prophecy documentation to understand the interface, how to use the tool, and how to make reusable components in it. The data engineer created a simple pipeline, what we call the template, probably within a couple of days.
Anfisa Kaydak: Integration with Databricks, our data warehouse, was very smooth. Our input was Databricks tables and our output was Databricks tables. We didn't plan to use Prophecy as a cross-platform integration tool. As I explained, our use case was that we were staying on the same platform but using this tool to simplify transformations: writing this quite convoluted code with multiple case statements, renamings, and mappings. Writing it manually takes a long, long time; writing it through the interface is much, much easier. It was integrated with Databricks, reading from Databricks and writing to Databricks. Where we hit a little bit of a roadblock was when we started integrating Unity Catalog. We were working on this project in January of this year, and Unity Catalog had just been released. For those who don't know what Unity Catalog is, it's a data catalog, or metastore, for Databricks that supports information schema and data lineage. For us, it was very important that everything we were developing, every transformation, went through the data catalog, and that as we ran these transformations, we were able to track lineage from the Databricks interface.
Anfisa Kaydak: Since we were working on this in January, and Unity Catalog's public release was, I think, in December, it was literally one month old. We did hit a roadblock initially: Prophecy did not support it. We submitted a ticket. We had our own Slack channel with Prophecy support engineers, and we went back and forth on our error messages to identify what the problem was. We had to do a few tweaks to how the tool was installed, and we waited for a feature release from Prophecy. All of this back and forth took about two weeks. By the end of the two weeks, we had everything seamlessly working with Unity Catalog. We were able to show transformations through lineage, with everything stored in the information schema. We also had a few hiccups moving our UDFs, because in addition to the Prophecy interface, where you can create reusable components through gems, we also wanted to use existing components that we already had, what we call user-defined functions, or UDFs. There were a few hiccups there too, but we probably spent a couple of days resolving them, and then everything worked seamlessly.
Anfisa Kaydak: By the end of three weeks, the data engineer was able to hand off this pipeline, fully integrated in our environment, and run the training for the DataOps team. At that point, the implementation journey was completed. That's it. This is our experience. I didn't realize I had a final slide, but it was a pleasure working with Prophecy. I do enjoy this exchange, and I believe they have a bright future. I'm looking forward to the new features they will be working on over the next few years.
Ashley Blalock: Thank you so much, Anfisa. Really appreciate it. Both Anfisa and Ilia, fantastic presentations. With that, I'd love to hand it over to Maciej to go ahead and demo Prophecy.
Maciej Szpakowski: Perfect. Thank you so much, Anfisa and Ilia. I think I can speak for the whole company: it's been a true privilege to have you as customers. Thank you very much for this deep overview of your use cases. I'm sure a lot of the audience could really relate to it as well. Kevin, your market overview was really spot on. That was awesome. We've heard so much good stuff from those guys, and I'm sure a lot of you are curious how you can get started and experience all the amazing benefits that Anfisa and Ilia talked about. I'm going to show you that right now. Let's dive into it. I hope all of you can see my screen; Ashley, please yell at me if not. We will start where a lot of you, I'm sure, might already be: on Databricks. If not, don't worry, Prophecy works on any other Spark as well. If you are on Databricks, however, even as I'm speaking, you can already start trying Prophecy out. We are really good partners with Databricks: one click on Partner Connect, then the Prophecy tile, and you'll be able to get started with the product and can even follow along with this really quick demo.
Maciej Szpakowski: So let me show you how to actually build a complete pipeline. I think Ilia touched on this point really well when he discussed the transformation routine: many different stakeholders having to build, test, and deploy pipelines. Let me show you how to do all of that in just five minutes for one complete pipeline, all the way to production. We're going to focus on a healthcare-related use case as well. I already have an existing project here. This project is linked to a repository; everything is well versioned on Git with commits. So Prophecy follows all the best software engineering practices as you keep using the product. If you're a developer, you'll love this view: all the code is available to you right there. If not, and you just want to get your job done, no problem, we can dive right in. Let's go ahead and create a brand new pipeline in this particular project. There we go. Here we have a project that's already set up with some healthcare data: patients, outcomes, and visits of patients, inpatient visits into hospitals. So let's build a simple monthly report for that. Once the pipeline is created, we land on the canvas.
Maciej Szpakowski: Now on this canvas, we can start building out the actual reporting workflow. A couple of important shout-outs. Top right is where our Spark cluster connection is. With one click, I can connect to an existing Databricks cluster or any other Spark cluster, and it's just as simple as that. Now we'll be able to not only write code, but also run it, execute it, and make sure we are actually doing the right thing. On the left-hand side, we have all of our gems that allow us to connect to our data sources, so we can very easily drag and drop them and start building up our pipelines. There are other gems that allow us to build transformation logic. So let's go ahead and build out our pipeline. We are going to leverage some of the patients and encounters data and then build the report on top of that. I have a source here; I can just connect directly to my patients dataset. I have this dataset already defined as a Unity Catalog table on Databricks. As Anfisa mentioned, Prophecy now supports Unity Catalog tables really well, so we will be able to connect to that.
Maciej Szpakowski: And I can see all of the tables and databases visible to me right here. Perfect. One click away, I can preview the schema, make sure this is the right dataset I'm trying to ingest, and then look through the actual data. Prophecy, of course, allows you to connect to pretty much any other data source that Spark supports, plus we have many enterprise-ready connectors at the tips of your fingers. Perfect. Now we have our dataset created. That was super easy. For the purpose of this demo, we'll need one more dataset with our encounters, which is where patients actually go and have hospital or doctor visits. Here we go. Let's just drag and drop that encounters table. I already set it up before, so it's a simple drag and drop here. Now that we have both of those sources ready, we can join them together. I can just drag and drop a join component, and it gets snapped automatically into my pipeline and connected. This is the first transformation component we are looking at, and Prophecy makes it really easy to build out transformations in the product.
Maciej Szpakowski: On the left-hand side, you have your schema, so you know exactly what you're working with and all the types. On the right-hand side, you see the actual transformation logic. Once you start typing the transformations, you'll of course see which columns and functions are available, so it's really easy to build them out. Now, if you do make a mistake at any particular point in time, the tool automatically tells you about that as well, so it's very easy to recover. Pretty much the only skill really required for you to be productive here is knowledge of Excel-like functions. You know how to write basic expressions in Excel? Perfect. Then you can become a super productive data engineer on Spark with Prophecy. Cool. So we now have our join running. We can go ahead and start testing our pipeline, seeing the data, and making sure it's doing the right thing. One click away, I can interactively run this pipeline in real time and see the data flowing really fast. Here I'm processing tens of thousands of records, and I can start previewing those data samples.
Maciej Szpakowski: As Ilia mentioned, not only are the data samples available here, but also some basic statistics about your actual data, so you can make sure there are no nulls and the profile of the data looks correct. Really useful when you're building out your transformations to make sure you're doing exactly the right thing. Perfect. Now we have our join ready. There are many different types of encounters here, though, and I would like to focus only on inpatient encounters. Let's put in a simple filter component that will allow me to filter that data down to just the inpatient encounters. And we can, of course, keep testing as we go. Now that we have our data filtered and narrowed down to the 7,000 encounters we really cared about, we can build out our report and aggregate the data very easily. Here we're going to calculate the number of visits and the average encounter time. We're going to do this for each city and each reason the encounter was completed. Perfect. It was just that simple. Now we have our aggregate running and all the data for our report coming out, and we can see exactly, for each city and each different type of disease, how many encounters occurred and what the average time span of each encounter was.
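The join, filter, and aggregate steps Maciej walks through can be sketched in plain Python. This is only an illustrative analogue of the logic; the real pipeline runs as generated Spark code, and all column names, dataset names, and sample values below are invented for the example.

```python
# Illustrative plain-Python sketch of the demo's report:
# join patients to encounters, keep inpatient visits, then
# aggregate visit count and average length per (city, reason).
# Names and values are hypothetical, not from the actual demo data.
from collections import defaultdict

patients = [
    {"patient_id": 1, "city": "Boston"},
    {"patient_id": 2, "city": "Austin"},
]
encounters = [
    {"patient_id": 1, "encounter_class": "inpatient", "reason": "asthma", "length_hours": 48},
    {"patient_id": 1, "encounter_class": "outpatient", "reason": "checkup", "length_hours": 1},
    {"patient_id": 2, "encounter_class": "inpatient", "reason": "asthma", "length_hours": 24},
]

# Join: attach each patient's city to their encounters.
city_by_patient = {p["patient_id"]: p["city"] for p in patients}
joined = [dict(e, city=city_by_patient[e["patient_id"]]) for e in encounters]

# Filter: keep only inpatient encounters.
inpatient = [e for e in joined if e["encounter_class"] == "inpatient"]

# Aggregate: group by (city, reason), compute visits and average length.
groups = defaultdict(list)
for e in inpatient:
    groups[(e["city"], e["reason"])].append(e["length_hours"])

report = {
    key: {"visits": len(hours), "avg_length_hours": sum(hours) / len(hours)}
    for key, hours in groups.items()
}
```

In the visual editor, each of these three steps corresponds to one gem on the canvas; the sketch just makes the underlying dataflow explicit.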
Maciej Szpakowski: Perfect. This is almost ready. I would like to clean up this report just a little bit, however. We can do that by leveraging some logic that one of my colleagues has already built out. Prophecy allows you to build and share logic freely across your team so that there is less repetitive work. Really easy: I can just drag and drop this additional subgraph, connect it right here, and run it. Now your teams will be able to be even more productive. Here we go. Now that the subgraph has cleaned up my data a little, I can see more detailed descriptions for the reasons and a cleaned-up length in hours. Perfect. Now our report is almost ready. Let's just write it out to a table and we'll be basically done with our development step. We can create a brand new table on our lakehouse. Just like this, we have our complete pipeline fully ready for operationalization. We can go ahead and run it one last time to make sure it runs. As we mentioned, we do want to run through the full cycle of not only developing the pipeline, but also testing and releasing it.
Maciej Szpakowski: Let's really quickly add it to our schedule and release it from there. Within this project, as we mentioned, we already have some ingestion and cleanup running, so now we can just add our monthly report to run directly after our clean jobs are finished. Now, every single time this monthly schedule runs, my report is also going to be built on fresh data. Okay, awesome. Now we have the schedule created. Let's go ahead and release all of the hard work we've just put in to the production environment. One click away, we're moved to our Git screen, where with just a couple of simple clicks we'll be able to propagate and push our changes to prod. Everything done in Prophecy happens directly on Git; we allow you to leverage all the best software engineering practices right here. Let's start by committing all of our code. Perfect. Now, just like that, our code is committed to Git, and we can start seeing it. The last step is just to merge it back to our production branch. We can, of course, leverage pull request mechanisms here so other folks on our team can review our code.
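The scheduling step described above (the monthly report runs only after the clean jobs finish) is ordinary dependency ordering. A minimal sketch using Python's standard-library `graphlib`, with hypothetical job names; Prophecy itself handles this through its low-code scheduler rather than any code like this:

```python
# Sketch of job ordering: the monthly report must run after
# ingestion and cleanup complete. Job names are hypothetical.
from graphlib import TopologicalSorter

# Map each job to the set of jobs it depends on.
schedule = TopologicalSorter({
    "ingest_patients": set(),
    "ingest_encounters": set(),
    "clean_data": {"ingest_patients", "ingest_encounters"},
    "monthly_report": {"clean_data"},
})

# A valid execution order always places the report last.
run_order = list(schedule.static_order())
```

The same dependency graph is what an orchestrator like Airflow, or Prophecy's scheduling, evaluates before kicking off each job.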
Maciej Szpakowski: But just for the purpose of the demo, let's go ahead and merge it directly. Just like this, I can now see on my Git repository that I have a brand new commit directly in my main repo, all clean code generated for us. And of course, we can also look at this clean code whenever we want, directly in our tool; we're going to look at it in a second. Let's just go ahead and release all of this to production so that our schedule actually starts running. This is where the actual CI/CD steps happen: the tool builds the packages, tests them automatically for us, and then puts them on the schedule and kicks off the actual jobs. This process usually takes around a minute, but in the meantime, we can look at all the code that was generated for us. Prophecy is not just a visual tool where you build out your pipelines in a low-code, easy manner. Everything is stored, all the time, as high-quality code on Git that engineers can also edit. Every single gem in Prophecy generates a function, and I can dive into any one of those functions, like our filter here, and see very high-quality code, as if a data engineer had written it.
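Maciej's point that every gem becomes a readable function can be illustrated with a plain-Python analogue. The real generated code is PySpark or Scala on DataFrames; the function names, fields, and list-based "rows" here are invented purely to show the one-gem-one-function shape:

```python
# Plain-Python analogue of "each visual gem generates one function".
# Prophecy's actual output is Spark code; everything here is hypothetical.
def filter_inpatient(rows):
    """Filter gem: keep only inpatient encounters."""
    return [r for r in rows if r["encounter_class"] == "inpatient"]

def limit(rows, n):
    """Limit gem: keep at most the first n rows."""
    return rows[:n]

def pipeline(rows):
    """The pipeline is just the gems' functions, chained in order."""
    return limit(filter_inpatient(rows), 10)

sample = [
    {"encounter_class": "inpatient", "reason": "asthma"},
    {"encounter_class": "outpatient", "reason": "checkup"},
]
result = pipeline(sample)
```

Because each gem maps to one named function, an engineer can review, test, or edit any single step without touching the rest of the pipeline, which is what makes the generated code reviewable in pull requests.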
Maciej Szpakowski: Perfect. Now our job is almost ready as well. In just about ten minutes, we built a completely new pipeline, scheduled it, and released it to production, a task that usually requires many different stakeholders. Here it can be done by pretty much anyone. Thank you so much, guys. Ashley, back to you.
Ashley Blalock: Thank you so much, Maciej. Awesome demo. We just want to invite everyone: with everything you've seen today in the demo and everything you've heard from Ilia and Anfisa, we really encourage you to now go out and try Prophecy. We have a free trial available at prophecy.io, so please check that out. We'll also be sending a follow-up email that includes the link. So please give it a try and let us know if you have any questions; we'll be more than happy to answer them. Speaking of questions, let's go ahead and jump into the Q&A. We have had a lively Q&A panel, and we definitely appreciate that. We'll do our best to answer these live. We know we won't be able to get to everything, but again, we will be following up soon and would love to take a deeper dive with you all. First question for Ilia: how does your team manage the dependencies between data pipelines? Can you elaborate a little bit on that?
Ilia Fishbein: Sure. It depends on which dependencies we mean. If it's code or library dependencies, you can set those up in Prophecy itself in the UI. If it's dependencies from a timing perspective, where you need to have one job run and then another job run, we at the moment use Airflow, but Prophecy also has tooling within its low-code scheduling that allows you to do that.
Ashley Blalock: Great. Thank you, Ilia. And Anfisa, we have a question: how was the data model part handled during implementation? Can you elaborate a little bit more on that?
Anfisa Kaydak: Well, at least I don't consider Prophecy to be a data modeling tool. We already had a data model, and it was basically schema-on-read for Prophecy. We were defining our input as external tables, and the output we created was also a predefined model. At least this is how I see it. I'm not sure if this is on Prophecy's roadmap, but it is not a data modeling tool.
Ashley Blalock: Thank you for answering that. Next question, for Maciej: how is a pipeline monitored in Prophecy? Is observability out of the box? Can you elaborate there?
Maciej Szpakowski: Oh yeah, I love that question. If we had more time for the demo, I would love to have shown you some of this. Prophecy does have really good observability capabilities. Whenever you're building a pipeline, or even after you have deployed it, you can track all of your historical pipeline releases. For each one of those, you can see when the pipeline ran and how much data it processed, and at every single stage you can even see those interim previews I was showing you. So all of that is very much supported, and it gives you that runtime-based observability. A little more on the static side, Prophecy has support for lineage as well, so you can see not just one particular pipeline and one particular schedule, but all the pipelines it depends on and all the pipelines that might use data from it. You can explore your whole project, or your whole system, step by step.
Ashley Blalock: Great. Thank you, Maciej. This question is for Anfisa, and Ilia, feel free to elaborate too: how technical does a user of Prophecy need to be? What should their ideal technical skills look like?
Anfisa Kaydak: As I mentioned before, we did involve a data engineer from the beginning, but I think that was just for the learning curve. Prophecy doesn't require you to possess data engineering skills. It's designed for data operations people, data architects, people who can create pipelines without knowledge of Spark or even SQL. It's just an interface. Of course, you need to understand the concepts of databases, types of tables, and basic data flows and transformations. I'm not sure what to call this position; maybe data operations engineer, data ops, product ops.
Ashley Blalock: Thanks so much, Anfisa. We are right at time, so I'm going to go with one final question. Again, there are so many great questions here, and we'll be following up to answer the rest very soon. For Maciej: how is Prophecy's solution different from what dbt might offer?
Maciej Szpakowski: That's a really good question. Prophecy is a low-code solution, which means it allows you to visually build your pipelines with little to no coding knowledge, and if you're already a programmer, of course, it makes you even more productive. dbt is a little bit more of a build system. It still requires you to write SQL code, it requires you to be an expert at writing that SQL code, and you perform all of those different CI/CD operations completely by yourself. So it's not a low-code product per se. If you want to learn more about the differentiation, by the way, definitely hit us up. Let's have a chat; we'd love to share something we are cooking up as well. Really exciting stuff.
Ashley Blalock: Awesome. Thanks so much, Maciej. Thank you all so much for joining our webinar today. We really hope you've enjoyed the session, and we want to invite you to continue your learning journey with Prophecy and our low-code approach. We have another webinar coming up on April sixth (Maciej, if you'd like to show that slide) with the Texas Rangers. The session is on Moneyball: how the Texas Rangers use low-code data engineering and analytics to identify MVPs. We're really excited about this session and about hosting it with our partners at Databricks. The link to register, I believe, is in the chat, so we'll give everyone a moment to grab that. We also will be following up with the recording from this session, a link to our free trial, and, again, details on the webinar happening on April sixth. We hope you can join us. All right. Thank you. Yeah, thank you, Kevin. Thank you all again for joining, and thank you to our wonderful speakers. We loved having you here today, and we will be following up soon with all of our attendees.
Ashley Blalock: Have a great day, everyone.
Ashley Blalock: Thank you so much, everyone. Take care.