On-Demand Webinar

Moneyball: How the Texas Rangers use low-code data engineering and analytics to identify MVPs

The sports industry has seen a significant rise in the adoption of data and analytics to gain a competitive advantage and improve the fan experience. The Texas Rangers are at the forefront of this movement, as they analyze terabytes of in-game data to optimize player performance and scouting. Join this webinar to learn how the Rangers’ data team overcame the challenges of technical resource constraints and the complexities of scaling real-time data pipelines with Prophecy’s low-code data engineering platform.

  • How the Rangers use data to identify and evaluate potential players
  • Why they chose the Prophecy low-code data platform as the data engineering foundation on their Databricks lakehouse
  • Best practices and tips for becoming a high-performance data engineering team

Transcript

Ashleigh

We have Alexander Booth, who is the Assistant Director of Research and Development of Baseball Operations at the Texas Rangers. Alexander is entering his sixth season with the club, and in his current role is working to revolutionize his data teams with a modernized data strategy. In prior roles, Alexander has specialized in machine learning, engineering, and cloud development for data science. Alexander holds a Master's degree in Data Science from Northwestern University and is also a proud graduate of Washington University in St. Louis. We're really excited to have him here. We're equally excited to have Franco Patano, who is a lead product specialist at Databricks, where he brings over 12 years of industry experience in data engineering and analytics. He has architected, managed, and analyzed data applications utilizing SQL, Python, Scala, Java, and Apache Spark, as well as experimenting with data science. Next, we have Mei Long, who is a product manager at Prophecy. She is on a mission to make data accessible, usable, and manageable in the cloud.

Ashleigh

Previously, Mei has held instrumental roles working with teams that have contributed to the Apache Hadoop, Spark, Zeppelin, Kafka, and Kubernetes projects. We're really excited to have all these speakers here, so go ahead and join me in welcoming them. And without further ado, I'd like to go ahead and hand it off to Alexander to come on stage and kick this off.

Alexander Booth

Awesome. Good morning, everyone. Happy to be here. As they always say, I know how to make a machine learning model, but no idea how to share screens in Zoom. So let's see if I can get this working. Let's go there. Let's go here. Ready. Moneyball: how the Texas Rangers use low-code data engineering and analytics to identify MVPs. Love it. Thanks for joining us today. Again, I'm Alexander Booth, Assistant Director of R&D with the Texas Rangers. It's an exciting season for us. It's an exciting season in baseball. There's a lot of revolutionary stuff happening in baseball this year with rule changes and increased pace of play and the Texas Rangers being good at baseball again. You never know. Anyway, I am here to talk to you guys about big data in baseball and its comparisons and differences with the rest of the industry. Before the Rangers, I did work in industry, so I have seen some of these problems before. Understanding how Databricks, Prophecy, and modern data strategies can help us get that new competitive advantage on the field, by being one of the first clubs able to analyze and process this data, is going to be important to the team.

Alexander Booth

So a couple of big problems that the Rangers experienced over the last few years. Big data, big problems. There's been an explosion of new data sources since 2020 in our industry. And I'll go into a couple of what those are because, again, I'm not expecting you all to necessarily have a huge background in baseball. These new data sources require big data transformation. And we are a small shop; we're almost like a tech startup inside of the Texas Rangers. So we have a very small engineering team, a lean engineering team, and very limited knowledge of Spark. Contrast that with another problem that, again, is pretty standard in industry. Baseball is a team sport, so you'd think that would mean our data teams have some centralized communication with each other. Unfortunately, our siloed data teams resulted in scattered data products. Nobody knew where certain model outputs were. Nobody knew where different KPIs existed. We had too many databases. We had too many people altering those databases. When you also have siloed data teams like this, one other big problem that comes around is turnover. So let's say some people leave a data team and they don't document what they've done.

Alexander Booth

This can result in duplicative work being created, and this can also allow for lost work. We've lost KPIs and models after we've had some turnover. So the decentralized governance resulted in a lack of standardization. Some models took in already one-hot encoded inputs, some models did not; there was no centralized communication, no documentation. So one thing that I really wanted to change about all of this is to increase monitoring and code reviews around our ML and analytical pipelines. So we'll cover both of these: big data as well as siloed data teams as a problem. We'll start with data. This is all the fun stuff, right? I mean, what data is actually generated in baseball? So Moneyball is the big keyword, and it's actually permeated into the data industry now as that catalyst for making data-driven decisions. So your two-second recap: Moneyball, Billy Beane, Oakland A's. The Oakland A's are a small-market team. They don't make a lot of money. So to get a competitive advantage, they needed to understand market inefficiencies, and they identified those via data. And it revolutionized this whole idea of getting a competitive advantage through data-driven decisions.

Alexander Booth

So 2006 is when new data started appearing in baseball. Prior to 2006, all data that was recorded was pretty tabular. So this meant things like home run counts, RBI counts, hit counts, and you can do all that in a spreadsheet, essentially. You just total it all up, do some averages, make a little pivot table, call it an analysis. But starting in 2006, we started using PITCHf/x, which was a ball tracking system that allowed us to capture spin rates of the baseball as well as the velocity and movement of pitches. And this led to the revolution in 2015 with the debut of Statcast. If you've ever watched an MLB broadcast, you'll likely hear the word Statcast dropped multiple times now. Statcast is a radar and HD video system that measures all action on the field for every single pitch. So not only are we tracking spin rates, velocity, movement, but we're also tracking things like player position. We're tracking every step that anyone on the baseball field makes throughout the entire game. And now this also has led into, as we'll get to in 2021, some more advanced tracking systems as well. So over the next couple of years, they flip-flopped between providers.

Alexander Booth

So we went to TrackMan, which is also used in golf. And then we switched to Hawk-Eye, which is also used in tennis. If you've ever seen the line reviews in tennis, those are tracked using Hawk-Eye technology. So this switch to Hawk-Eye has allowed us to increase what we're able to track. This includes things like pose tracking and field vision. So we're able to now track skeletons of the body. We're able to track things like your elbow, head, shoulders, knees, and toes. And I've got a couple of slides there where you can actually see that in action. This has led to a new field in baseball called biomechanics, where we can strap people in and look at their skeletons and movement and understand if there are any inefficiencies in how their bodies move. And that, of course, can lead to things like injury prevention or understanding fatigue, etc. And then the last note that I wanted to put here: typically when people think of Major League Baseball, you go, oh, it's only 30 baseball teams, how much data could that actually be? But baseball is way beyond Major League Baseball. So first I'll touch on the minor leagues here. To get into Major League Baseball, a prospect typically has to rise through the ranks, going through levels such as Single-A, Double-A, Triple-A, before finally making it to the show.

Alexander Booth

We want to be able to understand those minor leaguers. Those are the future of the game. A couple of big prospects came up that way for the Orioles; Gunnar Henderson and Grayson Rodriguez played against us yesterday. So we want to understand how these players perform at the minor league level to be able to understand how to be effective against them in the major leagues. And so Statcast is now moving its way through the minor leagues as well. So now we're going to be tracking these skeletons at every game in Triple-A as well as the major leagues. So just a couple of diagrams here. At every baseball game you go to, even if you don't know it, if you look around, you may actually see some of these cameras. They're all installed around the top of the stadium. On the right, you can see that it's like this big black box. There are at least 12 cameras installed at every stadium, and some teams have opted in to install more. Again, these are tracking the action at up to 300 frames per second, which we then need to analyze to be able to understand things like player movements and biomechanics in motion.

Alexander Booth

Biomechanics, I love this stuff. I think the skeleton tracking is amazing and is a revolution in the game. They've also now rolled out skeleton pose tracking in other sports like American football and basketball. So with skeleton tracking at the major league level, we're able to understand exactly how a pitcher moves when he throws the ball. We're also able to understand exactly how a batter swings. I played Little League when I was growing up. I also played some in high school as well. That's not good enough at all to be on the field. But one of the things I remember all the coaches saying is to open up your hips more, take a longer stride, and bend your knees as you're getting ready to catch the ball. We're now able to quantify that. I can tell you exactly how open your hips are, how long your stride is, how big your knee bend is as you're waiting to field the ball. And that is a revolution in the game in and of itself. But of course, the data is only good if we can analyze it, which will lead us to our big problem.

Alexander Booth

A couple more crazy data sources in baseball that, again, not a lot of people may recognize. Spin: how a baseball moves through the air. Before, TrackMan and PITCHf/x did more of an inferred model for this. But now, because of these high-speed cameras, we can actually observe the spin of the baseball. So Jacob deGrom throwing a 100-mile-an-hour fastball: we can actually tell you exactly how the baseball spins in flight, as well as exactly how it moves, in an observed way. We can also see exactly how the ball is moving off of his fingers and where on the seams he's holding the ball. And again, this is huge data at a very high frame rate, which leads to problems with processing. This is new for 2023: the big one that Major League Baseball has announced is this new concept of weather tracking. And we'll get into the LiDAR scans in a second, too. The idea here is that if we're tracking the weather at every stadium, at I think it's around every five seconds, we can understand things like how much effect temperature and wind have on what happens on the field. So for example, we all know that if I hit a home run into the wind, or a fly ball into the wind, the drag and the wind pushing back will likely turn that into a fly out and not a home run.

Alexander Booth

But if the wind is at the back of the ball and the wind is pushing the ball, then typically that will become a home run. That's independent of player talent. If we took a player that was lucky in their home runs because they always hit it into the wind (and at the Texas Rangers, we have an indoor dome stadium), if we stick them in Arlington, are they going to still be as effective when there's not necessarily that much wind pushing balls out into right field? Another big concept here, too, is LiDAR scans. Baseball is unique amongst almost all sports in that the stadiums in which the game is played are not uniform. Every stadium is unique, and that causes its own dynamics and strategy in how baseball is played at these stadiums. And also, teams change their stadium dimensions every season. For example, going into this season, both the Tigers and the Blue Jays have changed their outfield dimensions. And that, of course, plays differently. There could be more home runs. There could be more doubles and triples. The run environment, as I'll call it, has changed. A big example of this happened in 2021, I think, when the Orioles actually moved back their left field fence.

Alexander Booth

And they used to have a bunch of hitters that were really good at hitting home runs to left field. You move the fence back, it's now harder to hit home runs. And all of a sudden, Camden Yards went from being almost like a hitter's park, with lots of home runs to left field, to more of a pitcher's park in how it plays. And of course, this affects player performance and identifying MVPs, to bring it back to the title of the talk. So this is a smattering of various technologies. And again, one of the big problems is that more and more of these technology companies and startups are coming to fruition every year, and they're all trying to sell us their products. They're all trying to expand their tracking systems. One thing I also haven't really touched on is amateur and international. We now have data from a bunch of colleges and high schools. The World Baseball Classic happened recently as well, which is amazing. We now get data from Japan, Korea, the Czech Republic, the Dominican Republic, Puerto Rico. How do we now quantify and aggregate this large data at scale across all of these different vendors?

Alexander Booth

And I really quickly like this slide as well. Again, not a lot of people may understand exactly how baseball operations data teams are broken down. While we do, of course, have things like accounting, finance, and HR at the larger organizational scale, in my specific role in baseball operations our data teams are spread across these more specific areas of baseball. So I just touched on amateur analytics. We have a whole staff devoted to analyzing the amateur game, and of course this preps us for the draft. Advance scouting: we have analysts devoted to the major league level. How do we advance scout the team that we're playing tonight? We're playing the Cubs this weekend. So how are we going to get the competitive edge against the Cubs at Wrigley? How are we going to define our in-game strategy? Player development optimizes our minor league player performance. How can we make sure that we're getting the most out of our prospects, giving them the best chance possible to make the major leagues? Pro analytics is more around the trade deadline: what is happening at the pro level that can maybe inform player development and advance scouting.

Alexander Booth

There are always a couple of players that are huge surprises, and we need to be able to understand exactly what they're doing to be effective, even if they belong to other clubs. And of course, international analytics, which came up a lot with the World Baseball Classic. How can we find these international players that may revolutionize the game if we're able to bring them onto our team? And this is all supplemented with baseball systems: our data analytics, our application and data engineering teams. How do we provide the data effectively for these teams? And then the newer team, sports science, which is training and conditioning, biomechanics, reaction times, all that good stuff as well. So we've now covered the problem. We have the large amount of data. We have the siloed data teams. We have this lean technology team as well that has to support all of it. So how do we build a modern data strategy off of this? A couple of our core ideologies that we want to roll out: we want to migrate from a legacy two-tier warehouse system to a modernized data lakehouse, and we'll get to more of a data mesh situation as well on Databricks.

Alexander Booth

We want to create a self-serve data mesh for transparent data availability, and we want to federate governance while still keeping responsibility at the edge. So I want my data producers to still be able to create their pipelines and create their analyses, but with a little bit of federation to that. And that means there will be some more centralized requirements to keep the data products in order, to maintain integrity, to make sure we're all working in the same location. We all want to be able to manage and monitor and see exactly what everyone is doing. That communication across teams can provide a competitive advantage. So a couple of core data tenets that I threw out there. We want to build an analytical ecosystem that scales as new data sources are brought to market. We want to provide choice in analytical technologies. I want my analysts to be able to use R, Python, SQL, AutoML, Tableau, Power BI, whatever they want, to solve the problem they're currently working on. We want to allow for autonomy and agility across my disparate data teams while avoiding bottlenecks in bringing productionalized data products across our highly distributed team.

Alexander Booth

Again, our minor league teams are also situated in North Carolina and Arizona. I have members of my team across the entire country, and I need all of them to be able to work together to produce these data products at speed, in a way that is still productionalized and can be monitored by the overall data ecosystem. And of course, I also want to encourage communication and collaboration. Just like in the game of baseball, we only succeed as a team. And a little equation here from our world of analysts: mindset plus people plus process, times technology. We have to have a good mindset, that collaborative, innovative mindset to change. We have to have strong people with strong talent and data science skills. We have to have good processes in place. And then the only way that can actually succeed at scale is through the right technology, which is where we'll get into Prophecy and Databricks. We'll go a little bit quick on this stuff. This, again, is a classical problem that a lot of modern data teams are running into, even outside of baseball. This is the limitation of just having a two-tier architecture of a data lake and a data warehouse as separate entities.

Alexander Booth

Increased maintenance efforts: since compute and storage are not separated, it's very expensive and you're not going to be able to scale as effectively given those cost constraints. Sometimes you're locked into proprietary data formats and proprietary data transformations in licensed software. And that's really not great for performing large analytical queries and making them reproducible, and there's difficulty adapting to a large variety of data sources. Not only are we getting text data and video data, but now we're getting so many of these streaming data sources and IoT sensors, as well as, of course, the classical CSVs, JSONs, Parquets, etc. The lack of data transparency across our warehouse had become a bottleneck for the development cycle of our analytical models. We don't know where data is. Sometimes the data is not clean enough. Sometimes data is not processed enough, and we're not lean and agile enough to make those changes at speed to allow our analysts to create new data products quickly. This is, again, what we're trying to get to with a modern data community. I've mentioned these terms before, data producers and data consumers, and most of the time your data consumers are also data producers.

Alexander Booth

How can we get everyone to work together in this holistic lakehouse approach? So our idea here is to build this almost-data-mesh where our sub-teams, our amateur analytics, our pro analytics, our international analytics, our minor league analytics, can all manage their own data sources. They can all manage the models built off of those data sources individually, that responsibility being pushed to the edge. However, there's still that centrality of it all existing in the same workspace, the same lakehouse underpinning the entire data modality. And that is something that we really think can optimize the velocity of our analytics outputs. So again, the obstacles: small shop, lean engineering team, nobody knows how to do anything in Spark, divided data teams all over the place, and these legacy systems. We've got to keep these warehouses, these on-prem systems, working while we migrate to a modern data strategy. And so where does Prophecy come in? I should probably mention Prophecy and Databricks. We have some awesome people on the call with me here from both of these technology companies. So we decided to move to Databricks as our lakehouse solution, this new modernized lakehouse approach versus the two-tier data lake and data warehouse solution.

Alexander Booth

The problem, though, with using Databricks is you have to write code yourself. And if I want to do big data transformation, Spark is the gold standard to do that at scale across large clusters of computers. And while Databricks allows that compute to be built and maintained so we don't have to worry about the infrastructure, we still have to write the code ourselves. And if you've ever tried to hire a Spark developer, you may also know that those are hard to come by, especially at baseball prices. So how can we either upskill our team or hire externally? Neither of those seemed to provide a quick enough solution for doing these large big data transformations. So Prophecy is the technology solution we identified, almost like a user interface onto data transformations on Databricks. No upskilling is needed to perform these big data transformations. We simply connect to our cluster on Databricks and drag and drop these transformational gems to create a Spark pipeline. We're able to create custom and reusable transformations across our domain. So someone on our amateur analytics team can create a custom transformation, maybe something as simple as flattening a JSON blob in a more specific way.

Alexander Booth

And we can reuse that transformation across a player development pipeline, for instance. And now, with both data engineers and analytical engineers, as our data analysts are also engineering their own pipelines, everyone is able to make production pipelines with little to no Spark knowledge, which is really important to us. Again, we wanted to keep that responsibility at the edge. If someone on the player development team wants to create a data engineering pipeline, I want to support that, but I want that pipeline to be, in turn, deployed into a centralized ecosystem where we all have visibility into what it's doing. And Prophecy provides a solution for that. We are all connected to GitHub through Prophecy. Every pipeline that we create, even with a drag-and-drop interface, generates code behind the scenes; it generates PySpark code, it generates Scala code, and all of that is committed directly to GitHub. Further, you can toggle between the gem view and the code view, and that's huge for us. We want to be able to really understand what Spark transformations are happening, especially as we're all still upskilling in our Spark backgrounds. There are no licensed or black-box proprietary pipelines or connections.

Alexander Booth

Everything is, well, not exactly open source, but everything is committed to GitHub and it's transparent in the code that's actually generated. This allows for ease of integration with CI/CD platforms as well as for quality assurance testing. It provides that extra layer of quality assurance to make sure that a pipeline created at the edge is productionalized effectively. And further, everything is clear with the orchestration and the lineage. And as we'll get into, it also integrates well with other open-source tools. We've also used Airflow as an orchestration tool, which works well with both Prophecy and Databricks. So just some quick numbers here on the immediate impact that we've seen. Three times more analysts and developers are able to create production-ready pipelines. It's not just my lean engineering team that has to be responsible for creating all of the data flows. Our analysts are now able to create effective data flows as well in a sustainable, reliable, extensible, productionalized way. Seven times the velocity in producing pipelines: from over a week per pipeline per developer, now we can roll out a rough pipeline in less than a day, which is awesome. This obviously creates huge value and speed for stakeholders.
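As a rough illustration of the orchestration Alexander mentions, here is a minimal Airflow sketch (in Python) that triggers an existing Databricks job on a daily schedule. The DAG name, connection ID, and job ID are hypothetical placeholders, not the Rangers' actual configuration.

# A minimal sketch: an Airflow DAG that triggers a Databricks job running a
# Prophecy-generated Spark pipeline once a day. All identifiers are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator

with DAG(
    dag_id="statcast_ingest_pipeline",            # hypothetical pipeline name
    start_date=datetime(2023, 4, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_pipeline = DatabricksRunNowOperator(
        task_id="run_prophecy_pipeline",
        databricks_conn_id="databricks_default",  # Airflow connection to the workspace
        job_id=12345,                             # placeholder Databricks job ID
    )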

Alexander Booth

Instead of having to wait for your two-week sprint to be over and only rolling out a change every 10 days, with continuous integration we're able to roll out these new pipelines immediately, which means we're able to roll out new analytical metrics, new KPIs that we're analyzing, very quickly into our lakehouse, and then visualize them in reports to our stakeholders. And it all just works, which is awesome. I love when things just work and I don't have to worry about it. So like I said, the ease of integration of Databricks with Apache Airflow and Apache Spark, these open-source technologies that allow for transparent data pipelines and monitoring, has been huge for us. We've been able to grow our fledgling data mesh by millions of records, backfilling historical seasons and already ingesting the thousands of pitches that have been thrown in 2023. We're fostering a centralized location where we can communicate as a team about pipelines, that centralized workspace on Databricks and on Prophecy, while maintaining responsibility for creation with the data producers that work at the edge. Finally, we're able to put our data in the hands of consumers more reliably and efficiently than ever before.

Alexander Booth

And of course, by putting data in the hands of consumers immediately after a game, we're able to quickly make changes. We're able to be agile in our data strategy. We're also able to be effective in understanding which players are changing their game right now. Which players should we target for acquisition? Which players should we focus on in player development in the minor leagues? And of course, this allows us to identify MVPs more effectively and hopefully build a sustainable, quality data ecosystem for our next competitive window, which is open right now. Jacob deGrom. Of course, I'd be remiss to go through the full talk and not mention machine learning or how AI is impacting baseball as well, so I'm going to tack this little slide on at the end. And again, this also goes hand in hand with Prophecy and Databricks. We want to have a centralized machine learning registry where we can check which models are being created and for what reason. We can have these dynamic machine learning operations, and all of that data flow can be created with the Prophecy GUI. So this quick graph: this is a hit probability chart.

Alexander Booth

So this is just a classification model that predicts the probability of a batted ball dropping for a hit. And it has some features in it, but here are some of the main ones: launch angle and launch speed. This is the angle at which the ball comes off the bat and how hard the ball was hit. So there are two main areas of red. Red is good; red means it's likely a hit. We have the big blob at the right, and those are actually going to be home runs. You hit the ball 100 miles an hour, you hit it at 25 degrees, it'll leave the ballpark no matter where you are. But then we have this interesting, almost swoosh-like region in the middle between 60 miles an hour and 100 miles an hour. If you hit the ball, no matter how hard, between around 20 to 35 degrees, it'll likely go over the infielders but land in front of the outfielders. And that, of course, will likely become a hit. And that'll be an increase in your on-base percentage. And this has led to what's been called the launch angle revolution in baseball, where this is even being taught to kids in Little League now.
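To make the idea concrete, here is a minimal sketch of a hit-probability classifier of the kind described, using launch angle and exit velocity as features. The toy data, column names, and model choice are illustrative assumptions, not the Rangers' actual model.

# A minimal sketch of a hit-probability classifier on batted-ball data.
# The sample rows and feature names are made up for illustration.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

batted_balls = pd.DataFrame({
    "launch_angle":  [27.0,  5.0, 45.0, 22.0, 12.0, 30.0],   # degrees off the bat
    "exit_velocity": [104.0, 88.0, 70.0, 95.0, 60.0, 101.0], # miles per hour
    "is_hit":        [1,     0,    0,    1,    0,    1],     # did the ball drop for a hit?
})

features = ["launch_angle", "exit_velocity"]
model = GradientBoostingClassifier().fit(batted_balls[features], batted_balls["is_hit"])

# Estimated probability that a ball hit at 25 degrees and 100 mph drops for a hit.
new_ball = pd.DataFrame({"launch_angle": [25.0], "exit_velocity": [100.0]})
print(model.predict_proba(new_ball)[:, 1])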

Alexander Booth

You want to lift the ball in the air. No matter how hard you hit it, if you lift the ball into this sweet spot, it'll likely drop for a hit. And just to tie it all back together to Moneyball and Billy Beane: if you get on base, we win. And one way to get on base is to hit the ball at that sweet spot. And now we have an AI-driven model that has led to a new result on the field that still leads to more runs and hopefully to more wins for the team. I think I've already gone way over my scheduled time, so I think we'll stop here. But thanks for letting me talk to you guys about baseball. And I think we're going to go into a little round table discussion now as well. So I'm excited to hear what questions you all have for me in addition to everything else. Thank you.

Mei Long

Awesome. Thank you, Alexander. That was awesome. We have a huge list of great questions. Maybe we'll take a couple of them here. The number one question, I think, that a lot of folks have in mind is: how are the coaches and the players consuming the data or the reports that you're generating, and what's the dynamic there?

Alexander Booth

Yeah, absolutely. I've been using the more buzzwordy term of data consumers. In baseball, who are the data consumers? The data consumers are the coaches and players on the field, which is awesome. So a couple of quick examples. Before a game, we want to know how a pitcher may be trying to get our batter out. Every single pitch in baseball is almost like a chess game; it's almost like a chess move. You want to be strategic in where you're throwing pitches. You want to be strategic in whether or not you swing at a ball. So during games, and really after games and before games, we generate post-game reports. We want our players to understand whether or not they were lucky or unlucky. You always see in baseball, oh, this guy made an amazing home run robbery catch and pulled it back. So that's quantified as an out. But is it really an out, or did the fielder just get lucky in making that catch? So by being able to communicate and tell our hitter, hey, man, nine times out of 10 that's a home run, you just got robbed, it puts them back into a good mental state that what they're doing at the plate is still good.

Alexander Booth

But sometimes we notice things are wrong. We notice that, oh, hey, man, this pitcher is throwing you a lot of fastballs up and in, and you're just taking them. Those are always going to be counted as strikes, especially with this umpire. So maybe we should change our strategy going into the game tomorrow. Baseball is a long sport. Baseball is a marathon: 162 games over six months. So being able to quickly adapt and change our strategy immediately after a game going into the next one provides a competitive advantage for the players to be effective on the field. Got it.

Mei Long

That makes a lot of sense. And what really resonated with me is how these folks work with the data engineers and the analysts and the machine learning folks, all working together to provide all of the insights and strategies. So one of the things I always wonder, because we so often see in our data field that there's a disconnect between the analysts, the users, and the engineers, etc.: do you feel like this low-code approach puts you in a place where all of the folks can work together very easily and onboard these data products more easily?

Alexander Booth

Yeah, definitely. One thing that we used to do, back before we adopted this modern data strategy, and which, again, I feel like a lot of other companies in every other industry are experiencing too: you have an engineering team, you have your analytics team. And if the analysts need some data, they have to ask the data team, the engineering team, to transform it and provide it and load it before they can do any work. And sometimes that requires a two-week sprint, and maybe it falls to the bottom of the priority list. And especially with my team, where we've been running a lean engineering team for a long time, there wasn't necessarily that rapid velocity in generating those transformations. And so, again, by pushing this responsibility to the edge, where these analysts can now also do some of the engineering work themselves, as long as it's moderated and centralized into this larger lakehouse ecosystem, they're able to create that. And again, one of the limitations was that lack of Spark knowledge, that lack of effective and reproducible pipelines. So being able to use a low-code approach in a centralized workspace really allowed our analysts to create these production-ready pipelines with just a minimal code review by our engineering team.

Alexander Booth

And this, of course, has led to a lot of parallel development to get a lot more of these data sources and transformations in front of our executives, in front of our players and coaches at a much faster rate. Awesome.

Mei Long

Yeah, that's good to know. So, Franco, do you have any questions for Alexander? Absolutely.

Franco Patano

I've been watching these questions come in, and obviously I had questions for you as well. So one thing that I think everyone's asking, and we talked about this with fantasy sports data, is that a lot of people use data to make their decisions all the time. Do you have any plans to make this data, any of this data, externally available? Because I imagine some people might want to integrate with this for their fantasy sports decisions.

Alexander Booth

Most baseball data is actually made public. You can go to a website called Baseball Savant. You can actually dig around in the Major League Baseball data as well. There are APIs and libraries in both Python and R: there's a package called pybaseball, and there's a package called baseballr. Both of those will bring in this pitch-by-pitch data so you can do your own analysis. And this has led to a very healthy ecosystem called sabermetrics. Sabermetrics is the public interface into baseball analysis. I would like to say that we stand on the shoulders of these giants, these sabermetricians that came before us. There's a lot of great research that happens in the public sphere that then gets adapted in the private club sphere. It's also really interesting, though, because sometimes the private models that we build on the club side also leak out into the public sphere. One recent example of this has been Stuff+ and Pitching+, which are metrics used to quantify how effective or how good a pitch is by things like movement and speed and spin, as opposed to the results of the plate appearance.

Alexander Booth

And we've had an internal version of this model for quite a few years now. However, maybe last year, maybe the year before, it went really viral in the public sabermetrics sphere. So it's really interesting to see that dynamic: how the data created by the clubs can go to the public, and how models created by the public find their way back into the clubs. There are plenty of public conferences out there specifically for baseball data where both clubs and the public alike talk baseball and talk research. And it's really going to push the game forward in how it's viewed through a data-driven lens.
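For readers who want to try this themselves, here is a minimal sketch of pulling public pitch-by-pitch Statcast data with the open-source pybaseball package mentioned above; the date range and column selection are arbitrary examples.

# A minimal sketch: pull public Statcast pitch-by-pitch data with pybaseball.
from pybaseball import statcast

# Every tracked pitch across MLB for a two-day window (returned as a pandas DataFrame).
pitches = statcast(start_dt="2023-04-01", end_dt="2023-04-02")

# A quick look at spin rate and velocity, two of the measurements discussed earlier.
print(pitches[["pitcher", "pitch_type", "release_speed", "release_spin_rate"]].head())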

Franco Patano

That's awesome. I imagine that open models and open technologies help you out more than proprietary ones. And with that in mind, you came from on-prem data warehousing and you were going to the cloud, so you were probably evaluating cloud data warehouses as well as the lakehouse. Can you tell us about that and why you chose the lakehouse over other cloud data warehouses?

Alexander Booth

Yeah, definitely. When we first started experiencing this big data revolution in 2018, 2019, we started adapting our technology from an on-prem stack to a warehouse. Again, it goes hand in hand with a lot of other experiences I've had, and that people in industries other than baseball have had. You have this giant on-prem Oracle database, giant SQL Server database, and it doesn't scale. It's expensive. So you go do some stuff with a classical data warehousing tool in the cloud. We thought that was the solution, because the lakehouse really wasn't a modality that we had heard of or explored back in 2019. But we very quickly ran into the limitations of a warehouse, of that two-tier system: that very expensive startup cost, that lack of flexibility with different data types, that lack of data transparency in what we wanted to be able to serve to the data marts. And so we needed something different. If we were investing the money into something, we needed that investment to scale as well with all these new data sources. We needed to be investing in streaming data, IoT data, the weather data, the pose tracking data, and all of these different data sources and videos, and mapping everything together.

Alexander Booth

So the warehouse was not cutting it in terms of the cost-benefit we were getting for the ability to scale. So naturally, going to a lakehouse solution, being able to separate compute and storage, keeping all the storage for the individual files and using the compute on the lakehouse side to do those big data transformations, became a natural fit for us. And I think that the data lakehouse has only been growing in popularity over the last couple of years, and I expect it to continue increasing in popularity as well for a lot of that flexibility. My last note on that, too: being able to do analytics and machine learning where your data is, is huge. It's enormous for velocity and optimization and efficiency. My analysts don't necessarily have to be working on these disparate computers, their local RStudio environments, their local Jupyter notebooks; we now have a more centralized workspace. And since that's also where the data is, it's also quicker to build these models on top of it.

Franco Patano

It's awesome.

Mei Long

We have a lot of great questions. There's a question about demos and such; we're going to have a demo here in a little bit. Also, we're going to address more of your questions at the end of our talk. I think, Franco, I would love to get a sense, from your perspective, from the low-code environment, from Databricks, how do you view this architecture? Absolutely.

Franco Patano

Thanks, Mei. Alexander, that was an amazing talk. Seeing the questions and the commentary rolling in as you were talking, obviously everyone is amazed at all the data that you're using. And you hit on a few very important issues that customers have with Databricks. You have to be a code-first engineer to leverage the power and the simplicity behind the platform. And with Prophecy, which I actually consider one of our best partners, they essentially put a graphical, simple-to-use interface on top of the lakehouse for data engineering. And they vastly simplify the data engineering tasks that these analysts, or these analytics engineers, have to do in order to get their job done. And so they're not hardcore engineers. They are analysts that know the business; they know how things work, they know processes, they can conceptualize different things, but they're not code engineers. Prophecy is great. It basically has this gem type of architecture where you find the gem you're looking for, very similar to other graphical nodes-on-a-plane ETL software, and you can ingest all types of data with it and then do all the transformations that you need. It also has built-in Git support for CI/CD.

Franco Patano

So essentially it has this great integration with Databricks where we can pick up that repo and process it as regularly scheduled jobs. And then Prophecy can be used to do all that development very simply for your analytics engineers. And Prophecy basically integrates with Delta Lake. You can process all of your data on the open Delta Lake, and then it's available for data science and machine learning. Like Alexander described, he ingests all of this data that comes from all over the place. He takes that data, ingests it in, and then they offer that data up to their analytics engineers to develop data science and machine learning algorithms. They can use MLflow to track all their models, and you can even use MLflow to do real-time serving. At the same time, because the lakehouse is the best of both worlds, the best of the data lake and the best of the warehouse, you can do all of these things on the same copy of the data. And so DBSQL, or Databricks SQL, is our warehouse offering, and that's basically the low-latency, high-concurrency warehouse serving layer of the lakehouse. And then all of this data has been curated through Bronze, Silver, and Gold, or what can be called the raw, staging, and presentation layers.

Franco Patano

It's all the same concepts, just named a little bit differently. Now all of that data is available for enterprise reporting and BI. And this is essentially how customers can leverage the best of Databricks and Prophecy. And how they integrate: essentially you have that graphical nodes-on-a-plane visual editor. And that visual editor, the claim to fame for Prophecy, one of the things I think is the best part, is that you can transpile or flip between the visual editor and the code editor, which gives you this great ability to know what's going on behind the scenes. Alexander called it: we don't like black box tools, tools that don't let you know what they're doing. This is what we call glass box tools. You can see exactly what's going on behind the scenes with the code, and that code is checked in. Then all lineage is tracked, so anything that you need to do to find where that data was coming from or how it was transformed is all documented and available for exploration. Then even if you're coming from another system, so if you're migrating from on-prem, much like Alexander said the Rangers were doing, we can help you rapidly accelerate the migration of your ETL workloads from your on-premise stack.
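To make the Bronze/Silver/Gold idea concrete, here is a minimal PySpark sketch of that flow on Delta Lake. The paths, table names, and columns are illustrative assumptions, not taken from the Rangers' environment or from Prophecy's generated code.

# A minimal sketch of the Bronze/Silver/Gold (raw/staging/presentation) flow on Delta Lake.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Bronze: land the raw feed as-is.
raw = spark.read.json("/mnt/raw/pitch_tracking/")            # hypothetical source path
raw.write.format("delta").mode("append").save("/mnt/bronze/pitches")

# Silver: deduplicate and conform the data.
silver = (
    spark.read.format("delta").load("/mnt/bronze/pitches")
    .dropDuplicates(["pitch_id"])
    .withColumn("game_date", F.to_date("game_date"))
)
silver.write.format("delta").mode("overwrite").save("/mnt/silver/pitches")

# Gold: aggregate into a presentation-ready table for BI.
gold = silver.groupBy("pitcher_id").agg(F.avg("spin_rate").alias("avg_spin_rate"))
gold.write.format("delta").mode("overwrite").save("/mnt/gold/pitcher_spin")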

Franco Patano

So if you're using on-premise software to do all your processing, Prophecy has great integrations and tooling to migrate that code to Databricks Spark code so that you can execute it on Databricks. And essentially, Prophecy converts your workflows, your DAG of operations describing how your data is transformed, and Databricks becomes your new processing engine in the cloud. So essentially you get the best of both worlds: you get really, really cost-efficient, high-scale data processing, as well as giving your users the ability to do value-add tasks like data science and machine learning, and take care of all of your SQL reporting and BI. And with that, I'd like to bring Mei back to give us an excellent demo of Prophecy and Databricks. Thank you, Franco.

Mei Long

I love the glass box analogy. This is exactly what Prophecy can do: provide that glass box layer for you all so that you can get started very quickly on the lakehouse architecture, whether you have Spark experience or not. This is the bread and butter of our pipelines; what we call this is a gem drawer. Usually, our pipelines start with a source, and for this particular source that we're seeing today, let me just create a new one called orders. Sorry, folks, I don't have any baseball data with me today; we're going to go with a more traditional data set that's over here. I do have a nice orders data set here that is a CSV and has a header right here. We're going to parse this out. We're just going to parse this data really quickly here; it's on DBFS. Just as Alexander mentioned, there's a separation between compute and storage, so we can scale out very easily. And so now that we have the orders data here, we can set some quick properties by just inferring the schema. And after it infers the schema, as you can see, we want the first row to be the header.

Mei Long

The next step is to preview the data to make sure that we parsed the data properly, and this is pretty clean. I'm going to go ahead and create the data set and save it. So this is the first step of bringing some data in, and from here it's almost like Lego pieces and you can just start building on top of it. I have the customers data already configured to save some time here; basically it's the same exact thing. Now that we have customers, we're going to enrich the data sets together. Basically, the business question that we're trying to answer here is: per customer, how much money did they spend on orders, and also how long have they been a customer? To do that, we're going to need a join gem. We're going to join these data sets together, and we're going to join them on customer ID. From here, if you're familiar with SQL, I can see my input data here, so input zero and input one. I'm going to put in a join condition. We're going to join on customer ID, and tab will autocomplete customer ID really easily. Now we're doing a join here, and as far as expressions go, I'm going to pick a few of these columns that I'm going to work with downstream.

Mei Long

These are the columns I'm going to work with after the join, coming out of the join. This is where you can actually see whether you have any errors. If I have syntax errors, this will tell me, and also, if I have any unit tests, I can put them down here as well. And I can also just run it, because, as everybody might know, Spark is lazily evaluated. When I run it, I can see what's going on here. If I'm joining on customer ID, this is definitely something that I don't want to see any nulls in. These are little observations that it gives you as well. After the join, I'm going to do something really simple. What I'm going to do is just clean some stuff up. Clean up. The clean up ultimately is a reformat gem. This is pretty much what you would see in a select statement in SQL. For this select, I'm going to do some basic transformations. For example, I'm going to get rid of the first name and last name, and I'm going to do a full name and concatenate.

Mei Long

As you can see, a lot of these things are first name and last name together. As you can see, there's an error; that's because I have two parentheses here. For the account open date, I'm going to do a length of time; I'm going to do a datediff on this. All the Spark and Databricks functions are supported here. And for the diff, I'm going to use the current date and take the difference between the current date and the open date. Now I don't have any errors here; everything looks really good. I'm going to save it, and just do a quick run here to make sure things look pretty good. We have the full name, looks good. How long has it been since they opened the account? The total amount... well, the amount... actually, let's do a total amount and do an aggregation here just so that we can do some group by. Let's do the aggregation by order sum. So, aggregate these things together. What I can do is, for orders, I can do a count on the order. Aggregation is basically anything that requires a group by in SQL, if you're familiar with SQL. So basically, I'm going to count the order ID here, and the customer...

Mei Long

Actually, I'm going to do an amount and sum the amounts together. And for these two, I'm just going to bring them in as is. And then I'm going to group by customer ID, because I want to see the total based on the customer. So there we go. And let's run this just to make sure the data looks good. And this is running basically Spark jobs on Databricks. And so things are looking pretty good. I have all the counts. And again, I can observe the statistics if needed. At the end, what I'm going to do is write it out to a target. Usually with the downstream data, you're going to end up with a target. The target, what I'm going to do, is customers orders.

Alexander Booth

The target.

Mei Long

It can be a lot of different things, but for this particular example, I'm going to write it out to a Delta table. The location that I'm going to write it to is my test location, and let me call it customer orders. And then I'm going to replace this if it exists. Let me give it a name, right here, and save it. So now my pipeline is created. One of the things that folks mentioned is that they really like the visibility into the code. This is creating a lot of magic behind the scenes, and we're not hiding any of it from you all. Every gem that you create ultimately is pretty much just a data frame that you're going to work with right here. So every one of them is a data frame, and the code is very easy to read, so you don't have to worry about having to parse other people's code and try to understand it. All of this stuff is visual. And then at the end, I can go ahead and run it. And once I run it, this is just a Spark job in progress.
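For a sense of what the DataFrame code behind a pipeline like this can look like, here is a rough PySpark sketch of the same steps: read the CSVs, join on customer ID, reformat, aggregate, and write to Delta. The paths and column names are illustrative assumptions, not the exact code Prophecy emits.

# A rough sketch of the demo pipeline as plain PySpark; names and paths are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Source gems: parse the CSVs, treating the first row as a header and inferring the schema.
orders = spark.read.option("header", True).option("inferSchema", True).csv("dbfs:/demo/orders.csv")
customers = spark.read.option("header", True).option("inferSchema", True).csv("dbfs:/demo/customers.csv")

# Join gem: enrich orders with customer attributes on customer ID.
joined = orders.join(customers, on="customer_id", how="inner")

# Reformat gem: build a full name and compute how long the account has been open.
cleaned = (
    joined
    .withColumn("full_name", F.concat_ws(" ", "first_name", "last_name"))
    .withColumn("account_length_days", F.datediff(F.current_date(), "account_open_date"))
)

# Aggregate gem: per-customer order count and total spend.
per_customer = (
    cleaned.groupBy("customer_id", "full_name")
    .agg(F.count("order_id").alias("order_count"), F.sum("amount").alias("total_amount"))
)

# Target gem: write the result out as a Delta table, replacing it if it exists.
per_customer.write.format("delta").mode("overwrite").save("/mnt/demo/customer_orders")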

Mei Long

And basically, I'm just going to run this really quickly. This is what you will get at the end; it's what we just ran here. Once this is run, what we usually do is schedule it. We integrate with Databricks Workflows and/or any of your... this is super extensible. One of the things we want to make sure folks understand with UI tools is that we want to make sure this is super extensible, which means everything gets committed to GitHub. Every single project in Prophecy is a GitHub repo, and I can compare this against my branch right here, and I can go ahead and basically do a merge, commit the merge into main, and then release it to do CI/CD. There's a whole lot, like a huge world, behind how we manage everything in GitHub, and folks can integrate with whatever Git provider they may have. Also, there's data lineage, there's monitoring, there's workflow, there's metadata. There's a lot to go into, but we don't have enough time today; I just want to mention those. If you do have any interest in those, just let me know, because I think it's really important to understand that not only can we build these pipelines very easily, we can also share things very easily as well.

Mei Long

By sharing, I mean that if somebody already created a pipeline and it's already in a GitHub repo, like my repo, I can easily bring it in. Let me just put in a... let's just do a test. I'm just going to create a new project here. Every project is... sorry, let me do a different repo. Dev. Let me change this, maybe, to Scala. Right now we can do Scala or we can do Python, and I can hit continue. Right here is where I can integrate with my Git repo, and I can bring that Scala code directly into Prophecy. Right here, and I can hit complete, and every component that I make here can be shared among my team members very easily. When I open this project, all of a sudden I can see this amazing project that somebody had already built for me, in which they're doing the bronze layer, the silver layer, and the gold layer. For those of you that are familiar with the lakehouse architecture, this is basically what's being built, and I can drill down into each pipeline to see what's been built and modify on top of that.

Mei Long

These things are called subgraphs. For any of the gems that I use very frequently, like these two, for example, if I work with them very frequently, I can potentially just put them in a subgraph. Once I put them in here, they can be reused many different times by a lot of the folks out there without them having to understand what's inside. I can build some really complex and interesting stuff with this. So you import this and then you can create it. This is the master fork; I can create my own branch, modify it, and merge it back into main. And so this is all orchestrated and synced up with your GitHub. Everything is visible and there's nothing proprietary here, so a team can collaborate very easily. In addition to that, I just want to take one minute to give a quick preview of what's coming up with the new Prophecy: we are going to start having support for SQL. So this environment is going to have various models, and a model is really just a list of transformations that ends up as a select statement.

Mei Long

So instead of writing hundreds and hundreds of pages of SQL code, you can just use all the gems that we were talking about and start building out your pipeline, visualize your lineage, schedule your jobs, and link all of your jobs. All of those are there for you. So this is a new thing that's coming out. We have not officially announced it yet, but I just wanted to give you all a sneak preview of that. All right. That being said, I hope you all enjoyed the demo. I'm going to hand it back here. Ashleigh? Awesome. Thank you.

Ashleigh

So much, Mei, and fantastic demo. And speaking of demos, as we round out our session here, we do want to invite all of our attendees to continue learning about Prophecy and the low-code offering that we have for the data lakehouse with Databricks. So please go ahead and book a one-on-one demo with Prophecy. The first 10 people that book this demo, right here, right now, will win a Texas Rangers bobblehead, courtesy of Alexander and the team over at the Texas Rangers. So go ahead and book that demo. That link is in the chat. And again, this is a great opportunity to get a tailored demo for your use case; we'll sit you down with a technical expert and go through that. So again, a great opportunity for you guys to take home more swag. And then just going over to this next slide here, again, just echoing what Mei went over in her demo, the best way to really experience the power of Prophecy for Databricks, for the data lakehouse, is to begin with a free trial. So that link is also here in the chat.

Ashleigh

Again, no strings attached. Just hop in and enjoy. We will be reaching out to you just to see how we could make that trial experience even better for you. So we would love for you to have a demo and start a free trial. Again, both links are in the chat, and we'll also be sharing them in follow-up emails here very shortly, along with the recording of this session. And with that, everyone, please join me in thanking our awesome speakers for today's session. This was truly incredible. Thank you, Alexander, Franco, and Mei. We're so delighted to have you, and thank you all and have a great rest of your day. Thanks, everybody.

Speakers

Alexander Booth
Assistant Director of Research & Development
Franco Patano
Lead Product Specialist
Mei Long
Product Manager

Ready to start a free trial?

Visually built pipelines turn into 100% open-source Spark code (Python or Scala) → no vendor lock-in
Seamless integration with Databricks
Git integration, testing and CI/CD
Available on AWS, Azure, and GCP
Try it Free