Welcome to another post in our Azure Every Day mini-series covering Databricks. Are you just starting out with Databricks and need to learn how to upload a CSV? In this post I’ll show you how to upload and query a file in Databricks. For a more detailed, step-by-step view, check out my video at the end of the post. Let’s get started!
- First, be sure you have Databricks open and a cluster up and running.
- Go to the Data tab, click Add Data, then find and upload your file. In my case, I’m using a set of sample data containing people’s names, genders, birthdates, SSNs, and salaries.
- Once the file is uploaded, you can click Create Table in UI or Create Table in Notebook; I’ll use the latter for my demo. This automatically creates a new notebook with some of the code prepopulated. It shows exactly where the file is stored, and confirms that it detected my file as a CSV with a comma delimiter.
- It then asks me to attach the notebook to my running cluster so it can build the DataFrame it has created. When I run it, I notice that because the default option first_row_is_header is set to false, my header row is being treated as data. I can fix that by changing the option to true.
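In the generated notebook, the read might look roughly like this after flipping the header option (a sketch: `spark` is the session Databricks predefines in every notebook, and the file path below is illustrative; use the location the upload step reported):

```python
# `spark` is predefined in a Databricks notebook.
# Illustrative path; use the location shown by the upload step.
file_location = "/FileStore/tables/people.csv"

df = (spark.read.format("csv")
      .option("header", "true")  # first row is the header, not data
      .option("sep", ",")
      .load(file_location))

display(df)
```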
- I also notice that it’s assuming all the columns are strings. Since the column headings were read in as data, the first value in each column is text, which in turn makes Spark type every column as a string.
- Next, I’ll enable inferSchema so that, if the types come out correct, I can go on and do some analysis. One thing I would probably do is take the birth date and parse out the year and month to analyze it that way. But I notice that it doesn’t detect the birth date as a datetime; it calls it a string. The other columns are typed correctly.
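With type inference enabled as well, the read might look like this (still a sketch; `spark` comes from the notebook environment and the path is illustrative):

```python
df = (spark.read.format("csv")
      .option("header", "true")
      .option("inferSchema", "true")  # let Spark guess column types
      .option("sep", ",")
      .load("/FileStore/tables/people.csv"))  # illustrative path

# Inspect the inferred types; the birth date column still
# comes back as a string rather than a timestamp.
df.printSchema()
```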
- I’ll then add another cell and reuse the generated code, except I’m going to add in a schema. I need to import the PySpark SQL types; the ones you need are whichever match the data types you’ll be using. StructType declares the schema as a whole, and a StructField describes each column.
- A StructField takes three arguments: the column name, the data type, and a flag indicating whether nulls are allowed. In my case, I’m leaving nulls allowed (True) on every column for simplicity.
- In the next cell, you’ll see that inferSchema is now false, since I’m declaring the schema myself. I’ll still use spark.read with the format set to CSV. What we’ve added to the previous version of building the DataFrame is the schema option, which applies the people schema.
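Putting it together, the cell that reads with the declared schema might look like this (again a sketch: `spark` is the notebook’s session, `people_schema` is the schema declared in the previous cell, and the path is illustrative):

```python
df = (spark.read.format("csv")
      .option("header", "true")
      .option("inferSchema", "false")  # we supply the schema ourselves
      .schema(people_schema)           # the schema declared above
      .load("/FileStore/tables/people.csv"))  # illustrative path

display(df)
```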
- I hit Shift+Enter and it pulls back the same data, but now the birth date comes through as a timestamp, which is what we want.
- Now, let’s do some analysis on that. My first query is an aggregation that groups by the year extracted from the birth date and computes the average of the salary column: the average salary for each birth year.
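That aggregation can be written with the DataFrame API; a sketch, assuming the illustrative column names used earlier (`birthDate`, `salary`):

```python
from pyspark.sql.functions import year, avg

salary_by_year = (df
                  .groupBy(year("birthDate").alias("birthYear"))  # group by birth year
                  .agg(avg("salary").alias("avgSalary"))          # average salary per year
                  .orderBy("birthYear"))

display(salary_by_year)
```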
- Once we’ve run that, we have some useful data to do some analysis on. When you hit the plot button, you get options for different plot styles (bar chart, line chart, etc.) and you can customize the plot. So, in my example, I want the salary average as the values, and I want the birth year along the Y axis.
- Another thing you could do is switch to SQL. For this, you’d register the DataFrame as a temporary view by giving it a table name. This table exists only in this particular notebook and session, then it disappears. By default, the name is based on the file, people_csv in this case.
- Once we’ve created this temp view, I can use the %sql magic command to switch the cell to SQL and then run SELECT * FROM people_csv, or any other SQL query. We’ll still have the plot options, so we could create a bar chart and do the same analysis as before.
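The temp-view step can be sketched like this (the view name matches the default described above; `spark.sql` is the programmatic equivalent of running the query in a `%sql` cell):

```python
# Register the DataFrame as a temporary view, visible only in this
# notebook session; it disappears when the session ends.
df.createOrReplaceTempView("people_csv")

# Equivalent to a %sql cell containing: SELECT * FROM people_csv
display(spark.sql("SELECT * FROM people_csv"))
```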
I hope this helped show how simple it is to upload a file, bring it into Databricks as a table, do some quick analysis on it, and easily create a graph of that analysis. If you’d like more information or help with working with Azure Databricks, you’re in the right place. Contact us and our team of experts will be happy to help at 888-8AZURE or [email protected]