Butler to Philly: Find the Best Butler Service in Philly Now!

Alright, so today I’m gonna walk you through this little project I tackled: “butler to philly.” Sounds kinda fancy, right? It really wasn’t. It’s just about moving some data around, but hey, gotta make it sound interesting, yeah?

First things first, I had this data sitting in a Butler database. You know, Butler, the thing the LSST project uses for managing its data. The goal was to get that data (specifically, some source catalogs) into a more accessible format, Parquet, and store it somewhere I could easily grab it on Philly (that's the NERSC supercomputer, in case you're not in the know). Why? Because I wanted to run some stuff on it, and Philly's way faster than my local machine. Plus, all the cool kids are using Philly.

So, the very first thing I did was figure out how to even access the Butler data. This meant digging through the LSST documentation, which, let’s be honest, can be a bit of a rabbit hole. After some trial and error, I managed to get the Butler set up and pointing to the right location on disk where the data was chilling.
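In case it helps, here's roughly what that setup boils down to. This is just a sketch, assuming the standard Gen3 `lsst.daf.butler` API; the repo path and collection name are placeholders, not the real ones.

```python
# Rough sketch of pointing a Gen3 Butler at an on-disk repo.
# The repo path and collection name are placeholders.
from lsst.daf.butler import Butler

repo = "/path/to/butler/repo"
butler = Butler(repo, collections=["some/collection"])
```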

Next up was the actual data extraction. I wrote a Python script using the Butler Python API to read the source catalogs. This involved querying for the datasets I wanted (source catalogs), iterating through them, and pulling each one out as a data frame. It was a bit tedious, mostly because dealing with large datasets always is.
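The loop looked roughly like this. Again, just a sketch: the dataset type name (`"src"`) and the collection are assumptions, and your repo's names will almost certainly differ.

```python
# Continuing the sketch above; dataset type and collection are assumptions.
from lsst.daf.butler import Butler

butler = Butler("/path/to/butler/repo", collections=["some/collection"])

tables = []
for ref in butler.registry.queryDatasets("src", collections=["some/collection"]):
    catalog = butler.get(ref)                        # afw SourceCatalog
    tables.append(catalog.asAstropy().to_pandas())   # convert to a pandas DataFrame
```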

Once I had the data frames in Python, I converted them to Parquet format using `pandas` and `pyarrow`. This is pretty straightforward, but you gotta make sure your data types are all kosher. Parquet is picky about that kinda stuff. I ran into a couple of snags with data types, but a little bit of type casting sorted that right out. And compression! I used snappy compression, because, you know, gotta save on space.
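The conversion itself is basically one call, plus whatever casting Parquet forces on you. Here's a minimal sketch; the `"parent"` column is made up, so cast whatever your own schema actually complains about.

```python
import pandas as pd

def write_parquet(df: pd.DataFrame, path: str) -> None:
    """Cast the awkward columns and write snappy-compressed Parquet."""
    # "parent" is just an example; fix whatever pyarrow complains about.
    if "parent" in df.columns:
        df["parent"] = df["parent"].astype("int64")
    # engine="pyarrow" needs the pyarrow package installed.
    df.to_parquet(path, engine="pyarrow", compression="snappy")
```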

Then came the transfer to Philly. This was probably the easiest part: I just used `scp` to copy the Parquet files over to my account on the NERSC machine. If you're dealing with much larger datasets, though, `globus` is worth using instead.

Now, once the data was on Philly, I wrote another Python script to verify that the transfer was successful. It just reads the Parquet files back and prints out some basic statistics, like the number of rows and columns, to make sure everything made it over intact.
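The verification script is about as simple as it sounds; something like this, with the path being a placeholder for wherever the files landed.

```python
import glob
import pandas as pd

# Sanity check: row/column counts for every transferred file.
for path in sorted(glob.glob("/path/on/philly/*.parquet")):  # placeholder path
    df = pd.read_parquet(path)
    print(f"{path}: {df.shape[0]} rows, {df.shape[1]} columns")
```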

I also set up a basic data pipeline to automate this process: a simple `bash` script that runs the Python extraction script, converts the data to Parquet, transfers the files to Philly, and runs the verification script. Not super fancy, but it gets the job done, ya know?
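The real wrapper is bash, but to keep the sketches here in Python, a rough equivalent looks like this. The script names, host, and paths are all placeholders.

```python
import subprocess

# Rough Python equivalent of the bash wrapper; names and paths are placeholders.
steps = [
    ["python", "extract_from_butler.py"],    # Butler -> pandas -> Parquet
    ["scp", "-r", "parquet_out/", "user@philly:/scratch/source_catalogs/"],
    ["ssh", "user@philly", "python /scratch/verify_parquet.py"],
]

for cmd in steps:
    subprocess.run(cmd, check=True)          # bail out if any step fails
```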

Here’s the key takeaway: This whole process wasn’t about some super complicated algorithm or mind-blowing data science. It was about taking data from one place, transforming it into a more useful format, and getting it to another place where I could actually use it. Data wrangling, data engineering, whatever you wanna call it. It’s a huge part of any data science project, and it’s something you gotta get good at.

I hit a few bumps along the way, like the data type issues and figuring out the Butler API. But that’s part of the fun, right? You learn something new every time. And now I have a pretty streamlined process for getting LSST source catalogs onto Philly, which is gonna save me a ton of time in the long run.

So yeah, that’s “butler to philly” in a nutshell. It’s not rocket science, but it’s a practical example of the kind of stuff you end up doing in the real world. Hope it was helpful!
