Using Cron to Automate Running Python and R Scripts

When you’re working on data science projects, you might have scripts or processes that you want to run each day. For example, I’m working on a project to model how full trailhead parking lots are, based in part on weather. I want to collect the weather forecast each day so I can use it to make predictions. Instead of manually running the script every day (which I would probably forget to do anyway), I set up a cron job that runs the script automatically each day. Cron can also be useful for automating other processes, such as backups.
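As a rough sketch of what such a setup can look like (not necessarily the one from the full post), the script below just appends a timestamped line to a log file, and the comment at the top shows a crontab entry that would run it every day at 06:00. The script name, the paths, and the `log_run` helper are all hypothetical and only there for illustration.

```python
# Hypothetical crontab entry (added with `crontab -e`) to run this script daily at 06:00.
# The five fields are: minute, hour, day of month, month, day of week.
#   0 6 * * * /usr/bin/python3 /home/user/collect_forecast.py
import tempfile
from datetime import datetime
from pathlib import Path


def log_run(log_path):
    """Append a timestamped line so each scheduled run leaves a record."""
    stamp = datetime.now().isoformat(timespec="seconds")
    with open(log_path, "a") as f:
        f.write(f"{stamp} forecast collection ran\n")
    return stamp


if __name__ == "__main__":
    # Assumed location for the log; a real job would pick its own path.
    log_run(Path(tempfile.gettempdir()) / "collect_forecast.log")
```

In a real job, the body of `log_run` would be replaced by whatever work the script needs to do, such as downloading the day's forecast.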

Read More

Citibike Part 3: Modeling

This is the third and final post in a series on a project to analyze NYC Citibike data. The first post focused on collecting and preparing the data, and on a strategy for dealing with a very large dataset. In the second part, I did some EDA (exploratory data analysis) of the data. In this final post, I’ll focus on modeling, with the aim of predicting the number of Citibike rides taken on a particular day.

Read More

Citibike Part 2: EDA

This is the second in a series of blog posts I’ll be writing on a project to analyze NYC Citibike data. The first post focused on collecting and preparing the data, and on a strategy for dealing with a very large dataset. In this part, I’ll be doing some EDA (exploratory data analysis) of the data to identify patterns and inform modeling, which I’ll detail in part 3 of the series. There are a ton of different questions I could explore in this dataset; I’ve chosen to focus on the number of rides each day and the factors that affect it. My ultimate goal is to try to predict the number of rides taken across the system for a given day. Below I’ll show some of the findings from my EDA; you can see my EDA notebook here.

Read More

Citibike Part 1: Data Collection and Preparation

This is the first in a series of blog posts I’ll be writing on a project to analyze NYC Citibike data. In this part, I’ll discuss getting and storing the data, and my workflow for analyzing a very large dataset. I’ve written about some of the pieces of this workflow previously, and will include links to those posts where appropriate.

Read More

Tracking Our Water Usage

Introduction

Our water service recently changed to the Consolidated Mutual Water Company (CMWC), which covers a lot of the Lakewood/Wheat Ridge area west of Denver. Unlike our previous service, they record our water usage at hourly intervals and give access to the data on their website. I love tracking things and collecting and analyzing data, so I was super excited to find this out. This is a quick first look at that data and the patterns in our water usage. I’ll be using R for the analysis, with the packages dplyr, ggplot2, and lubridate.

Read More

Winter Is Coming (with higher energy bills)

Winter is coming, but for us that means higher energy bills, not White Walkers :). We’ve had a few unseasonably warm days and not much snow yet, but it’s definitely colder and we are now using our heat almost every night. I thought it would be fun to explore our energy usage data in R.

Read More

Clear Creek Rising: Analyzing USGS Stream Gauge Data

One of my favorite things about Golden is Clear Creek. It runs right through downtown and is lined by walking/biking paths and parks. Most of the year it is an idyllic, slow-flowing creek perfect for tubing, swimming, and letting your dog cool off. On some hot summer days, there are so many tubers that it looks like a lazy river at a water park.

Read More

Amazon S3 and Python

Amazon S3 is a cloud storage service offered by Amazon Web Services (AWS). You might use S3 as a backup for your files. If you work on multiple computers, you can store data on S3 so you can access it from any device. You also might want to use S3 to store data for a web app built with something like Dash or Shiny. A nice thing about S3 is that you can interact with it (and other AWS services) from Python using the boto3 module. I’ve been learning how to do this recently, and below I’ve compiled examples of how to do some of the most common tasks.

Read More

Using a SQL database w/ Python and Pandas

When you’re working with small enough datasets, you can simply load the data into memory and work with it in Python. But sometimes you have so much data that loading it into memory is either impossible or very slow. In that case, storing your data in a SQL database might be a good option. A SQL database allows you to run queries on large datasets much more efficiently than if the data were stored in CSV format.
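To make the idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `rides` table, its columns, and the sample rows are made up for illustration; the point is that the database does the aggregation, so only the small summary result comes back into memory.

```python
import sqlite3

# In-memory database for illustration; a real project would connect to a file on disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rides (ride_date TEXT, duration_min REAL)")
conn.executemany(
    "INSERT INTO rides VALUES (?, ?)",
    [("2019-06-01", 12.5), ("2019-06-01", 30.0), ("2019-06-02", 8.0)],
)
conn.commit()

# Let the database count rides per day instead of loading every row into Python.
query = """
    SELECT ride_date, COUNT(*) AS n_rides
    FROM rides
    GROUP BY ride_date
    ORDER BY ride_date
"""
daily_counts = conn.execute(query).fetchall()
print(daily_counts)  # [('2019-06-01', 2), ('2019-06-02', 1)]
conn.close()
```

If pandas is installed, the same query can be loaded straight into a DataFrame with `pandas.read_sql_query(query, conn)` while the connection is still open.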

Read More

Simple Gradient Descent

If you’re learning data science, and especially machine learning, you’ve probably heard of gradient descent. When I first heard of it, it sounded a bit complex and intimidating. I knew it was some sort of optimization method, but imagined it involved some pretty complicated math. However, it turns out to be pretty simple. The purpose of this post is to explain the basic concept in simple terms and show you that you can easily understand it.
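As a small illustration of the basic idea (a sketch, not necessarily the example used in the full post), the code below minimizes the toy function f(x) = (x - 3)^2 by repeatedly taking a small step in the direction opposite to the gradient. The function, starting point, learning rate, and number of steps are all arbitrary choices made for this example.

```python
def gradient_descent(grad, x0, learning_rate=0.1, n_steps=200):
    """Start at x0 and repeatedly step against the gradient."""
    x = x0
    for _ in range(n_steps):
        x = x - learning_rate * grad(x)
    return x


def grad_f(x):
    """Gradient of the toy objective f(x) = (x - 3)**2, which has its minimum at x = 3."""
    return 2 * (x - 3)


x_min = gradient_descent(grad_f, x0=0.0)
print(round(x_min, 4))  # 3.0
```

Each step shrinks the distance to the minimum by a constant factor here, which is why a fixed learning rate works for this simple function; a learning rate that is too large would make the steps overshoot instead.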

Read More

You should be using dplyr and the pipe!

Dplyr (https://github.com/hadley/dplyr) is a great R package for manipulating, exploring, and summarizing data frames in R. As far as I know it doesn’t do anything that can’t be done in base R, but it makes all the common data analysis tasks easier, more readable, and faster. If you work with data frames in R, I highly recommend learning dplyr and how to use it with the ‘pipe’ operator. To get started, check out the vignette, or the great course on DataCamp (https://www.datacamp.com/courses/dplyr-data-manipulation-r-tutorial).

Read More

Building a Shiny App for Data Exploration in R

I just finished the “Data Products” course, part of the JHU Data Science Specialization on Coursera that I’m working on. As part of the course, I created a “Shiny” web app in RStudio. Shiny (http://shiny.rstudio.com/) is a pretty easy-to-use framework for turning an R analysis into a web app without any knowledge of HTML/JavaScript/CSS (though if you do know those, you can easily customize your app and make it even better). It seems like a really nice tool for sharing your analysis and/or for visualizing and exploring a dataset. You can host the app at shinyapps.io with one click from RStudio (you are allowed a certain number of free apps with usage limits, and can upgrade to a paid plan). You can check out my app at https://andypicke.shinyapps.io/ExamineNYCFlights13/. It allows you to explore the nycflights13 dataset in R.

Read More

Using R Manipulate for Data Exploration

Recently for my research, I needed to investigate the effect of varying thresholds for several variables on the distribution of a new variable. I had been working in Matlab and had a script to plot the distributions given specified thresholds, but I was frustrated that I couldn’t do this more easily. To see the effects, I would either keep changing the thresholds manually and re-executing the script, or write a loop to cycle through values, but neither of these options seemed great. Fortunately, I’ve been learning R and heard about the ‘manipulate’ package, which seemed like the perfect solution. Manipulate (https://support.rstudio.com/hc/en-us/articles/200551906-Interactive-Plotting-with-Manipulate) allows you to make a plot and alter variables via a slider or checkbox; every time a variable is altered, the plot is re-made. Now I could simply adjust my variables via slider bars and instantly visualize the effects on the distribution. The code example below creates a function to plot a histogram of ‘gamma’, which depends on the variables eps, n2, and dtdz. Only values of gamma where those variables meet the threshold criteria are plotted. Then I call manipulate to create this plot with the interactive sliders.

Read More

Weather and Weddings

Kate and I are planning to have our wedding in Colorado, so I thought I’d look at some historical weather data and try to figure out the best dates (weather-wise). Check out the results:

Read More

Weather and Bikes

I recently attended “Transportation Camp” in Boulder. Motivated by that, I decided to analyze some data from Denver’s Bcycle bike-rental program. Since I know that for me the weather is probably the biggest factor in whether I ride my bike, I wanted to see how much of an effect it had on Bcycle rides in Denver. You can check out my analysis here: https://github.com/andypicke/Bcycle/blob/master/BcycleDenver.md.

Read More

I'm up and running!

Got my new webpage running! I decided to use GitHub Pages w/ Jekyll. Found a super helpful guide at https://www.smashingmagazine.com/2014/08/build-blog-jekyll-github-pages/ that got me set up in a few minutes. The author of that post made a template repository (https://github.com/barryclark/jekyll-now); all I had to do was fork it to my account and edit the config file. Good things about this approach:

  • Easy to start (literally 2 minutes).
  • To add new posts, I just write a Markdown file. GitHub Pages takes care of converting it to HTML, etc.
  • No need to download and configure all the software and dependencies needed to build the pages locally.
  • No databases needed.
Read More