I'm now living in an awesome condo with an amazing view of Austin. I sit on my balcony all the time. I laid down some fake grass turf, added cute little hippie decor, and strung it with solar lights. It's my favorite place in my apartment and a no-work zone: laptops not allowed.
Last night when I was out there watching the sunset flight of the bats over the Congress Ave bridge, I thought about how few times I see other people outside on their balconies. Sad! But then I don't sit out there all day, so maybe they just don't come out at the same times I'm out there.
I started thinking about how balcony furniture could be an indicator of who uses their balcony (let's ignore how often). So here's the data science problem: what % of people use their balconies? There are 430 condos in my building. So, if we could count the number of balconies with furniture, we could find out what % of people use their balconies. Here's where sampling a population comes in: do we need to count EVERY apartment to get an accurate measure of balcony-goers? I can only see some balconies from my place, and I am not gonna go walk around with binoculars counting each apartment, not looking to be labeled creepy. ;) But luckily, we CAN get a good estimate of the % by just observing enough of the balconies. This is the idea behind the Law of Large Numbers and the Central Limit Theorem: if we randomly sample enough balconies, the % in our sample will land close to the % for the whole building.
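To see why a sample can stand in for the whole building, here is a small simulation sketch. It assumes a made-up "true" furniture rate of 63% (the `true_rate` value is purely hypothetical, since the real figure is exactly what we don't know), builds a pretend building of 430 units, and repeatedly draws random samples of 43 to see how far the sample % wanders from the truth.

```python
import random
import statistics

def simulate_estimates(n_units=430, true_rate=0.63, sample_size=43,
                       n_trials=2000, seed=0):
    """Draw many random samples from a pretend building and return the sample rates."""
    rng = random.Random(seed)  # fixed seed only so the run is repeatable
    n_furnished = round(n_units * true_rate)  # hypothetical count of furnished balconies
    building = [1] * n_furnished + [0] * (n_units - n_furnished)
    estimates = []
    for _ in range(n_trials):
        sample = rng.sample(building, sample_size)  # sampling without replacement
        estimates.append(sum(sample) / sample_size)
    return estimates

estimates = simulate_estimates()
print(f"average of the sample estimates: {statistics.mean(estimates):.3f}")
print(f"typical spread (std dev):        {statistics.stdev(estimates):.3f}")
```

The average of the estimates should sit very close to the pretend true rate, while the spread shows how much any single sample of 43 can be off. Bumping `sample_size` up shrinks that spread, which is the "more samples, more accuracy" trade-off.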
Alright, so how do we do this? How many do we need to sample? A common rule of thumb is about 10% of the population. Accuracy goes up with more samples, but it's extra work with diminishing gains. So we are gonna sample 10% of the condos (43), but we have to make sure we take these samples RANDOMLY. Even if I could see 43 apartments from mine, that wouldn't be accurate, because it isn't random. I can't see lower apartments, and this building has east- and west-facing units. Why is that important? Well, it's possible the percentages are different for lower units; maybe they don't have a great view. Maybe the %s are different for east and west. That's ok. We just want an overall figure, so our 43 samples must be randomly selected from the entire building.
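Picking the 43 units at random is easy to hand off to a computer. This is a minimal sketch that assumes the units can simply be numbered 1 through 430 (a hypothetical numbering; real unit numbers would need a lookup list) and lets Python choose which ones to observe.

```python
import random

TOTAL_UNITS = 430
SAMPLE_SIZE = 43  # 10% of the building

rng = random.Random(42)  # fixed seed only so the example is repeatable
unit_ids = range(1, TOTAL_UNITS + 1)  # hypothetical numbering 1..430

# rng.sample picks without replacement, so no unit is chosen twice
units_to_check = sorted(rng.sample(unit_ids, SAMPLE_SIZE))
print(units_to_check)
```

Because every unit has the same chance of being picked, low floors, high floors, east and west all get represented in roughly their true proportions.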
Alright, so let's say I was able to casually and not creepily get 43 samples selected from the 1st to the 44th floor. Hypothetically, let's say I found 27 with furniture, 16 without. Great, 62.8% of people go outside (27/43).
But should we stop there? Do you see any problems we might have with this figure? We listed in our assumptions that furniture indicates those people go outside, but what if a portion of those without furniture go outside? Maybe they throw out a towel to tan or go out there to smoke. Our calculation also assumed that every unit was occupied. Ok, so let's say I go ask the building manager how many apartments are occupied. Let's pretend she tells me 380 of them are occupied. Now we can adjust our calculation.
380/430 = 88.4% occupancy rate. We know the 27 balconies with furniture are occupied, because people must remove their stuff when they move out, so the figure we have to adjust is the no-furniture one. We saw 16 condos without furniture, and these are the ones that might not be occupied. So we keep 88.4% of those 16. Let's do it: 16 condos x 0.884 = 14.14, so we're gonna call it 14. That means from our sample of 43: 27 with furniture, 14 without, and 2 unoccupied units. Since we only observed 41 occupied apartments, our new calculation is 27/41 = 65.9%, up from our previous calculation of 62.8%. (This is a rough adjustment. Every vacant unit has to be in the no-furniture group, so a stricter version would expect about 43 x 11.6% = 5 vacant units among those 16, which gives 27/38 = 71%. Either way, the figure goes up.)
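Here is the same arithmetic as a short Python sketch, following the adjustment described above: scale the no-furniture count by the building's occupancy rate, then recompute the %. The function name `balcony_usage` and its parameter names are just illustrative.

```python
def balcony_usage(with_furniture, without_furniture, occupied_units, total_units):
    """Return the raw rate, occupancy rate, occupied no-furniture count, and adjusted rate."""
    sample_size = with_furniture + without_furniture
    raw_rate = with_furniture / sample_size  # assumes every unit is occupied
    occupancy_rate = occupied_units / total_units
    # Keep only the share of no-furniture units we expect to be occupied
    occupied_without = round(without_furniture * occupancy_rate)
    occupied_in_sample = with_furniture + occupied_without
    adjusted_rate = with_furniture / occupied_in_sample
    return raw_rate, occupancy_rate, occupied_without, adjusted_rate

raw, occ, occupied_without, adjusted = balcony_usage(27, 16, 380, 430)
print(f"raw: {raw:.1%}, occupancy: {occ:.1%}, adjusted: {adjusted:.1%}")
# prints "raw: 62.8%, occupancy: 88.4%, adjusted: 65.9%"
```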
This is a simple example of sampling, but it comes up ALL the time. The Census counts everyone every 10 years, but in the years between they still wanna know what's going on, so they sample the population. Nielsen uses a sample audience to get TV ratings. When biologists are trying to count fish in a lake, they tag a portion of the fish, then later catch a sample of fish and see how many have tags. If they pulled 10 fish and only 1 had a tag, it tells them the portion they tagged is only about 10% of the total number of fish. You know when it looks like a fast-moving car's wheels are spinning backwards? That's because our eyes aren't sampling fast enough to see what's actually happening, so we miss the frames that would show us the wheels are obviously moving forward. I had a cool project at McLane, now in production, where sampling accuracy was the central tenet.
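The fish-tagging example above can be sketched in a few lines. The catch numbers (10 fish, 1 tagged) come from the example; the number of fish originally tagged (50) is a made-up value just for illustration, and `estimate_population` is a hypothetical helper name.

```python
def estimate_population(tagged_total, caught, caught_with_tag):
    """Estimate the lake's fish count from a tag-and-recapture sample."""
    if caught_with_tag == 0:
        raise ValueError("need at least one tagged fish in the catch to estimate")
    # If tagged fish are caught_with_tag / caught of the catch, assume they are
    # the same share of the whole lake and scale the tagged count up to match.
    return tagged_total * caught / caught_with_tag

# Hypothetical: 50 fish were tagged; later 10 were caught and 1 had a tag
print(estimate_population(tagged_total=50, caught=10, caught_with_tag=1))  # prints 500.0
```

With such a tiny catch the estimate is very rough, which is the same sample-size trade-off as with the balconies.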
Hope you enjoyed learning a little about sampling!