Samuel Taylor – Blog

How to Join a New Team and Learn a New Codebase

Samuel Taylor — Sat, 07 Nov 2020 15:27:50 GMT

Alternate title: new codebase who dis?

Delivered at:

Southern DevFest 2020. Slides available here.

Find me on Twitter @SamuelDataT.

Transcript

0:00
Alright, thank you all so much for being here. It's great to be here. Let's get started. When I started studying computer science in college, I realized that the software industry is kind of like this pizza. Stay with me; I know that's a weird statement. When I was in college, I was basically looking at this pizza and could only see one little triangle of it, and that slice had some pepperonis on it. Those pepperonis are the things I knew about at the time. So I was thinking that when I went into the software industry, I was going to be all about figuring out how to balance AVL trees, quadratic versus linear programming, hash tables, and all the stuff I was learning in my classes. All of that was super interesting and, I think, really valuable. But once I got into the industry, I realized it was actually a little bit more like this: the software industry is the whole pizza, and there are a lot of other things in it that are not just really interesting algorithm problems. This part of the pizza that has just cheese on it is still really great and still really important. It's things like knowing how to set up Jenkins builds, knowing how to configure IAM properly, and figuring out your build tools, like understanding how Maven works. Those things are all super important, and without them you would not have the pizza. So what I want to talk about today is one of these areas that is not a pepperoni: how to join a new team, and specifically how to learn a new codebase. I think it is really difficult to teach this, and I really wish that when I was in school, someone had at least tried. So that's what I'm going to try to do for you in the next little bit of time. We'll see how it goes.
I think there'll be plenty of tips in here for you if you are a new developer, and if you are relatively experienced, I hope to have a few tips in here that'll be helpful too. I know there are a lot of other really interesting talks, so before I jump in, I'll give y'all a chance: there are other cool tracks happening. If you don't want to learn about learning a new codebase, now's the time for you to leave, because that's what we're going to be talking about today. This is the way we're going to do that. Right now, you can probably tell that you're in the introductory period. After that, we're going to talk about what I think you should do on day one when you join the team. Then we're going to talk about the mindset you should have when you are reading and writing code, before moving on to the process that I recommend, and always try to use, when reading and writing code. And then finally, we'll talk about some tools that can make this all a little bit easier. So on day one, there are three things that I personally believe you should do. The first of those is to set up your development environment. When you're doing this, make sure you are paying really close attention. There are a lot of hints in this process that can help you understand what sort of world you're getting into on this new team. You should try to understand what services you're running as you get your development environment going. For instance, if you're suddenly running Redis, you might think to yourself, "Ah, we have some sort of caching layer. I wonder what that's about." This can also give you some hints as to what dependencies exist between services. For instance, if you have one project that needs to be running before you can get another project running, that's a hint that the second project probably needs the first one in order to run, and there's some interrelation there.
You should also take specific notes on what you're doing. You can do this on a piece of paper, but I find it really helpful to just open up some sort of note app on my computer and literally copy and paste in the commands that I'm running. I tend to come from the Python world, and anyone who has ever tried to set up a Python development environment has found it really helpful to know, later, exactly how they screwed up their system Python installation. So it's good to make sure that you know exactly what you're doing. The last thing I would say about setting up your development environment is that ideally, the project you're working on will have some sort of test suite. If you can get that test suite running and make sure that all the tests are green, that probably means you've got things set up correctly. So that's a win on its own. But once you've got all those green tests going, the next thing you should try to do is break those tests: literally open up a random file in the project, delete some lines, and see what tests break. This will start to give you an understanding of what pieces of code relate to what other ones.
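To make the "break the tests" experiment concrete, here is a minimal sketch in Python. Everything in it is hypothetical and invented for illustration (the `normalize_score` function, its tests, and the `failing_test_names` helper); in a real project you would delete a line in an actual file and rerun the real test suite. The second function stands in for the original with one line deleted.

```python
import unittest


def normalize_score(raw):
    """Convert a raw score to an int and clamp it into the range 0..100."""
    score = int(raw)
    score = max(score, 0)
    score = min(score, 100)
    return score


def normalize_score_with_line_deleted(raw):
    """The same function after 'deleting' the line that handles negatives."""
    score = int(raw)
    score = min(score, 100)
    return score


def make_tests(function_under_test):
    """Build a TestCase class that exercises the given implementation."""

    class NormalizeScoreTests(unittest.TestCase):
        def test_negative_scores_become_zero(self):
            self.assertEqual(function_under_test(-5), 0)

        def test_large_scores_are_capped(self):
            self.assertEqual(function_under_test(250), 100)

        def test_strings_are_converted(self):
            self.assertEqual(function_under_test("42"), 42)

    return NormalizeScoreTests


def failing_test_names(test_case_class):
    """Run the tests quietly and return the names of the ones that fail."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(test_case_class)
    result = unittest.TestResult()
    suite.run(result)
    broken = result.failures + result.errors
    return sorted(test.id().split(".")[-1] for test, _traceback in broken)


if __name__ == "__main__":
    print(failing_test_names(make_tests(normalize_score)))
    print(failing_test_names(make_tests(normalize_score_with_line_deleted)))
```

With the intact function nothing fails. With the line removed, only the negative-score test breaks, which tells you which test was guarding that line.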

4:36
The next thing you should do on day one is ask some senior members of the team to give you an overview of the architecture you're going to be working on. The first thing you can do here is try to find some sort of document or diagram. It doesn't always exist, but ideally there's some sort of wiki page where you can see a drawing of the way things are laid out. When you find it, make sure you check when the page was last edited, because in some places and in some situations, the architecture of a system can change pretty rapidly. If you're looking at something that is old, it might not relate to the current world, and it would be really sad for you to learn the wrong things. You should ask your team a lot of questions. In a healthy team that is functioning well, people will be excited to answer your questions, because they know that answering them is going to help you become a more effective teammate. Even just for selfish reasons, when I have new people joining teams that I'm on, I want to give them as much information as I possibly can, so that they can start doing some of this work and I don't have to do it. That's the whole goal. So don't ever feel bad about asking questions. A few questions you should think about asking: What repositories do we own? What things are we working in frequently? Are there any repositories we share with other teams? How is this all structured? Another great question to ask is: how does a feature get from running on my laptop to actually being visible in production? This will give you some really good insight into the deployment pipeline, which can be really helpful to know about. Depending on what size of company you're in, you may need to know a lot about that, or you may not need to know very much about it at all.
You will also want to ask what vendors and what APIs you are relying on, so that you can get a sense for what those things are. A really helpful early win that you can provide to the rest of your team is, at the end of this process, to create a new architecture diagram, or at least update the existing one if it already exists. This is a nice way to build some early goodwill with your team. You can roll up to everybody and say, "Hey, I made this architecture document. I believe it's accurate based on what I've heard. Here's what I have." This can be really helpful not just for the other people on your team, but also for new members who will probably join your team in the future. The third thing you should do on day one is figure out what the business does. If you don't, you are completely doomed; you will never be successful if you don't understand the business, at least to some degree. It's good to know what the mission of the company is, what sort of products it offers, and the goals that it has. Different companies will do this in different ways and use different goal-setting frameworks, but it's really nice to know what you are trying to do as a company. Once you know that, you need to know how your team is going to contribute to it. So it might be, you know, "We work on this part of the product, which helps us achieve this specific goal, and the way we can achieve that goal is to improve conversion rates over here," something like that. Great questions to ask include: How can we impact the goals of the company? Who will get mad if our code breaks, just horrifically? That second one is really useful, because it not only tells you who relies a lot on your software, but it can also give you a sense for how careful you should be. Obviously, it's never a great idea to break production.
But it's definitely good to know: if I break production for one minute, am I going to lose the company one cent or $1 million? There's a significant difference between those two things. Understanding those trade-offs, and the context in which you're working, is very important.

8:18
Okay,

8:19
that's day one. The next thing I want to talk about is the mindset you should use when you are working in a new codebase, particularly on a new team. This can all be summarized as "learn by doing." There's a book I read recently that I really enjoyed called Ultralearning, by Scott Young. This book is about how to teach yourself things. Scott Young is this sort of autodidactic figure who has spent a lot of time thinking and writing about productivity and learning. He's done a bunch of crazy things, including going through all four years of an MIT computer science curriculum in one year. Then he wrote this book about people who do those kinds of crazy learning projects, and it's really interesting to hear about some of the principles for learning effectively. One of the ones that resonated super strongly with me is the idea of directness. Scott Young says, "The easiest way to learn directly is to simply spend a lot of time doing the thing you want to become good at." In this case, we want to spend a lot of time working on this team. We are joining the team to produce valuable software, so what you should do is just start doing it: start working on the software, start building the stuff. That's the best way to get better. Making an impact is the best way to gain any kind of deep understanding. As you develop and read, what you want to make sure you're doing is understanding the code you are working on well enough that you can make the change you need to make, so that you can have the effect you want to have and make an impact on your team and on your product. What you don't want to do is read every single line. I think this is a trap that can be easy to fall into: thinking, "I need to understand every part of this program, I need to understand every part of the architecture, before I start messing with things."
That was something I sort of felt at the beginning of my career, and I now understand that you will gain that understanding over time. Actually, sort of counterintuitively, the fastest way to gain that understanding is by doing the work itself, rather than reading stuff without any sort of directed goal. Being able to say, you know, "I need to make our scoreboard service produce 20 high scores instead of 10" gives you a goal to work toward, makes your brain think about things harder, and makes sure that you're understanding things better.

10:43
When we talk about reading code, I worry that we've overloaded the word "read" too much. Reading a book is incredibly different from reading code, and I don't think the differences between them are elucidated strongly enough. As an example, imagine if the Lord of the Rings series were written as a software artifact. You would get things like: okay, there aren't three books that are each this long; you would have maybe 1000 books that are all four pages long. You'd have a case where some of these little tiny books are bound together. There's this one book that has all the magic in it, except that any magic related to grass is actually over in this other one for some weird historical reasons. And actually, all the sword fighting also happens in that one, except that sword fighting with magic is back in the magic one. You can see how this would get very complicated very fast. In writing a book, you're trying to create a coherent narrative for people to follow. Ideally, when you are writing code, you are also trying to create a coherent narrative, but the structures we have for doing that are dramatically different. When we're writing code, we don't just have one very long file. If you do have just one very long file, that's not a great time. Ideally, you're able to break things up into smaller units that can then be understood more easily. And that leads to something I think is an interesting concept: your knowledge will grow in sort of a recursive way. You start at the top. On day one, you're getting this architectural overview and understanding what the various services do. And then you're going to get a ticket.
Ideally, your team has teed something up for you that would be a good fit for a new person, to help them learn some pieces of the system, and you get this new ticket. So now you need to figure out, "Okay, I'm going to have to modify this service in order to get this thing done." Then you're going to dive into that service and say, "Okay, here are the modules that are in that service," or however the service ends up being organized, "here's what these different things are and, more or less, what they do." Then you can try to understand, "Okay, I think I'm probably going to need to modify this particular module. What are the classes that are in here? What are they all doing?" You can progress down this path until, at a certain point, you get to an individual line of code. If you're writing code, reading a single line of code is probably very easy for you. Unless there's some weird syntax that you're not familiar with, reading the line is relatively self-explanatory. You can see a line and say, "This line is incrementing by one," or "This line is creating a new database access object." That's not particularly complicated. It's all the steps above it that are hard. What I would like to say is that in this recursive pattern, when you get down to understanding individual lines of code, the next thing you should be doing is creating chunks. What I mean by chunks is this concept from neuroscience, where chunking refers to the brain's ability to bind detailed information to a concept that is easy to remember.
So for instance, I don't have to keep in my head that a particular function call gets a handle to the database, builds a query that it wants to run, includes certain parts of the query based on the parameters passed to the function, runs the query, handles errors, comes back with a result, and modifies it in some way. There are all these very detailed steps happening inside this function call. But ideally, at the end of the day, I can see the function and say, "Okay, I've seen all the detailed steps; I know that this function is getting high scores." That's a much easier concept to remember, think about, reason about, and stick inside a human brain than trying to remember 12 detailed steps, 12 individual lines of code, all at once. In other words, all we're trying to do here is make sure we can see the forest for the trees. We don't need to focus on every individual tree, every individual branch on that tree, every individual pine needle. What we need to see is "that's a tree"; what we need to know is "this function gets high scores." The last tip I have in terms of mindset is that there are a couple of different ways to think about this that I, at least, find useful. The first one is to think about it in terms of code paths. So you might think: okay, the first step is that somebody makes an HTTP request to /puppy. That calls my get puppy method, which I have my routing set up to go to. Then get puppy uses my puppy manager object to get a random puppy, and inside get random puppy, we're actually running a query through our database accessor. This is one way that I find really helpful to think about these things. Again, drawing little diagrams as you're going along can be really useful.
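The /puppy code path described above could look something like the following sketch. This is not the speaker's actual code. The class names, the in-memory SQLite table, and the dictionary-based router are all assumptions made so the example runs on its own without a web framework. The point is that each layer can be held in your head as a one-line "chunk" once you have read it.

```python
import random
import sqlite3


class PuppyDatabaseAccessor:
    """Step 4: the only layer that talks SQL (hypothetical schema)."""

    def __init__(self):
        self.connection = sqlite3.connect(":memory:")
        self.connection.execute("CREATE TABLE puppies (name TEXT)")
        self.connection.executemany(
            "INSERT INTO puppies (name) VALUES (?)",
            [("Biscuit",), ("Waffles",), ("Pickle",)],
        )

    def fetch_all_names(self):
        rows = self.connection.execute("SELECT name FROM puppies ORDER BY name")
        return [name for (name,) in rows]


class PuppyManager:
    """Step 3: picks a puppy; knows nothing about HTTP or SQL."""

    def __init__(self, accessor, rng=None):
        self.accessor = accessor
        self.rng = rng or random.Random()

    def get_random_puppy(self):
        return self.rng.choice(self.accessor.fetch_all_names())


puppy_manager = PuppyManager(PuppyDatabaseAccessor())


def get_puppy():
    """Step 2: the handler that the route points at."""
    return {"puppy": puppy_manager.get_random_puppy()}


# Step 1: a stand-in for a web framework's routing table.
ROUTES = {"/puppy": get_puppy}


def handle_request(path):
    """Look up the handler for a path and call it."""
    handler = ROUTES[path]
    return handler()


if __name__ == "__main__":
    print(handle_request("/puppy"))
```

Reading it top-down as a code path: request to /puppy, then `get_puppy`, then `PuppyManager.get_random_puppy`, then the query in `PuppyDatabaseAccessor`.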

15:34
The other way to think about this is the way that data flows through the system. Different systems will make more or less sense to think about in one of these ways or the other. So when you're in a situation where you're not sure, what I would recommend is to try both and see which one helps you the most. When talking about data flows, I mean: maybe we think about the scores living in some database, and then that flows into a score data access object that happens to know information about scores. What's interesting to think about is which objects know which things at what point in the program, and how that data flows through the system. So maybe scores are known about by the score data access object, and then the score controller might use score data access objects. Then finally, that information might end up getting passed through to some sort of front-end client, a scoreboard.js, for example. This is a bit of a contrived example, but hopefully the idea is clear. So that's more or less the mindset. Actually, before I talk about process, there's one more thing on mindset that I forgot to make a slide for. Sorry, you'll just have to listen to my dulcet tones. One last mindset thing is: you might not be wrong. I know that I have struggled with this. Sometimes when I join a team, I assume that the team has gotten everything perfectly right, that they've thought through everything super well, and that any suggestions I might have are foolish in some way. I think that is a bad instinct to have, for a number of reasons. Firstly, no one is perfect, so the team might actually have just made a mistake. Being able to bring up, "Hey, is there a reason we're doing this in this way? I've been used to seeing it this other way," can be really helpful. It can help spur really good discussion and can also lead to new insights.
For instance, say you have some particular way that you really think is the best way to implement a singleton. You bring it up with your team, and they're like, "Oh, actually, we've been doing singletons this other way." Then at least one of you gets to learn something by the end of it. Either you learn that your way is not optimal, or they learn that your way is optimal. Either way, there's learning happening, and that's a good benefit. So don't think that you're always wrong about things. Definitely bring up your concerns and your thoughts, and try to learn stuff. It's very helpful.
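Going back to the data-flow view from earlier in this section (scores in a database, then a score data access object, then a score controller, then a front-end client), here is a small sketch of what "which object knows what, and when" can look like. It is a contrived illustration, like the example in the talk. The class and method names are assumptions, a plain list stands in for the database, and a Python function stands in for scoreboard.js.

```python
import json


class ScoreDataAccessObject:
    """Knows how scores are stored; here a list stands in for the database."""

    def __init__(self, stored_scores):
        self._stored_scores = stored_scores

    def load_scores(self):
        return list(self._stored_scores)


class ScoreController:
    """Knows which scores the scoreboard wants, not how they are stored."""

    def __init__(self, score_dao):
        self.score_dao = score_dao

    def scoreboard_payload(self, limit):
        scores = self.score_dao.load_scores()
        top = sorted(scores, key=lambda row: row["points"], reverse=True)[:limit]
        return json.dumps({"high_scores": top})


def render_scoreboard(payload):
    """Stand-in for the front-end client: it only ever sees the JSON."""
    data = json.loads(payload)
    return [f"{row['player']}: {row['points']}" for row in data["high_scores"]]


if __name__ == "__main__":
    dao = ScoreDataAccessObject(
        [
            {"player": "ada", "points": 90},
            {"player": "bob", "points": 70},
            {"player": "cy", "points": 95},
        ]
    )
    for line in render_scoreboard(ScoreController(dao).scoreboard_payload(limit=2)):
        print(line)
```

Following the data rather than the calls: raw rows exist only inside the data access object, the sorted and trimmed list exists only inside the controller, and the client receives nothing but the serialized payload.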

17:57
Okay, let's talk about process. The short version of this is that I think the scientific method is one of the most impressive achievements of the human species. Being able to understand how to gain knowledge about the world in a scientific way is incredible, and I think we would be remiss not to use that same process when we are reading and writing code. So when we get a ticket, this is roughly the process I think we should follow. The first step is to find out what code is relevant. If I'm trying to work on something about a scoreboard service, I probably don't need to worry about user authentication; that just isn't particularly relevant. So what we need to do is identify what the relevant code is. The next thing we need to do is form a hypothesis about what we need to change. In order to do this, we're going to have to understand the code we're looking at well enough to actually form this hypothesis. The next step is to test your hypothesis: make your change and see if it was correct. At this point, if there are any people in the YouTube comments who are really into test-driven design, I'm sure you're blasting off about how test-driven design is the coolest thing ever and how it's really useful. I completely get it. There's a school of thought that says that even before you make your change, you should write a little test that checks that what you think is going to happen is going to happen. I completely agree that this is a really useful tool for quickly understanding whether your hypothesis was correct, and even before you make your changes, it's nice to have that specifically written out. Still, at times it can be difficult to do this, so it's also completely acceptable, in my opinion, to just make your change and then manually verify whether it worked.
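As a small illustration of writing the hypothesis down as a test, here is a sketch built on the talk's scoreboard ticket (return 20 high scores instead of 10). The function name `get_high_scores`, the `DEFAULT_LIMIT` constant, and the idea that the fix is a single changed default are all assumptions for the sake of the example, not a description of any real service.

```python
import unittest

# Hypothesis: changing this default from 10 to 20 is the whole fix.
DEFAULT_LIMIT = 20


def get_high_scores(scores, limit=DEFAULT_LIMIT):
    """Return the highest scores first, at most `limit` of them."""
    return sorted(scores, reverse=True)[:limit]


class HighScoreHypothesisTest(unittest.TestCase):
    def test_scoreboard_returns_twenty_scores(self):
        scores = list(range(1, 51))  # fifty fake scores: 1 through 50
        top = get_high_scores(scores)
        self.assertEqual(len(top), 20)
        self.assertEqual(top[0], 50)   # best score comes first
        self.assertEqual(top[-1], 31)  # the twentieth-best score comes last
```

Saved in a file, this can be run with `python -m unittest`. With `DEFAULT_LIMIT` still at 10 the test fails, which is the point: the test states the expected outcome before the change is made, and turns green only if the hypothesis was right.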
Now, just because you've verified that it works doesn't mean that you're done. The last step is to improve the quality of whatever it is that you've just written. If you didn't write a test at first, writing tests now is super important. If you avoid writing tests for your code, the code that you write immediately becomes legacy code and is very difficult to maintain. What's going to happen if you don't write the tests is that later, something is going to change, and you're not going to know that things are broken, when you should have already known that from automated testing. Or you're going to need to refactor, and it's going to be a huge pain. Write tests, please. Other things, like improving the legibility and maintainability of your code, I think are really important to do at this stage. Because code is read far more often than it is written, make sure that you're optimizing for the reading case. I know that it's sometimes convenient to write code and leave variable names as i, or to have really short abbreviations for function names. I know that's very nice for the writing case. But code is generally written once, and then it's going to be read multiple times, potentially tens, hundreds, or thousands of times, by potentially many individuals. So you want to make sure you're optimizing for the reading case, making sure that the code you're writing is very easy to read and understand. That's what you should be doing in this last step. Okay, let's talk about tooling. There are a few different sets of tools that I think are useful. As far as why I think these tools are useful, let me say that developing software is generally the most difficult part of the job as a software engineer. At least in my experience, it is one of the most mentally taxing parts.
What that means is that any win we can get in terms of decreasing the mental load we have to be under when we are doing this is a huge win. It not only means that the job is just easier, which is a nice thing. Think about our level of skill and our ability to execute on something. Of course we can grow, and our max skill increases by practice over time. But at any given point, if we are maxed out, we are only capable of doing so much. One way that you can achieve more is by using tools that help you offload some of that cognition elsewhere. That's the way I try to think about tools. It's kind of like the old Steve Jobs quote: it's a bicycle for the mind. It's going to help your mind go faster and be more efficient. These tools are really great. They're not a replacement for thinking, but they can make the process far easier. You certainly could get by with just using grep and a text editor, and you could probably figure out everything. But it's going to be a lot harder and a lot less efficient than using some of these tools.

22:51
Um,

22:52
I kind of think about tools in a number of broad categories. The first category is for the step when we're trying to find what code is relevant. There are a few things we should do there. The first is to just run the code. If you have some sort of web application, just get it running and understand what exists. Go click on the page that you're supposed to add a button to, and try to understand: Where am I going to be adding this? Do I need to create a whole new page? Do I just need a button somewhere? Understanding the context you're working in is going to really help you when you're writing this code. Another tip, which didn't occur to me until far later than I care to admit, is that using the debugger is not just helpful when you've found a bug and you're trying to understand what's going on. It's also super helpful for completely working code, because you can run your program with the debugger enabled, start clicking through, and see what lines of code get executed to do various things in the system, which is invaluable. There is hardly any more efficient way to gain that understanding than just using the debugger and seeing what lines of code get run. Other than just running the code, I think searching in the project can also be a really helpful way to find relevant code. What I mean when I say searching the project is going to vary significantly from company to company. Some common tools that people use for this kind of stuff are Jira, Asana, and Pivotal Tracker, and GitHub and GitLab both have issues features. There are a lot of other tools that people use to track all this stuff, and it would be foolish not to at least try to find information in these things. A lot of times we are standing on the shoulders of giants in this field, and we are able to leverage some amount of past work in our current work.
What's useful to be able to do is understand what was already done before you got into this situation, so that you know what you can reuse. This will help you get further along than you would if you were just starting from scratch. I cannot tell you how often I have heard other engineers and developers say things like, "Oh, the only reason I was able to get this done so fast is because I saw that someone else in the company had done a project that was similar to this, and I was able to copy the config that they had for their Spark job. That enabled me to get my Spark job running a lot more easily and not run into all these weird errors and things like that." So searching the project is super helpful. One way that looks is to just literally open up Jira. I went to the Spark Jira, because it's an open source project and it's all available, so I'm allowed to show it to you. Say you're working on a ticket in Spark about date times: you can just search "date time" in here and see what comes up. Then you can change the way that things are ordered, understand what's happening in any given ticket, and give a glance over some of these things. Understanding what work has come before you can be super helpful. It's also helpful because sometimes you'll get a ticket, and it will be something somebody has already done, and you can sometimes find that out by doing this. This is what the GitHub issues UI looks like. This can be really helpful for understanding what problems people are having and what areas people are working in. In large companies, that can be particularly valuable to know. Sometimes you don't even know who to ask about something, and if you can figure out who has been working in this area, they can often help you along and guide you along the path.
The other broad category of tools that I would describe when we're talking about finding relevant code is code search tools. I use one called The Silver Searcher maybe every day; it's super useful. There's another tool that's very similar to it called ripgrep, which I haven't used but have heard is really good. Just to give you an idea of what that looks like: if I'm working on a ticket that's about phonemes, for instance, I might go into my project, type ag phoneme in the console, and see everywhere in the code that the word "phoneme" is used. Now I might know, "Oh, hey, look, we have something in src/app.py," and I can go in there and understand that this is where we're talking about phonemes. It's a really good way to just figure out where this thing is even being used or talked about in the codebase, and it can help you find those relevant services, modules, classes, et cetera.
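To show what a code search tool is doing for you conceptually, here is a minimal sketch in Python. It is not how The Silver Searcher or ripgrep are implemented, and it is nowhere near as fast or as featureful as either. The function name `search_code` and its parameters are made up for illustration, and matching without regard to case is a simplifying assumption.

```python
from pathlib import Path


def search_code(root, needle, suffixes=(".py",)):
    """Yield (path, line_number, line) for each line containing `needle`."""
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        for line_number, line in enumerate(text.splitlines(), start=1):
            if needle.lower() in line.lower():
                yield path, line_number, line.strip()


if __name__ == "__main__":
    # Roughly the question asked in the talk: where is "phoneme" mentioned?
    for path, line_number, line in search_code(".", "phoneme"):
        print(f"{path}:{line_number}: {line}")
```

The output has the same shape you would scan in a real search tool: a file path, a line number, and the matching line, which is usually enough to decide which module to open first.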

27:19
So these are really great tools if you're working on code that you have checked out locally, because they work on the code that's present on your hard drive. But they're not as great when you have a large number of projects that might be spread throughout the company, and you don't necessarily have all of it checked out onto your laptop. You might need to know, "Hey, I'm going to modify this thing in our service, but I want to make sure it doesn't break everybody else's." This is where these other tools can come in handy. Things like OpenGrok or Sourcegraph are really helpful, because they let you search the entire codebase of a certain organization, namely your organization. GitHub and GitLab both also have a feature where you can search code within your specific organization. You can also search issues within your specific organization, that kind of thing. This is what OpenGrok looks like; I found this screenshot on the internet. What you can see happening here is that this person has searched for "util," with anything before it and anything after it, as a definition. So this will look specifically for places where a class that has "util" in its name is defined. This person can now see that the search finds a file util, a form util, et cetera. Again, this is a really useful way to understand what code is relevant. When we talk about understanding code, which is the next step in the process, I would highly recommend using an IDE. Again, you can totally get away with just using a text editor and grep. You can even superpower yourself with those search tools, and that's also really helpful, because you can do things like, "Hey, this code is calling file util dot something. What is that meant to do?"
And you can, you know, pull up Silver Searcher and search for utils dot whatever, and find it. And that's really helpful. I think that's awesome. No qualms with that. But in my experience, using an IDE has been super helpful. I'm a big fan of JetBrains. I'm completely unaffiliated with them. But for instance, one thing that they have that's really nice is the ability to Command-click on things. This is a little recording I took just so I could show you what that looks like. So here, when you Command-click on something, and let me scroll back here just a second. So if I'm reading this get puns method, for instance, and I'm wondering, hey, this word to phonemes function, what does that do? I can hold down Command on my keyboard, which I think, you know, if you have a Windows computer, you'd hold down Ctrl, and click on it, and it will take you right to the definition of that file, or sorry, of that function, which is very helpful. And then once you get to that function, you can actually do the same thing again: hold down Command, click on the name, and it will show you usages of that. And so you can see, okay, this is used here, but it's also used in this other place. This is beyond useful if you're using some sort of common methods. If you're trying to find out, like, how do people instantiate this object, you can just go to the object and see where people are using it. You could read the documentation, or you could read the code, and the code is gonna be a lot faster. As you're doing this, reminder: create chunks. As we're going through this, we need to make sure that we are able to see the forest for the trees. We do not need to care about individual trees, we do not need to care about individual lines of code. What we need to be doing is understanding word to phonemes well enough that I can in my head say, okay, this method takes a word and turns it into a list of phonemes.
And that's enough for me to then use that function effectively. If you're trying to keep around all of the lines of code that you've read in your head, you are not going to have any success, it's gonna be very difficult to do this. As you're doing this, one way to help guide your thinking and sort of offload some of the work is to take notes, draw little diagrams. Again, this can be as simple as just getting a piece of paper out and drawing little stuff with a pen, that's completely legit. You don't have to, you know, break out Lucidchart and start doing UML diagrams or anything like that, you can just use a pencil and paper, it's completely fine. And I am consistently shocked at the amount of productive work that you can do when you get enough people around a whiteboard in a shared space. And that is one of the great tragedies of the pandemic, the lack of ability to talk through problems on a whiteboard. Whiteboards can be super helpful for this kind of thing as well. I personally really enjoy using a digital note taking app for this. Even if I've drawn a little diagram, sometimes I'll take a picture of it and save it into my Notes app, so that way I can have it later if I'm working on the same thing. Another thing that would be really useful as you're doing this: just in that same Notes app, write out what you're working on. This is a pro tip for those people who feel like they get interrupted by meetings a lot. If you are,

32:20
if you are, like, getting interrupted, having those notes is really helpful, because you can go back and look and say, this is what I was working on at that time. Finally, again, make sure you're asking for help if you run into trouble. And obviously try it for yourself first, spend 15 minutes trying to work through whatever problem you have. But if you don't know what it is after that, you should ask, because you're wasting your time at that point. And a tool that's really helpful for this is something called git blame. This is a screenshot of a git blame for a particular file of a particular project that I have used and liked. And what you can see here is, over on the right, we have, you know, source code, basically. And then on the left side, we have the information about who's been working on that particular line, which is helpful, because you can then know, hey, if I'm wondering about get var type, I need to go talk to romaine x. I don't know how to pronounce that name, sorry, romaine x, if you're in the audience, let me know. And finally, when you're working with libraries, these are very common, obviously. And they often have documentation that's pretty good. And I definitely recommend reading those if they're good, but sometimes they're not, and you have to use Stack Overflow. And that's completely legitimate. One of the more recent things that I have discovered is super helpful is being able to use GitHub search effectively. Because so many people use GitHub, and they allow you to search through all public code, you can learn quite a bit about how to use libraries just through that. So what that could look like: for instance, if I'm trying to set up a BigQuery job, I'm going to be using this QueryJobConfiguration object. And let's say I'm reading through the starter page on, like, the GCP docs. And it gives me a good starting point, I, you know, understand where I'm going with this.
But I need to know, are there certain, you know, parameters that are commonly used on a QueryJobConfiguration? That kind of thing is really helpful to know sometimes, and GitHub search is amazing for this. I'm a huge fan of it, I highly recommend it. Just go to GitHub, type in the thing and see what comes up. Really useful. Two pro tips on this. The first one: change the sort order. This is completely anecdotal, but I have found that sometimes adjusting it to recently indexed is, for some reason, getting the better results. I don't know why, completely anecdotal. And then also choosing what language is reliable, because you'll find language specific examples that help you sort of get a better hit rate in terms of what you're actually looking for. So this is that same search, but if I'm looking for recently indexed Java files that are using QueryJobConfiguration, I can search this and see, okay, these people have some sort of free tier billing service that uses QueryJobConfiguration. And now what I can do is click specifically on this line that says 250, and see, here's the, you know, QueryJobConfiguration builder that they're setting up, and I can see what options they're using. And that gives me a better understanding of how to set up that builder for myself, which is beyond useful. That was a lot. I'm gonna take a deep breath, I'm gonna drink some water, because I want to. Everyone calm yourself, and then we'll talk a little bit more.

35:41
Okay, there's a few takeaways. If you take away anything from this, here's the short and sweet. Make sure that when you're on a new team, you are working on software, make sure you are focusing on delivering valuable software. Reading without a goal is not going to help you understand software any faster. And the best way to learn is by doing. A really helpful thing to do is to focus early on providing value to your team. That could be creating an updated architecture diagram, and that could be just, you know, getting something working soon. That's a really nice thing to be able to do for your team. This is nice, not only because, like, that's what your job is, and that's what you're paid to do, but also you're able to build goodwill when you are showing your team, hey, I want to be a valuable member of this team, I want to support everybody, and, like, I'm here for you kind of thing. And people will be much more likely to reciprocate that feeling if you are leading on that. That's a good way to do it. Finally, make sure you are decreasing cognitive load as much as you possibly can. There's a lot of really good tools to do this. Anything from, you know, paper and pencil all the way through to some, you know, somewhat high tech searching software can be really helpful. Because this job is hard, no matter what people say, it's a hard job. And it's a lot of thinking. And so decreasing your cognitive load is a huge win in that case. Thank you so much for talking. I really appreciate it. These slides are available at is.gd/sdf20. That's SDF for Southern DevFest. My name is Samuel Taylor, I've loved getting to chat with you. Feel free to talk to me on Twitter if you have questions. I hope this is relevant broadly. But if you have any specific questions about, like, data stuff, that's what I do. I'm a machine learning engineer.
And so if you want to talk about machine learning, or AI or whatever, like, hit me up on Twitter, send me an email, I'd love to hear from you. One request, if I can make any requests from you: this is the first time I've ever given this talk, and I want to make sure that it's as helpful as it possibly can be. And if you'd be willing to send me an email that just says one thing you liked and one thing you didn't like about this talk, I would be immensely grateful for that. Thank you.

38:11
Okay, thank you for the amazing talk. loved all of the information. We're gonna take a couple of minutes to answer some questions, if that's okay with you.

38:20
I'd love that. Yeah, that sounds great. Sorry, I'm gonna be looking over here, because my camera's over here but my notes are over here, so I'll be able to see the questions over here.

38:28
All good. All good. So the first question we have comes from Vanessa fountain, how do you approach a situation where you bring up a new way to do something and the other team members do not want to move forward with a newer solution, due to not having familiarity with new tech?

38:42
Oh, Vanessa, if only I knew the answer to this question.

38:45
Um,

38:46
so there are a few things that I would recommend here. I think it's a really difficult spot that you're put in when you're in this situation. Because I think there is a lot of validity to using sort of stable, well established technology and not sort of chasing after the shiny new thing. But I also think there can be a lot of value in using some new tool that enables you to do something that you didn't realize you could. One thing that I have seen people do in this kind of a situation is try to build some sort of small prototype that demonstrates why this new technology might be valuable. And then that gives you a more concrete thing to talk about. Sometimes these conversations are way too abstract, and people struggle with that, or at least I struggle with it. And when we have a concrete example of, like, okay, here's what, you know, this new framework allows us to do, we can see, oh, that's actually super nice. Or we can say, actually, this isn't as big of a win as I thought it was gonna be.

39:47
that'll work or let me scroll up and down here on the comments section to make sure that we don't have anything else. I don't believe we do. We do have plenty of positive feedback though that was great Samuel photoreceptor says that Samuel Taylor first time didn't seem like it great. Alex, door lag Great job, Samuel. So Samuel, I want to thank you so much for your time here to get this information how to us all of these tips have been amazing. Would you do me a favor and posts that very last up curious. Let me post that up there. I'll throw this up there.

Univariate k-Nearest Comparison (Trustworthy Models)

Samuel Taylor — Thu, 01 Oct 2020 11:25:55 GMT

We often fly blind in the world of machine learning. Our model outputs an estimate for revenue from clicking on a certain ad, or the amount of time until a new edition of a book comes out, or how long it will take to drive a certain route. Typically these estimates are in the form of a single number with implicitly high confidence. Trusting these models can be foolish! Perhaps our models are high-tech con men -- Frank Abagnale reborn in the form of a deep neural net.

If we want to call ourselves "Data Scientists", perhaps it is time to behave like scientists do in other fields.

I didn't study a natural science (like astronomy or biology), but I did take a few physics classes with labs. One such lab required us to find the acceleration due to gravity by dropping a metal ball from a variety of heights. During the experiment we were careful to record uncertainties in our measurements, and we propagated uncertainty through to our final estimate of that acceleration. If I had turned in a lab report claiming g = 9.12 m/s^2 (without any uncertainty estimate), I would have lost points.

Good scientific measurements come with uncertainty. A ruler or measuring tape is only so precise. When it comes to machine learning, though, this focus on uncertainty disappears.

For example, see this common formulation of the learning problem from the book Learning from Data [0] (no shade -- I love this book):

There is a target to be learned. It is unknown to us. We have examples generated by the target. The learning algorithm uses these examples to look for a hypothesis that approximates the target.

This unknown target function they call f: X -> Y (where X is the input/feature space and Y is the output/target space). The hypothesis approximating the target they denote as g: X -> Y (and, if our learning algorithm is successful, we can say g ≈ f).

In the case of regression, the output of our learning algorithm is a function which produces a continuous-valued output. But this output is a point estimate. It has no sense of uncertainty [1]!

Others have of course noted the importance of uncertainty estimation before me. One such person is José Hernández-Orallo (a professor at Polytechnic University of Valencia), whose paper Probabilistic reframing for cost-sensitive regression I found while researching class imbalance. I would be misrepresenting his work to claim this paper is solely about uncertainty/reliability estimation, but he describes some neat ideas worth exploring.

Rather than finding a function g: X -> Y approximating the true underlying function f, we could instead seek to find a probability density function h(y | x). In other words, a function to which we can still pass some features (x) but one describing a distribution instead of a point estimate. Because we are estimating a probability density function conditioned on the input features, this idea is called conditional density estimation.

To hear Hernández-Orallo tell it, many methods for conditional density estimation are suboptimal. The mean of the distributions they output is typically worse than a point estimate would have been. They are often slow. And in many cases, the distributions don't end up being multi-modal anyway. Thus, the paper asserts we can get by with a method to provide a normal (Gaussian) density function for most cases.

Normal distributions are parametrized by a mean and a standard deviation. Taking a point estimate (from any regression model) as the mean, we still need to determine the standard deviation. The paper describes a few different approaches for doing this (and is worth reading if you have the time). For the sake of this post, I'll focus on a technique called "univariate k-nearest comparison". A simple Python implementation follows:

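import numpy as np

def univariate_knearest_comparison(model, dataset, test_point, k=3):
    # Predictions and true weights for every example in the reference set.
    all_preds = model.predict(dataset)
    Q = zip(all_preds, dataset["Weight"].values)
    # Point estimate for the new example (predict returns a length-1 array).
    prediction = model.predict(test_point)[0]
    # The k reference examples whose predictions are closest to this estimate.
    neighbors = sorted(
        Q,
        key=lambda t: np.abs(t[0] - prediction),
    )[:k]
    # Mean squared gap between the new estimate and the neighbors' true values.
    variance_estimate = (1 / k) * sum(
        (prediction - y_true) ** 2 for _, y_true in neighbors
    )
    return prediction, variance_estimate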

Hernández-Orallo describes this procedure as looking "for the closest estimations in the training set to the estimation for example x", then comparing "their true values with the estimation for x".
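
Written out as a formula, the function above computes the following. The notation N_k(x) is mine, not the paper's: it stands for the k reference examples whose predictions g(x_i) are closest to g(x). This is a restatement of the code under that assumption, not a quotation of the paper's definition.

```latex
\hat{\sigma}^2(x) = \frac{1}{k} \sum_{i \in N_k(x)} \bigl( g(x) - y_i \bigr)^2,
\qquad
h(y \mid x) = \mathcal{N}\bigl( y;\; g(x),\; \hat{\sigma}^2(x) \bigr)
```

Here g(x) is the model's point estimate for the new example and y_i is the true target value of neighbor i.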

This technique is cool because it can be applied to any regression model. By using the training set (or a validation set) in this clever way, we can enrich any model with the ability to estimate uncertainty (thus gaining the second aspect of what we've been calling trustworthy models).

If you've not read previous posts in this series, we've been working with a dataset about fish. From their dimensions, we're trying to predict their weight. It's totally a toy/unrealistic problem, but it's pedagogically useful.

We'll start by training a linear model.

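import sklearn.linear_model as lm
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# `fish` is the DataFrame used in the earlier posts of this series.
ct = ColumnTransformer([
    ('scale', StandardScaler(), ['Length1', 'Length2', 'Length3', 'Height', 'Width']),
    ('ohe', OneHotEncoder(), ['Species']),
])

pipe = make_pipeline(ct, lm.LinearRegression())
pipe.fit(fish, fish['Weight'])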

Then, we'll run the univariate k-nearest comparison function from above.

import pandas as pd

new_fish = pd.DataFrame(
    [
        {
            "Species": "Bream",
            # Placeholder: the pipeline only reads the feature columns.
            "Weight": -1,
            "Length1": 31.3,
            "Length2": 34,
            "Length3": 39.5,
            "Height": 15.1285,
            "Width": 5.5695,
        }
    ]
)

pred, var = univariate_knearest_comparison(pipe, fish, new_fish)
# (646.1153309725989, 4896.726887004621)

With our prediction and variance estimate in hand, we can draw a normal distribution.

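import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as st

def plot_normal(pred, var, color='coral', do_lims=True):
    # Cover 3.8 standard deviations on either side of the prediction.
    std = np.sqrt(var)
    width = std * 3.8
    x_min, x_max = pred - width, pred + width

    x = np.linspace(x_min, x_max, 100)

    # Normal density centred on the point estimate.
    y = st.norm.pdf(x, pred, std)

    plt.plot(x, y, color=color)
    if do_lims:
        plt.xlim(x_min, x_max)

        # Pin the bottom of the y-axis at zero, keeping the automatic top.
        _, yhi = plt.ylim()
        plt.ylim(0, yhi)

        plt.title('Conditional density of fish weight given features')

plot_normal(pred, var)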
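
Besides plotting, the same two numbers can be turned into an interval. The sketch below is my own addition, not something from the paper. It assumes the normal approximation is reasonable, and it uses the standard library's statistics.NormalDist instead of scipy only to stay dependency-free. The helper name prediction_interval is hypothetical.

```python
from statistics import NormalDist


def prediction_interval(pred, var, coverage=0.95):
    """Central interval of a normal with mean `pred` and variance `var`."""
    dist = NormalDist(mu=pred, sigma=var ** 0.5)
    tail = (1 - coverage) / 2
    # inv_cdf maps a cumulative probability back to a value of the variable.
    return dist.inv_cdf(tail), dist.inv_cdf(1 - tail)


# Prediction and variance estimate reported for the new fish above.
low, high = prediction_interval(646.1153309725989, 4896.726887004621)
print(f"95% interval: {low:.0f} to {high:.0f}")
```

An interval like this is only as good as the variance estimate behind it, which is why the quality of these techniques still needs evaluating.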

Here's a little graph with conditional density estimates for several different fish on it.

This is a strong step in the direction of being more scientific in our modeling efforts. We've examined a few methods for uncertainty estimation in this series, and we'll evaluate the quality of these techniques at a later date.

If you found this interesting, consider following me on Twitter.

Footnotes

[0] Abu-Mostafa, Y. S., Magdon-Ismail, M., & Lin, H. (2012). Learning from data: A short course. United States: AMLBook.com.
[1] I recognize that I'm equivocating on the word "uncertainty" to some extent. Still, I think this is a useful idea even if only as an analogy.

new codebase, who dis? (How to Join a Team and Learn a Codebase)

Samuel Taylor — Sun, 06 Sep 2020 21:06:42 GMT

I have switched teams more often than I have had to implement an AVL tree, and you can guess which one of those two was taught in school. I wish someone had taught me how to join a new team! While learning a new codebase can be daunting, I've found a few things that work for me.

You should do at least three things when joining a new team. The order of these three can be whatever you like, but all three should be done as soon as reasonably possible.

First, you’ll likely set up the development environment. As you do this, pay attention to just what it is that you're setting up. For instance, if you need to get Redis running locally, then that's a good hint that there's some caching happening somewhere. Noting the order in which you run internal projects helps you understand dependencies. If the feature store needs to be running before you bring up the model serving service, that's a hint that the model serving service may depend upon the feature store. Such dependencies start to hint at the overall architecture.

Take notes on the exact commands you’re running and packages you’re installing. You’re bound to run into something that’s changed since the setup docs were written, and being able to correct them is a quick win you can provide to the new team. Plus, it's good to know exactly how you ruined your system installation of Python.

Ideally, the code you're working on should have some sort of automated test suite in place. A good way to start experimenting with and understanding the code is to get that test suite successfully running, then make changes to the codebase completely at random and see what breaks.

The second thing you should do is get some overview of the architecture. Some teams will have a document describing this, and if that document is an accurate depiction of reality then you should certainly work to understand it. In any case, asking a more senior person on the team to give you an overview is a good idea. They should know how up-to-date that document is (if it does exist) and also be able to describe and/or draw the architecture for you. Here are some sample questions you can consider asking:

What repositories (or portions of repositories) do we own and/or work on most frequently? What does each of them do?
Where does our code run? (e.g. EC2 Instances, Google's Kubernetes Engine, on prem)
What does the deployment pipeline look like? How does a feature get from my laptop to live in production?
Do we have certain services, packages, classes, or files that are a real headache? Areas that are particularly unreliable or error-prone?
Are there external APIs, vendors, or products that we use or rely on? (e.g. SendGrid, DataDog, MySQL)

Similar to environment setup, an easy thing you can do to help the team is to document that architectural overview (or update the existing document). Write down what you learned, take a picture of the diagrams that were drawn, and post that information somewhere visible to the team. Be sure to put an "as of" date on your changes. Even stable projects exhibit some change over a long enough time period, so this date will help future readers know if they can trust this document.

A third thing you should do when starting on a new team is start understanding the business. If you're new to the company, figure out its mission, product offering(s), and goal(s). Then work to understand how your team fits into those things. Some sample questions include:

How can our team make an impact on the company's goals?
If our code were to break horrifically, who would get angry? How fast would that happen?
What other teams do we have the most interaction with? What services/codebases do they own? Do we share parts of our codebase with other teams?

Without an understanding of the team's place in the company, you're doomed. You won't have sufficient context to execute your work well.

Mindset

I strongly believe that learning a new codebase happens best through implementing real features (even if they are small to start with). The whole point of being on this team as an individual contributor is to build stuff, and there is no better way to learn how to do something than by spending quality time doing that exact thing. As you build skill and understanding, you can work on larger and larger projects over time.

Implementing something will require you to read the code. But "read" may be a misleading word here, because reading code is dramatically different from reading a novel. Code is typically organized with more related code being closer together (in the same directory, package, class, or file). Can you imagine a novel written in this way? If Tolkien had placed all scenes of two characters fighting each other in adjacent pages, while all scenes with magic in them occurred in a separate book? How absurd!

Though learning to code taught me the basics of reading code, nobody ever taught me how to read a large codebase. To do so, we must adopt a certain mindset. Balance understanding each intricate detail against making impact quickly. Quick impact helps establish your reputation on the team and gets you to that accurate/intricate understanding faster than trying to read everything up front.

The rule of thumb I use is to understand something just enough to express what it does without necessarily knowing exactly how it does that. This process is called "chunking," and it relies on the fact that once you have a basic understanding of a unit of code, "you don't need to remember all the little underlying details" (Oakley). If you're worried about not understanding everything in minute detail, don't be afraid to take a note to come back and understand that chunk more fully.

This understanding will grow recursively: first, you understand what the various services do. Then, you identify the particular service you need to modify and start to understand the various modules within that service. In the modules you modify, you'll start to understand the classes contained. The base case of this recursive process is the individual line.

Keep in mind that different teams may implement the same concept or pattern in different ways. Understanding why your current team chose the way they did is another way new teammates can help the team. It's totally possible that your new team hasn't heard of the cool way to implement singletons that you like. It's equally possible that your way is worse in some way you didn't know. Either way, someone gets to learn something!

The last mindset recommendation I'll give before we dive into the process is to try to understand the code both in terms of code paths and data flows. Think about which objects know what information and how that information flows between parts of the system.

Process

I recommend this process for working in any codebase:

Locate the portion of code most relevant to the immediate task at hand.
Understand that code enough to form a hypothesis about the change you need to make.
Make that change and test your hypothesis. Sometimes the best way will be to click around in the UI or run a particular script. Sometimes the easiest path is to write a test that describes the behavior you're after.
If your hypothesis was incorrect, return to step 2. Understand why that change didn't do what you thought it would, and develop a new hypothesis.
Once you have working code, improve its quality. Write a test (or a few) documenting the changes in behavior you made. Refactor your code for clarity and style.

This scientific approach guides us gradually toward correct, high quality code without having to understand each and every bit of code around our change.
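
As an illustration of step 3, a test that "describes the behavior you're after" might look like the sketch below. Everything in it is hypothetical: slugify stands in for whatever function you are changing, and the expected strings stand in for the behavior your ticket asks for.

```python
import re


def slugify(title):
    """Hypothetical function under change: turn a title into a URL slug."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return "-".join(words)


def test_slugify_drops_punctuation():
    # Hypothesis: punctuation should never reach the slug.
    assert slugify("Hello, World!") == "hello-world"


def test_slugify_collapses_whitespace():
    # Hypothesis: runs of spaces become a single separator.
    assert slugify("  new   codebase  ") == "new-codebase"
```

If a test like this fails, that is step 4: the hypothesis was wrong, and the failure message is your first clue about why. A runner such as pytest would pick up the test_ functions; calling them directly works as well.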

Tools

While you could certainly get by with just a text editor and some patience, a wide variety of tools exist that help us read code more effectively throughout the process identified above.

Identifying relevant code

While step one gets easier over time as we build familiarity with some portion of code, we often begin step one completely lost. A few approaches are helpful here: running the code, project search, and code search.

Running the code helps you understand it. Before you start changing things, understand what already exists. This could mean reproducing a bug locally, finding the place in the UI where the new feature will go, or any number of other things. When you do, stepping through the execution in a debugger will give you a strong start on understanding what is going on.

By "project search," I mean searching artifacts created as part of the software development lifecycle. Particularly useful are issue trackers like JIRA/Asana/Pivotal Tracker, pull requests and issues in tools like GitHub and GitLab, and the git history itself. Because few tasks are truly novel, we can often gain understanding by looking for similar past work. Try several different keywords. Sometimes you'll find a pull request that implements something very similar to what you want to do, and you can use that as a guide. Trying to divine something from scratch, while sometimes necessary, requires significantly more effort than adapting from an example.
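
For the git history part of project search, one option is to ask git for commits whose messages mention a keyword. The sketch below wraps git log in Python. The function names are my own, and it assumes git is installed and that repo_dir points at a checked-out repository.

```python
import subprocess


def git_log_search_command(keyword, ignore_case=True):
    """Build a `git log` command listing commits whose message matches keyword."""
    cmd = ["git", "log", "--all", "--oneline", f"--grep={keyword}"]
    if ignore_case:
        cmd.append("--regexp-ignore-case")
    return cmd


def search_history(keyword, repo_dir="."):
    """Run the search and return one 'hash subject' string per matching commit."""
    result = subprocess.run(
        git_log_search_command(keyword),
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()
```

Calling search_history("phoneme") inside a repository returns the matching commits, each of which you can then inspect with git show.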

Code search is just what it sounds like. For code that you have checked out locally, I highly recommend using a tool specifically built for recursive search like ack, Silver Searcher (ag), or ripgrep. But you won't always have every bit of code at the company checked out locally, and sometimes it's useful to be able to search exhaustively. For this use case, tools like OpenGrok or Sourcegraph are super helpful. GitHub and GitLab also offer ways to search all code within a specific organization.

No matter which tool you're using, try several keywords you think might be relevant. Consider changing case sensitivity. You may have better results filtering down to specific file types.
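
To make the idea concrete, here is a rough pure-Python stand-in for a recursive code search, with the case-sensitivity and file-type options mentioned above. It is a sketch for illustration only: it is not how ack, ag, or ripgrep are implemented, and the function name search_code is my own.

```python
import os
import re


def search_code(root, keyword, extensions=None, ignore_case=True):
    """Yield (path, line_number, line) for each line under root containing keyword."""
    flags = re.IGNORECASE if ignore_case else 0
    pattern = re.compile(re.escape(keyword), flags)
    for dirpath, dirnames, filenames in os.walk(root):
        # Do not descend into version-control internals.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in sorted(filenames):
            # Optional file-type filter, e.g. extensions=[".py"].
            if extensions and not name.endswith(tuple(extensions)):
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8") as handle:
                    for number, line in enumerate(handle, start=1):
                        if pattern.search(line):
                            yield path, number, line.rstrip("\n")
            except (UnicodeDecodeError, OSError):
                continue  # Skip binary or unreadable files.
```

For example, list(search_code(".", "phoneme", extensions=[".py"])) lists every Python line mentioning the keyword. The dedicated tools are much faster and handle ignore files, so treat this only as a picture of what they do.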

Understanding code

Using these various search tools, we arrive at a set of relevant locations. Thus, we arrive at step two of our process: understanding the code just well enough to form a hypothesis about the necessary change. The search tools we've already discussed are helpful to this end (if you come across usage of an unfamiliar class, search for it and read what you find).

One other tool that is incredibly useful is a good IDE. I like JetBrains' products (I have no affiliation with them), though I'm sure similar functionality exists in competing products. JetBrains IDEs can help you navigate code much more efficiently by linking you straight through to the definition of a function or class. By default on Macs, hold down Cmd and hover over the function or class name, then click. Being able to immediately jump to the definition is a complete game changer.

Another super-useful JetBrains keyboard shortcut is (by default) tapping shift twice. This brings up a search bar that can find just about anything (classes, functions, file names).

As you read code, always try to decrease your cognitive load. Remember to create "chunks", mental boxes inside of which you don't need to remember all the details. Consider taking notes, writing down file names and line numbers, drawing little diagrams. Reading and writing code is the most cognitively demanding part of the job, so take any chance you can get to make it easier for yourself.

You may get stuck or lost during this process. It is OK to ask for help. Use git blame to see who has been working on some bit of code you find confusing, and ask them about it. You can also use git blame to find relevant pull requests or JIRA tickets that might help you gain context.
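
If you want a quick summary of who to ask, git blame's machine-readable output can be tallied per author. The sketch below is one way to do that. The function names are my own, and it assumes git is installed and the file is tracked in the repository at repo_dir.

```python
import subprocess
from collections import Counter


def count_authors(porcelain_output):
    """Count blamed lines per author in `git blame --line-porcelain` output."""
    authors = Counter()
    for line in porcelain_output.splitlines():
        # Header lines look like "author Jane Doe". Source lines start with
        # a tab, so code that happens to contain "author " is not counted.
        if line.startswith("author "):
            authors[line[len("author "):]] += 1
    return authors


def blame_authors(path, repo_dir="."):
    """Run git blame on one file and return a Counter of author -> line count."""
    result = subprocess.run(
        ["git", "blame", "--line-porcelain", path],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return count_authors(result.stdout)
```

blame_authors("some/file.py").most_common(3) then gives the three people who last touched the most lines. They are a starting point for a conversation, not necessarily the people who understand the code best.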

Working with libraries

Sometimes as part of step three, we will need to work with an external library. In an ideal world, all libraries have excellent documentation that helps you understand the key abstractions and be productive quickly. Alas, we do not live in an ideal world! Many projects do have good documentation. But others may be more easily learned through the broader community. Consider searching the web with a tool like DuckDuckGo or Google. See if there are examples on StackOverflow.

A recent lightbulb moment for me was realizing that GitHub allows users to search all public code. Consequently, we can find realistic examples of people using libraries and APIs that we care about. Try searching for the particular method name you're trying to use. Or search for the name of the package, then search within individual repositories that come up. Consider filtering to just the language you care about.

Anecdotally I have found that sorting GitHub search by "recently indexed" gives me more diverse, more helpful results than the default search (which largely gives me the same copy-pasted examples over and over again). If you're unhappy with your results, do try different sort orders.

Parting words

Not only do we learn faster when we orient that learning around real tickets, but we simultaneously make an impact and start building reputation on the team. By taking advantage of prior work (and using good tools to find that work) we can accelerate our learning and our impact. Know that while joining a new team is non-trivial, it doesn't have to be hard! Use the scientific method. Follow these practices. Take a look at these tools. You'll gain confidence in your abilities and make a good first impression while you're at it.

If you found this interesting, consider following me on Twitter. Thanks to my friend Benjamin Cody for providing feedback on this post.

Update 2021-01-16: @Coding_Career on Twitter made an awesome "cheat sheet" from this post available here.

Citations:

Oakley, Barbara A. A Mind for Numbers: How to Excel at Math and Science (Even If You Flunked Algebra). Jeremy P. Tarcher/Penguin, 2014.

Lightweight testing for maintainable data science

Samuel Taylor — Thu, 20 Aug 2020 00:45:51 GMT

When I began working in analytics, one of the most miserable types of tasks I ended up doing was re-running an old Jupyter notebook. Often it failed part way through with some inscrutable error. Figuring out what was going on was challenging; how am I supposed to remember this particular notebook from five months ago? What's more, the underlying data sometimes stops getting updated, or a column name changes, or the date format in a particular field switches. You may have had similarly frustrating experiences. The good news is that simple techniques from the field of software engineering can dramatically improve this experience.

As you may have guessed from the title of this article, I'm a big fan of testing. It's easier than you realize, and it'll save you a ton of headaches. For our purposes today, let's consider a machine learning project that consists of three phases: first, exploratory data analysis and prototyping. Second, model training. And third, running in production. All three of these phases can benefit from testing.

One: EDA and prototyping

When exploring the data, we learn a significant amount of information. Here are some examples of questions we might answer:

How many columns are there in this dataset?
What are the names of each column?
What data type does each column contain?
For string columns, how many unique values exist?
For numeric columns, what range does the data fall into?

Too often, we keep the answers to these questions in our head alone. This fact is part of what makes it difficult to go back to an old notebook; these answers have fallen out of our short- and long-term memory by the time we return to the notebook. Fortunately for us, computers have excellent memories! We could, of course, write down each of the answers to these questions directly in our Jupyter notebook, which will help us when we return to it. Still better, though, is expressing the answers to these questions as executable code -- as tests.

When doing initial analysis, I find it cumbersome to even think about running a testing framework inside my notebook. Fortunately, we can get by without one: Python includes the assert keyword, which will do just fine. For example, we might encode the knowledge that our DataFrame should have 8 columns thusly:

assert df.shape[1] == 8

This is an improvement over a comment or markdown cell that simply states "DataFrame should have 8 columns" because the computer will actually check this for us each time the notebook is run. And if that condition is not met, we will see an error:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-13-ed79b70114d8> in <module>
----> 1 assert df.shape[1] == 8

AssertionError:

In this case, this may be an acceptable error. We can read the condition that was asserted and back into the conclusion that our DataFrame should have eight columns. But if we're feeling quite charitable toward our future self, we can add a message:

assert df.shape[1] == 8, "Expected 8 columns"

which, assuming the condition is not true, will result in this error:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-14-18deb3201a98> in <module>
----> 1 assert df.shape[1] == 8, "Expected 8 columns"

AssertionError: Expected 8 columns

Writing an assert statement is a cheap insurance policy against unexpected changes. I highly recommend making assertions about the shape of your dataset, the sparsity of certain columns (assert df['a'].notnull().mean() > 0.9), the existence of particularly important columns (assert 'age' in df), and the range of numeric columns (assert (df['age'] < 0).sum() == 0). As a general rule, if you're making an assumption in your code, consider whether you can express that assumption as an assert statement.

Two: training script

A common pattern I've seen in machine learning work is to take a Jupyter notebook that contains code to train a model and turn it into a Python script (which is more easily run/monitored in certain environments). To do this, I recommend taking chunks of the notebook which do a discrete unit of work and turning them into standalone functions that the notebook then uses. Specifically, create a .py script in the same directory as the notebook (say, helpers.py), define a new function, and copy the code from the notebook into that function. Then, import the function (for example, from helpers import age_range_to_midpoint), delete the code you pasted into the script, and use the function instead.

As an example, suppose our data encodes age as a range ("0-25", "25-40", "40-100"), and we have decided that we want to represent this to our model with the midpoint of the range. Our helpers.py script might contain the following:

def age_range_to_midpoint(age_range):
    endpoints = [e.strip() for e in age_range.split('-')]
    return sum(int(e) for e in endpoints) / len(endpoints)

At this point, I believe it's worth it to use a testing framework. Python has one built in, but I love using pytest. As we create functions, we can add tests by defining a function (or functions) whose name(s) begin with "test_":

def test_age_range():
    assert age_range_to_midpoint('20-30') == 25
    assert age_range_to_midpoint('0-31') == 15.5

Just like the asserts we created during EDA encode information about our data, these tests encode information about how our code works. By the end of this process, we have a nice file of functions and a notebook which largely runs those functions in a certain order. Turning this notebook into a Python script is now simple, as the complex logic is already present in our helper file.

We can run our tests with a simple command:

% pytest helpers.py
================================== test session starts ===================================
platform darwin -- Python 3.8.5, pytest-6.0.1, py-1.9.0, pluggy-0.13.1
rootdir: /my/cool/project
collected 1 item

helpers.py .                                                                       [100%]

=================================== 1 passed in 0.00s ====================================

These tests enable us to make changes to our code more confidently. We can run these tests ourselves after changes we've made to make sure we haven't broken anything. Ideally, we can set up some sort of automated process that runs these tests as commits are made (both GitLab and GitHub offer tools that do this).

Further, these tests serve as executable documentation. While it is easy for comments to go stale, tests remain an accurate description of what a function does. If I introduce a change to the way a function works, I must also edit the tests (or else they will fail, and I will be sad). In this way, tests are a far more reliable and accurate kind of documentation than comments.
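
As a sketch of what "executable documentation" can look like, the tests below record two behaviors of age_range_to_midpoint that a comment could easily get wrong: whitespace around the dash is tolerated, and a value that is not a range of integers raises a ValueError. The test names and inputs are made up for illustration.

```python
def age_range_to_midpoint(age_range):
    endpoints = [e.strip() for e in age_range.split('-')]
    return sum(int(e) for e in endpoints) / len(endpoints)


def test_whitespace_is_ignored():
    # Each endpoint is stripped before conversion, so padding is harmless.
    assert age_range_to_midpoint(' 20 - 30 ') == 25


def test_non_numeric_input_raises():
    try:
        age_range_to_midpoint('unknown')
    except ValueError:
        pass  # int('unknown') fails, so bad data surfaces immediately
    else:
        raise AssertionError('expected a ValueError for non-numeric input')
```

With pytest installed, pytest.raises(ValueError) is the more idiomatic way to write the second test; the try/except version is used here only so the sketch has no dependencies.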

Three: production

While a thorough treatment of putting a model in production is outside the scope of this article, testing is certainly a part of it. In his book Building Machine Learning Powered Applications, Emmanuel Ameisen coins the term "check" to describe a test that runs in the production prediction pipeline (rather than in a CI/CD pipeline). The same kinds of common sense assert statements you wrote in your Jupyter notebook are also helpful sanity checks in a prediction pipeline.

You should write checks for both inputs and outputs of your model. Is someone passing in a negative value for the age of a human being? Is our model predicting that a car will have a fuel efficiency of over 9,000 miles per gallon? Both of these cases seem unexpected! Depending on the business requirements, we may take a variety of actions. For instance, if our model is predicting a huge value for miles per gallon, we might refuse to make a prediction:

y_pred = model.predict(X)
if y_pred < 0 or y_pred > 100:
    raise PredictionError('Problem predicting mpg for this car')

In other cases, we may be able to use a heuristic:

if y_pred < 0 or y_pred > 100:
    return 30

Sometimes, we may be able to swap in a simpler model if it's available and more robust. Or we can replace nonsensical feature values with nulls, or impute a value. There are many options here, and you should be careful about choosing the right one for your use case. A well-written check prevents a certain class of bug from becoming an issue, thereby improving the robustness of the system overall.
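
Putting the pieces together, here is a minimal sketch of input and output checks wrapped around a prediction call. Everything in it is an assumption for illustration: the PredictionError class, the helper names check_age and checked_mpg, the 130-year age cutoff, and the stand-in broken_model are not part of any library.

```python
class PredictionError(Exception):
    """Raised when an input or output fails a sanity check."""


FALLBACK_MPG = 30  # heuristic default, mirroring the example above


def check_age(age):
    """Input check: reject ages no human could have (130 is an assumed cutoff)."""
    if age < 0 or age > 130:
        raise PredictionError(f'Implausible age: {age}')
    return age


def checked_mpg(predict, features, use_fallback=False):
    """Output check: call predict, then validate the result before returning it."""
    y_pred = predict(features)
    if 0 <= y_pred <= 100:
        return y_pred
    if use_fallback:
        return FALLBACK_MPG
    raise PredictionError('Problem predicting mpg for this car')


def broken_model(features):
    """Hypothetical misbehaving model that always predicts over 9,000 mpg."""
    return 9001.0


print(checked_mpg(broken_model, {'cylinders': 4}, use_fallback=True))
```

Whether to raise or fall back is a business decision, which is why the sketch makes it an explicit argument rather than hiding it inside the check.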

Go forth and test

Keep in mind how you can introduce testing throughout your process. Whether it's a quick assert statement in a Jupyter notebook, a unit test in a Python script, or a check that runs in production, well-written tests are a gift to your future self and your team. Tests make code less error-prone, easier to debug, and less vulnerable to decay.

APChemSolutions review - do not buy

Samuel Taylor — Thu, 13 Aug 2020 02:15:31 GMT

Author's note: this is not what I usually write about. If you're an educator considering purchasing a product from apchemsolutions.com, please read on! Otherwise, don't worry about it.

Executive summary

An educator I know recently contacted me for some IT help with a product they had purchased from apchemsolutions.com (AKA AP Chem Solutions). In helping this person, I was appalled by the level of contempt the creator(s) of this product have for their users. I strongly recommend against purchasing anything from AP Chem Solutions (which uses the domain apchemsolutions.com).

More detail

I completely understand that companies have a right to protect their intellectual property. However, AP Chem Solutions chooses to use DRM (digital rights management) software which harms its customers. Here are a few reasons I think their product is bad:

1. Jumping through hoops

The educator who reached out to me received an email with instructions on how to open the PDF files in the provided ZIP file. First, the user must have Adobe Reader installed. Then, there is a set of six to seven items which describe how to install a certificate on your computer. Then, there are a further five steps to get Adobe Reader set up to actually read the PDFs.

I have a degree in computer science, and I found these steps to be frustrating and annoying. It is no wonder that this person needed help! While I have immense respect for teachers, the number of hoops this company expects them to jump through is beyond anything I would expect a teacher at any level to accomplish on their own.

2. Administrative access

Which brings us to the next point: admin access. To import this certificate in the first place, the customer is expected to enter the admin password for the computer. This might be all fine and dandy, except that it is rare for teachers to have admin access to the computers provided to them by the school.

You know how much of a pain it is to interact with your IT department? Well, if you want to use this product, you're going to have to bug IT. And they might not want to help you with this. It's a very real possibility that your request will get denied and you will be completely unable to use the product you purchased from AP Chem Solutions.

3. Small annoyances

All of my other concerns are more important than any of the minor things I'm listing here, but I wanted to list them:

Instructions are at times unclear or poorly written. I can easily see someone getting confused trying to follow them.
Instructions are at times factually inaccurate (e.g. you don't need to set your default PDF reader to Adobe Reader)
You cannot use these materials on operating systems other than Windows or macOS (because Adobe Reader isn't available on other platforms)
They require you to disable some security features of Adobe Reader. I am not a cybersecurity buff, but I imagine this exposes your computer to additional risk of viruses and/or malware.

4. Printing restrictions

Finally, the reason this company has had you jump through all these hoops in the first place: to keep you from printing or redistributing certain portions of the materials they have provided you. Would you like to print a copy of the slides for your students or a sub? You can't!

And forget about sharing the PDFs digitally; anyone you send them to will have to go through the same setup process you did. Good luck explaining that to them.

An alternative

Rather than use restrictive, draconian DRM software, AP Chem Solutions should put its customers first. The company should provide its customers what they believe they have purchased: access to materials that will help them make students more successful, free from restrictive DRM. Teachers, who are already overburdened, should not have to have a four-year degree in computers just to use a product they have purchased.

To reiterate: please do not buy anything from AP Chem Solutions, which operates on the domain name apchemsolutions.com. It appears they care more about lining their pockets than they do about your ability to actually use the product you purchased.

Model-Agnostic Uncertainty Estimates through Bootstrapping

Samuel Taylor — Fri, 17 Jul 2020 00:49:51 GMT

A key element of a trustworthy model is that it can give an estimate of its confidence in a given prediction. We've already talked about one way to do this for linear models, and today we'll talk about a technique for getting uncertainty estimates for any model.

Let's continue using the fish dataset from last time:

import os
import pandas as pd

fish = pd.read_csv(os.path.expanduser("~/Downloads/Fish.csv"))

We build a ColumnTransformer for convenience:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

ct = ColumnTransformer(
    [
        ("scale", StandardScaler(), ["Length1", "Length2", "Length3", "Height", "Width"]),
        ("ohe", OneHotEncoder(), ["Species"]),
    ]
)

Next we construct a pipeline which uses the ColumnTransformer from above as well as scikit-learn's implementation of bagging. Specifically, our BaggingRegressor will consist of 100 ElasticNet models, each one trained on a random 25% of the dataset (with replacement).

from sklearn.ensemble import BaggingRegressor
from sklearn.pipeline import make_pipeline
import sklearn.linear_model as lm

pipe = make_pipeline(
    ct, BaggingRegressor(lm.ElasticNetCV(), n_estimators=100, max_samples=0.25, random_state=42, n_jobs=-1,)
)

pipe.fit(fish, fish["Weight"])

Finally, we can snag those 100 models and make a prediction for a new fish:

new_fish = pd.DataFrame(
    [
        {
            "Species": "Bream",
            "Weight": -1,
            "Length1": 31.3,
            "Length2": 34,
            "Length3": 39.5,
            "Height": 15.1285,
            "Width": 5.5695,
        }
    ]
)

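import matplotlib.pyplot as plt

# The fitted BaggingRegressor is the last pipeline step; its 100 models expect
# features that have already been through the ColumnTransformer.
estimators = pipe[-1].estimators_
new_fish_features = pipe[0].transform(new_fish)
predictions = [e.predict(new_fish_features)[0] for e in estimators]

plt.hist(predictions, bins=15)
plt.savefig("twm1_hist.png", bbox_inches="tight")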

Which gives us a nifty histogram of expected weight:

The cool thing about this approach, though, is that we can swap in any model within the BaggingRegressor, and the rest of the code is unaffected. For instance, here's the distribution of predictions when using decision trees:

Interesting idea, right? There are still a few more approaches I want to highlight in coming posts, but after that I'll be comparing them all to see which uncertainty estimation technique is best.
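
For readers curious what the bagging step is doing under the hood, here is a minimal sketch of the same idea using only the standard library. It is not what scikit-learn runs: a hand-rolled one-variable least squares fit stands in for ElasticNet, the data is made up, and the function names fit_line and bootstrap_predictions are hypothetical.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def bootstrap_predictions(xs, ys, x_new, n_models=100, sample_frac=0.25, seed=42):
    """Fit n_models lines, each on a random sample drawn with replacement."""
    rng = random.Random(seed)
    n_draws = max(2, int(len(xs) * sample_frac))
    predictions = []
    while len(predictions) < n_models:
        idx = [rng.randrange(len(xs)) for _ in range(n_draws)]
        sample_xs = [xs[i] for i in idx]
        if len(set(sample_xs)) < 2:
            continue  # a line needs at least two distinct x values
        slope, intercept = fit_line(sample_xs, [ys[i] for i in idx])
        predictions.append(slope * x_new + intercept)
    return predictions


# Made-up data: y is roughly 3 * x plus noise.
data_rng = random.Random(0)
xs = [float(x) for x in range(40)]
ys = [3.0 * x + data_rng.gauss(0, 5) for x in xs]

predictions = bootstrap_predictions(xs, ys, x_new=20.0)
cuts = statistics.quantiles(predictions, n=20)  # cut points at 5%, 10%, ..., 95%
low, high = cuts[0], cuts[-1]
print(f"median prediction: {statistics.median(predictions):.1f}")
print(f"middle 90% of predictions: {low:.1f} to {high:.1f}")
```

The spread of the 100 predictions is the uncertainty estimate; swapping fit_line for any other fitting function leaves the rest unchanged, which is the sense in which the approach is model-agnostic.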

Comments? Questions? Concerns? Please tweet me @SamuelDataT or email me. Thanks!

Trustworthy Models in Practice: a Simple Linear Approach

Samuel Taylor — Sun, 14 Jun 2020 21:07:07 GMT

Last time, we began to talk about how to build models worthy of our users' trust. As a refresher, we said that trustworthy models require at least three things:

Prediction -- An estimate for some unknown value

Confidence -- A description of how uncertain the model is about the prediction

Explanation -- The reasoning for which a model made its prediction

Today, we'll take a pass at actually implementing such a model.

Dataset

For pedagogical reasons, we're using a dataset on fish that were sold at a fish market. Here are a few rows from the dataset:

| Species | Weight | Length1 | Length2 | Length3 | Height  | Width  |
|---------|--------|---------|---------|---------|---------|--------|
| Perch   | 250.0  | 25.9    | 28.0    | 29.4    | 7.8204  | 4.2042 |
| Bream   | 714.0  | 32.7    | 36.0    | 41.5    | 16.517  | 5.8515 |
| Perch   | 145.0  | 22.0    | 24.0    | 25.5    | 6.375   | 3.825  |
| Perch   | 145.0  | 20.7    | 22.7    | 24.2    | 5.9532  | 3.63   |
| Bream   | 975.0  | 37.4    | 41.0    | 45.9    | 18.6354 | 6.7473 |

The first step, of course, is to load it up!

import os
import pandas as pd

fish = pd.read_csv(os.path.expanduser("~/Downloads/Fish.csv"))

Building a model

For our exercise today, let's see if we can predict Weight given the values of the other columns. We're going to use statsmodels to build a simple linear model.

import statsmodels.formula.api as smf

model = smf.ols(
    formula="Weight ~ C(Species) + Length1 + Length2 + Length3 + Height + Width",
    data=fish,
).fit()

If you've never used statsmodels before, think of this as fitting a linear model, with Species being one-hot encoded. statsmodels has a nice way of getting basic information about the model:

model.summary()

                            OLS Regression Results
==============================================================================
Dep. Variable:                 Weight   R-squared:                       0.936
Model:                            OLS   Adj. R-squared:                  0.931
Method:                 Least Squares   F-statistic:                     195.7
Date:                Sun, 14 Jun 2020   Prob (F-statistic):           6.85e-82
Time:                        15:00:23   Log-Likelihood:                -941.46
No. Observations:                 159   AIC:                             1907.
Df Residuals:                     147   BIC:                             1944.
Df Model:                          11
Covariance Type:            nonrobust
===========================================================================================
                              coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------
Intercept                -918.3321    127.083     -7.226      0.000   -1169.478    -667.186
C(Species)[T.Parkki]      164.7227     75.699      2.176      0.031      15.123     314.322
C(Species)[T.Perch]       137.9489    120.314      1.147      0.253     -99.819     375.717
C(Species)[T.Pike]       -208.4294    135.306     -1.540      0.126    -475.826      58.968
C(Species)[T.Roach]       103.0400     91.308      1.128      0.261     -77.407     283.487
C(Species)[T.Smelt]       446.0733    119.430      3.735      0.000     210.051     682.095
C(Species)[T.Whitefish]    93.8742     96.658      0.971      0.333     -97.145     284.893
Length1                   -80.3030     36.279     -2.214      0.028    -151.998      -8.608
Length2                    79.8886     45.718      1.747      0.083     -10.461     170.238
Length3                    32.5354     29.300      1.110      0.269     -25.369      90.439
Height                      5.2510     13.056      0.402      0.688     -20.551      31.053
Width                      -0.5154     23.913     -0.022      0.983     -47.773      46.742
==============================================================================
Omnibus:                       43.558   Durbin-Watson:                   0.973
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               97.422
Skew:                           1.184   Prob(JB):                     7.00e-22
Kurtosis:                       6.016   Cond. No.                     2.03e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.03e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

At this point, we can achieve our first objective: to provide a prediction!

new_fish = pd.DataFrame(
    [
        {
            "Species": "Bream",
            "Weight": -1,
            "Length1": 31.3,
            "Length2": 34,
            "Length3": 39.5,
            "Height": 15.1285,
            "Width": 5.5695,
        }
    ]
)
model.predict(new_fish)

This model predicts this fish weighs about 646 grams.

Providing uncertainty

The main reason I've chosen to use statsmodels (rather than scikit-learn) is that it provides built-in support for prediction intervals. Take a look:

frame = model.get_prediction(new_fish).summary_frame(alpha=0.95)
frame.round(2)

| mean   | mean_se | mean_ci_lower | mean_ci_upper | obs_ci_lower | obs_ci_upper |
|--------|---------|---------------|---------------|--------------|--------------|
| 646.12 | 18.32   | 644.96        | 647.27        | 640.11       | 652.12       |

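mean here is the prediction, and obs_ci_lower and obs_ci_upper bound the prediction interval. One caveat: statsmodels treats alpha as the significance level, so alpha=0.95 as written above produces a 5% interval rather than a 95% one. Passing alpha=0.05 (the default) gives the 95% prediction interval, which is considerably wider than the 640 to 652 grams shown here.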
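
To get a feel for how much the choice of alpha changes an interval's width, here is a rough sketch. In statsmodels, alpha is the significance level, so the interval covers 1 - alpha of the probability. The sketch uses a normal approximation (statsmodels itself uses a t distribution) and the mean_se value from the table above; the helper name half_width is made up for this example.

```python
from statistics import NormalDist


def half_width(standard_error, alpha):
    """Half-width of a two-sided (1 - alpha) interval under a normal approximation."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * standard_error


mean_se = 18.32  # standard error of the mean prediction, from the table above

for alpha in (0.95, 0.05):
    coverage = 1 - alpha
    print(f"alpha={alpha}: {coverage:.0%} interval = mean +/- {half_width(mean_se, alpha):.2f}")
```

With alpha=0.95 the half-width comes out a little over 1 gram, in line with the mean_ci columns in the table; with alpha=0.05 it is roughly 36 grams.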

We're two thirds of the way there!

Providing an explanation

We can use the structure of the model to provide an explanation. The prediction is equal to:

  -918          (the intercept)
-   80.3 * 31.3 (Length1)
+   79.9 * 34   (Length2)
+   32.5 * 39.5 (Length3)
+    5.3 * 15.1 (Height)
-    0.5 *  5.6 (Width)
   ------------
   646.12

A way we might display how the various features contribute to the overall prediction is this:

def fish_to_feats(a_fish, model):
    feats = a_fish.copy()
    feats["Intercept"] = 1.0
    for species_feat in model.params.index:
        if not species_feat.startswith("C(Species)"):
            continue
        species = species_feat.split(".")[1].replace("]", "")  # This is ugly
        feats[species_feat] = (feats["Species"] == species).astype(int)

    del feats["Species"]
    return feats[model.params.index]


contributions = fish_to_feats(new_fish, model) * model.params
for name, amount in sorted(
    contributions.round(1).items(), key=lambda t: -abs(t[1].iloc[0])
):
    if abs(amount.iloc[0]) < 1e-3:
        continue  # skip features that contribute nothing (e.g. other species)
    print(f"{name}: {amount.iloc[0]}")

Which provides the following output:

Length2: 2716.2
Length1: -2513.5
Length3: 1285.1
Intercept: -918.3
Height: 79.4
Width: -2.9

This could certainly be made more user friendly, but it does give some kind of explanation for why the model believes this fish to weigh 646 grams.

Conclusion

We've built a model that can provide trustworthy predictions. For example:

My best guess at the weight of this Bream is 646g.
The prediction interval computed above runs from 640g to 652g.
The biggest contributors to this prediction are Length2 (pushes the prediction higher), Length1 (pushes it lower), and Length3 (pushes it higher).

I highly recommend attacking machine learning problems by starting with an incredibly simple model first. Implementing that end-to-end enables focus on the truly difficult parts of machine learning (i.e. not the ML bits). For some use cases, this post provides yet another reason to love linear models: they are trustworthy by default!

Comments? Questions? Concerns? Please tweet me @SamuelDataT or email me. Thanks!

Building Trustworthy Models

Samuel Taylor — Fri, 15 May 2020 22:26:30 GMT

Machine learning has a trust problem. Discussions about the role that algorithms play in our lives have become national (if not global), with some raising important and legitimate questions about the biases inherent in these algorithms. In this environment, we wonder: what would it take for a model to be worthy of our trust?

I recently read an illuminating piece by David Spiegelhalter called "Should We Trust Algorithms?". In it, he identifies the difference between trustworthy claims about a system and trustworthy claims made by a system. His article spends more time on the former than the latter, so I've written this article to elaborate on ways our models can make more trustworthy claims.

Motivation

Building trust with users is essential for a few reasons. First, we want our products to be used. If a user doesn't trust the predictions made by my model, she is less likely to follow its advice. Worse, without adequately communicating uncertainty, we may actively anger her. Suppose a model predicts this user's house will sell for $300,000, but it ends up selling for $290,000. It's difficult to fault her for being upset at this $10,000 difference.

By contrast, if we predicted a range of possible sale values, the user would have better expectations going in and a better experience with our product.

Ethics provide a second reason that building trust is paramount. It is unethical to present estimates without a sense for their uncertainty. A common machine learning approach is to build some classification or regression model for a problem. These models typically output a single predicted value: "this flower is a setosa", "this house is worth $300K", or "this image has an airplane in it". These statements imply a level of certainty that may be unwarranted by the data, and we must be very careful to place them into context to avoid dishonesty.

Finally, trustworthy models are just good business. If we hide uncertainty with overly-precise point estimates, we are likely to make bad decisions. Annie Duke writes:

A great decision is the result of a good process, and that process must include an attempt to accurately represent our own state of knowledge. That state of knowledge, in turn, is some variation of “I’m not sure.”

Trustworthy models in practice

But building trust is hard. We cannot simply tell our users to "trust the algorithm" and expect them to do so. Instead, Spiegelhalter argues, we should put our efforts into building models that are worthy of trust. When we relate to other humans, we understand that we must demonstrate that we are worthy of trust before we will be trusted. The same holds true for models.

A coworker of mine once asserted that trustworthy models provide at least three things:

Prediction -- An estimate for some unknown value
Confidence -- A description of how uncertain the model is about the prediction
Explanation -- The reasoning for which a model made its prediction

As an example, consider a doctor. If your doctor told you that your arm needed to be amputated, you'd never let them do it just based on that recommendation! You would ask for some justification first. She would explain that an infection in your arm could be lethal if it spreads. In this way, she builds trust with you.

These techniques which come as second nature to humans are not as automatic for machines. We often stop short of developing models that are truly worthy of our users' trust.

Here's an example of what the output of a trustworthy model could look like:

Prediction: My best guess at the sale price of this home is $324,000
Confidence: A 95% confidence interval on that number is ($315K, $333K)
Explanation: The home is large, but the fact that it's on a corner brings its value down.

This is a massive improvement over a single number! Not only is this a more honest statement to make, it helps users understand why the model gave its prediction. In turn, this helps users gain trust in the system and leads to better outcomes.

We have a responsibility as model builders to represent our work with integrity. Shipping a model that is implicitly overconfident is bad for our users and our businesses. Instead, we should develop models that are truly worthy of trust.

Sources

Duke, A. (2019). Thinking in Bets. New York, NY: Portfolio/Penguin.
Spiegelhalter, D. (2020). Should We Trust Algorithms? Harvard Data Science Review, 2(1). https://doi.org/10.1162/99608f92.cb91a35a

Your brain loves exercise, even if you don't

Samuel Taylor — Wed, 15 Apr 2020 05:00:00 GMT

Effective Learning (a review of Ultralearning)

Samuel Taylor — Sun, 29 Mar 2020 12:25:19 GMT

Everyone advises being a "lifelong learner," but not all learning is created equal. Many effective techniques are underutilized, and many common techniques are useless.

We buy a book or attend a conference. If we're really dedicated, we may even jot a few notes down in the process. But rarely do we take a step back and ask what the most effective way is to develop the skills we care about. Jeopardy great Robert Craig says, "You can practice haphazardly, or you can practice efficiently" (NPR). Unfortunately, most of us are practicing haphazardly.

Fortunately, skill development is well studied. Two recent reads cover it well: Peak (by Anders Ericsson and Robert Pool; also covered in my last post) and Ultralearning (by Scott Young). Three key pieces of advice from these books are to develop intuition, focus on doing, and integrate feedback.

Develop intuition

If you've not watched Gourmet Makes yet, you're missing out! The show's seen wild popularity for many reasons, not the least of which is its host's intuition. Claire Saffitz has a wide range of experiences that she draws on to recreate classic foods. Her explanations of why she's swapping an ingredient or trying a particular technique reveal her mastery of the subject and are incredibly interesting.

This kind of intuition is explicitly identified in Young's book as one of the principles of what he calls "ultralearning":

In a famous study, advanced PhDs and undergraduate physics students were given sets of physics problems and asked to sort them into categories. Immediately, a stark difference became apparent. Whereas beginners tended to look at superficial features of the problem—such as whether the problem was about pulleys or inclined planes—experts focused on the deeper principles at work. “Ah, so it’s a conservation of energy problem,” you can almost hear them saying as they categorized the problem by what principles of physics they represented. This approach is more successful in solving problems because it gets to the core of how the problems work.

The experts in this story have better mental representations of physics problems than do beginners. It's not so much that they see past the intricate details of each problem, but they are able to identify the details that matter most.

Ericsson and Pool go so far as saying that the "main purpose of deliberate practice is to develop effective mental representations". With focused study, we exploit the wonderful adaptivity of the human brain, quite literally reshaping it to be better at the new task. In an effort to minimize energy expenditure, our brains pick up on patterns and encode those in structures to increase our future effectiveness.

Intuition is often the outcome of a long career, but we can develop it more quickly. If we can get access to an expert, we can often gain intuition by understanding how they think about things.

You're not out of luck if you don't know such an expert! There's probably somebody writing about your field online that you can learn from. For data science, the winner's interviews on Kaggle's blog are an incredible resource. For software engineering, I find High Scalability has a great roundup of articles that can lead to a lot of insight about good design. Even Reddit is sometimes a good resource.

Focus on doing

When undertaking a learning project, be very clear about what you want to do at the end of it. Specific goals focus projects and ensure better outcomes. For example, if a data scientist wants to understand deep learning techniques better, she or he may decide to build a system for reading the sign language alphabet from a user's webcam. Without a specific project, it's easy to spend lots of time watching lectures or reading books that feel like productive uses of time yet don't contribute to real skill development.

I am explicitly not saying that books and lectures are unhelpful; on the contrary, they are often the richest sources of knowledge. But without something concrete to guide our reading, we can waste time unwittingly.

I have always loved learning. I collect information like some people collect baseball cards. I find joy in having relevant tidbits of information to share with people. One of the things I'm learning, though, is that taking in information isn't an end unto itself. Ultimately, the thing that matters is what that information enables me to create, be, or do. Explicitly choosing a desired outcome for my learning projects helps me learn better.

Integrate feedback

Experimentation is key to mastery. We've got to try things, understand what went well (and what didn't), and integrate those learnings into another attempt. It's a feedback loop! Not all feedback is created equal, though. Young identifies three types:

outcome feedback, like receiving a grade on an exam,
informational feedback, where you're told what you're doing wrong (but not how to fix it), and
corrective feedback, which goes beyond mistakes you're making and includes ways to fix them.

Of course the last type is the most useful, but it is also the most difficult to get. In Peak, the authors advocate strongly for the value of a coach/mentor largely due to their ability to provide feedback. YouTube is a great way to start learning guitar, but at a certain point you need a human being to provide specific, individual feedback.

But all is not lost if we have no coach! We can use a number of techniques to gather feedback on our own. One I find interesting is the Feynman technique. To start, write a problem down on a piece of paper. Then, explain the solution as though you were teaching someone. Walk through not just the steps for solving it, but the rationale behind doing so. The most valuable feedback in this process comes when you get stuck; the parts that are hard to explain illuminate where your learning can go deeper.

Closing recommendations

If you're curious about this stuff, I recommend both Peak (here's my review) and Ultralearning. While they overlap significantly, the former has more insight on organization-level training and the latter is better for individuals structuring their own learning programs.

Life's too short for easy learning. Spend time doing the hard work of learning difficult things well. Do so by developing intuition, focusing on doing, and integrating feedback.

How to train data scientists and engineers: a review of Peak

Samuel Taylor — Thu, 12 Mar 2020 17:19:13 GMT

Most companies with personal development budgets are wasting their money. If the goal is to help employees master valuable skills, then we are misallocating funds to books and conferences. Instead, we should look at what the research on skill development says so that we can make decisions informed by data.

Great people are hard to find, especially in software engineering and data science. The people who you want to hire probably work for your competitors and make more than you can afford to pay them. If instead of finding employees that are already great you could help employees become great, then the process of doing that would become a huge competitive advantage.

I recently read the book Peak by Anders Ericsson and Robert Pool. In it, the authors espouse the value of deliberate practice and offer research-backed insight into effective training practices. How might a company go about creating top-tier talent?

Before we talk about what works, let's think about what doesn't:

attending lectures, minicourses, and the like offers little or no feedback and little or no chance to try something new, make mistakes, correct the mistakes, and gradually develop a new skill

Unfortunately, many corporate training budgets are set up to send people to lectures and minicourses rather than building good training programs. The authors offer advice on how we might train doctors, though we could adapt their recommendations to the tech industry.

The first step is to find the experts. Ideally we could do this on a global scale, determining the greatest data scientists in the world. Unfortunately, even determining a methodology to find these people sounds prohibitively complex and difficult. Fortunately, we can settle for an approximation. Bringing an average engineer to the level of elite talent on the world stage would be incredible, don't get me wrong, but even getting that person to perform like the best engineer at the company would be a huge win.

Finding the top talent at your company may involve: asking various individuals who they hold in particularly high regard, examining performance review data, and/or determining which individuals have had the greatest positive impact on the business. If you're fortunate enough to have access to brilliant individuals outside your company, all the better!

Once we've found these experts, we move on to step two: understanding how they think about problems. Ericsson and Pool go so far as saying that the "main purpose of deliberate practice is to develop effective mental representations."

Having done this crucial work of understanding highly effective individuals, our third step is to build a "Top Gun" school. Modeled after an effective strategy for training fighter pilots, Peak's authors recommend creating training programs that simulate the real thing as well as possible while dramatically lowering the cost of failure. In our industry, this might mean identifying a few JIRA tickets representative of a team's work and having trainees work them under the watchful eye of high-quality instructors. These instructors should point out failures to their students and help the students to develop the thought processes (i.e. mental representations) of high performers.

Note that we're not developing coursework on design patterns, data warehouse design, deep learning, or a certain JavaScript framework. Instead, we focus on doing the work:

One of the implicit themes of the Top Gun approach to training, whether it is for shooting down enemy planes or interpreting mammograms [or predicting widget manufacturing capacity], is the emphasis on doing. The bottom line is what you are able to do, not what you know

Building great teams isn't easy, but it is incredibly valuable. More than just spending $50 on eBooks, companies should create training programs that use data-driven insights into skill development to bring each team member to the level of our best performers. Great products come from great teams, and great teams are formed of great individuals. Great individuals are formed through deliberate practice.

How to Handle Class Imbalance

Samuel Taylor — Wed, 30 Oct 2019 14:04:12 GMT

Alternate title: Help! My Classes are Imbalanced!

Delivered at:

ODSC West 2019. Slides available here.
DeveloperWeek Austin 2019. Audio available here.
Aspiring Data Scientist Community. Audio available here. Transcript below.
AnacondaCON 2020

Find me on Twitter @SamuelDataT.

Transcript

The start of this story is that at one point I was in undergrad, and I was in this machine learning class (the first machine learning class I'd ever taken). And I was working on this Last.FM dataset trying to build a recommender system, because I've always thought it would be really cool to build a computer algorithm that can help you overcome information overload. And I thought this would be a cool way to try that. I started working on this little handwritten thing, and I was feeling pretty good about it. And then I walked into my professor's office and I said, "Hey, you're not gonna believe this. I have a 99% accuracy rating already, and I've barely even started."

And he doesn't react like I wanted to. I wanted him to be like, "Wow, you are a prodigy. This is amazing!" But what he actually says is, "OK, well, tell me a little bit more. What's the base rate of your problem?" And I say, "What was that?" And he says, "What would happen if you just predicted the most common class for everything? What if you just said nobody listens to anything?" So I tried that, and it turned out that that's exactly what the algorithm was doing. It was just saying nobody listens to anything. And we're getting 99% accuracy because there's so many artists in the world and there are so many people in the world that the intersection of that is going to be pretty small. So I was very sad about this. And eventually, I found a solution.

But the frustrating part about it was I kept running into this over and over and over again, this problem of class imbalance. So this is a talk that I wish I had when I was in undergraduate school before I walked into my professor's office and made myself look really stupid. I wish that I could have seen this.

Before we get started, I do work for Indeed. Every Indeed presentation has this slide in it that says "We help people get jobs." If you like the idea of helping people get jobs, come talk to me. Today we're going to talk about class imbalance. This is the way in which we're going to do that:

We will start off with what it is
then we'll move on to how to figure out what is happening
then talk about some solutions for it.
And then at the end, I have some recommendations that sum everything up and try to tie a nice, intellectually tasty bow on this package, because there's gonna be a bunch of stuff.

So let's start off with what is class imbalance. Class imbalance happens when you have certain values of your target variable that are way more common than other values. For instance, this is a wine classification dataset. You may notice that there are orange points and there are blue points on this graph. And the orange points are way outnumbered by the blue points. We can say that this dataset exhibits class imbalance because there are way more blue points than there are orange points. There's a lot of things that can cause imbalance, and we're just going to walk through them because understanding them will help us understand the solutions there.

The first thing is there's just a lack of data. So this is a made up graph I have drawn. And let's say you have two features. One of them is on the x axis and one is on the y axis here. And then you have some points that are orange and some points that are blue. It's difficult to know -- where is the true blue region? Is it defined by an ellipse that covers both of these? Is it defined by a rectangle that covers that? Are there two separate ellipses? I don't know, there's just not a lot of data to really be able to infer that from what we have here. So that's part of the problem with this.

Another thing that can be problematic is overlapping. This is from a paper, I believe by Batista, where he and his co-authors talk about the fact that sometimes, even if you have a heavy degree of class imbalance, the problem isn't actually that bad. If you look at the bottom right of this screen, there is an imbalance here -- you have way more blue points than orange points -- but you can still draw a linear separator between these two things, so you can still separate them just fine. The problem only becomes worse when the classes start overlapping, and that's what you see more toward the top left. These points would be difficult to distinguish from one another.

Noise is another important factor in why this happens. You can imagine that we have some blue region that is some set of points. And in the real world, you don't observe these regions, but for pedagogical reasons, go with me. So we have the blue region and the orange region, and we've got some points, way more blue points than orange points. And just by the sad law of large numbers, or our instruments being off or something, we measure these ones noisily. We accidentally think that these points are way higher than they actually are. And that means that we are going to incorrectly think that this is part of the majority class region (the blue region), when in truth it is part of the orange region. Because there were so many more blue points that we saw, we had a much better shot of misreading some of them, and the orange points just got overwhelmed.

The one that I think is the most well theoretically justified is this idea of biased estimators. This comes from a paper that I have linked at the end of the slides by Wallace, Small, Brodley, and Trikalinos. It's called Class Imbalance, Redux. And it's just, like, this really beautiful theory that they present. The crux of their argument is in this figure. Here they display a binary classification problem in two dimensions. So just along the x axis here, we'll say that's our one feature. If it's an X, it's a minority class point. If it's a square, it's a majority class point, and we can see there are a lot more squares than Xs. What they find is that you don't actually have one distribution that you're drawing all the points from; you have two different distributions that you're drawing from, and you happen to see some from the majority class a lot more often. So you're sampling a lot more out of this orange distribution than you are out of the blue one.

Ideally, in the greatest world, where we could see where these distributions live, we could draw this beautiful ideal separating line that would, to the best of our ability, separate these two areas. But because we have a class imbalance problem, the line gets biased: it gets pushed in the direction of the minority class, and that means that we're going to accidentally cut off some of that region that should be part of the minority class and incorrectly allocate it to the majority class.

So those are the causes. Let's take a deep breath.

Okay, how do you recognize it? The first thing: just look for it. Like, call value_counts, just so you know that it's happening, because the first time I ever ran into this, I didn't think to check for it. I just assumed stuff. So don't make assumptions; bad idea.

The next thing you should do is compare stuff. This is a trick question, everybody: 97% accuracy. Is that good? What does anybody think? Anybody?

Yeah, yeah. I mean, when you compare it to this, like 97% for our fancy classifier but 94% for a really dumb classifier, then it looks a little bit less impressive. It's like, oh, we actually had an improvement of 3%. So one thing that you can do, and that scikit-learn provides, is this API for a dummy classifier. You can have it do a bunch of stuff: you can have it predict the most common class, or just predict something at random. But I highly, highly recommend that when you run into a class imbalance problem, or really whenever you run into a problem, you try a stupid classifier that you can use to compare metrics and just sort of sanity check yourself. Just look at the two numbers next to each other and you'll realize, oh, 97 isn't that good, because 94 you can get by guessing.

This is the code for how to do that.
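A minimal sketch of that comparison, assuming scikit-learn is installed. The synthetic dataset, the logistic regression standing in for the "fancy" classifier, and all variable names are illustrative assumptions rather than the code shown on the slides.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Illustrative imbalanced dataset: roughly 94% of points fall in class 0.
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.94, 0.06], random_state=0
)

# Step one: look at the class counts before doing anything else.
# (With a pandas Series, y.value_counts() gives the same information.)
class_counts = np.bincount(y)
print("class counts:", class_counts)

# stratify=y keeps the minority-class prevalence similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Baseline: always predict the most common class seen in training.
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train, y_train)
dummy_acc = accuracy_score(y_test, dummy.predict(X_test))

# The model we actually care about, fit and scored the same way.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
model_acc = accuracy_score(y_test, model.predict(X_test))

print(f"dummy accuracy: {dummy_acc:.3f}")
print(f"model accuracy: {model_acc:.3f}")
```

Putting the two accuracy numbers side by side is the whole point: the gap between them, not the raw model accuracy, is what the model has actually learned.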

Basically, you just import this stuff, and then you call fit and predict just like anything else, and you can look at the numbers and see for yourself how you're doing.

So far, all we've really talked about is accuracy. And that is a huge part of the problem with class imbalance: accuracy just assumes that every error is the same, which is often not true. So let's use a medical example to really make this concrete. You can imagine that we might be scanning someone's brain for cancer, and we have images of malignant tumors and benign tumors. And we can make a mistake on either one. If we make a mistake on a benign tumor, that means we tell the patient, "Hey, we think your tumor might be cancerous, we're going to need you to come in for some additional tests." And there is a cost associated with that: they are going to be worried, their family is probably going to be worried, they're going to have to pay more for the extra tests, and your staff is going to have to run the extra tests. There is a real cost there. But it is very different from the cost of making a mistake in the malignant case. If we see a malignant tumor and we accidentally say that it is benign, we're going to send someone with cancer home and say, "Oh, you're good, don't worry about it." And then they're probably going to die because they didn't get additional screening. That's obviously terrible. So what this means is that if you use accuracy, you are implicitly saying that the death of a human person is exactly as bad as calling someone in for additional tests. And that's obviously absurd.

Did you have a question? [Audience question, partly inaudible.] Yeah, yeah. You can be incorrect in, like, both ways. Totally. Yep.

So one set of metrics that people often like to use is precision and recall. On this little diagram here, we have some false negatives and true negatives. What this means in our medical example is that for precision, what we're saying is: of all the tumors that I say are malignant, how many of those actually turned out to be malignant? And then recall is saying: of all the malignant tumors in the world, how many am I actually correctly identifying? Recall is sort of a way of knowing: am I pulling back out, am I recalling, the points that I particularly care about?

Another set of metrics that you should use rather than accuracy is the receiver operating characteristic curve. The gist of how you do this is: most classifiers can give you some sort of probability or decision function. You can then vary a threshold from zero to one, calculate the true and false positive rates at each threshold, and put them on this curve.
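To make those definitions concrete, here is a small sketch using scikit-learn's metrics functions. The ten labels and scores are made up for illustration (1 stands for the class we care about, such as malignant), and 0.5 is just an example threshold.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, roc_auc_score, roc_curve

# Hypothetical ground truth and model scores: 7 negatives, 3 positives.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
y_score = np.array([0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.45, 0.80, 0.90])

# Turn scores into hard predictions at one particular threshold.
y_pred = (y_score >= 0.5).astype(int)

# Precision: of everything flagged positive, how much really was positive?
precision = precision_score(y_true, y_pred)
# Recall: of everything that really was positive, how much did we flag?
recall = recall_score(y_true, y_pred)

# The ROC curve sweeps the threshold and records the false and true
# positive rates at each step; the AUC summarizes the curve in one number.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

print(f"precision={precision:.2f} recall={recall:.2f} auc={auc:.3f}")
```

At the 0.5 threshold, four points are flagged and two of them are real positives, while one real positive is missed. The AUC does not depend on that threshold at all; it only depends on how the scores rank positives against negatives.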
The ROC curve is nice because it lets you think about how good this model is at various levels of the threshold, and it implicitly tells you how well your model is ranking points against each other. This is a method recommended by [inaudible] in their paper about ROC curves, and they find that when you don't know how bad the two alternatives are, it's really good to use an ROC curve, because you aren't required to know those things in advance. So for instance, is it worse for a user to see a job that they don't want to apply to, or is it worse for a user to not see a job that they do want to apply to? It's kind of hard to figure out whether one of those is twice as bad as the other, and when you don't know, that is often a good place to use the area under the curve.

[Audience question, partly inaudible: if you do know the costs, is there a way to account for that?] Yes, we will talk about that in a little bit. If you do know the cost, there are techniques that you can use for sure. And we'll talk about those in a second.

So the last thing I'll mention on metrics is to be really careful with the way that you do your training and testing splits. There's this really interesting paper that I didn't have time to put into this because it came out this year. The lead author is a person named Luque, L-U-Q-U-E, and it's linked in the slides. They do some really rigorous research on the way that various metrics are affected by imbalance and changes in the balance.

And the gist is that a lot of the metrics that you care about are probably going to be very different just based on the prevalence of the minority class. So if, just by random happenstance, you get one test split where there's 5% of the minority class versus one with 10% of the minority class, you're going to see dramatically different error numbers for those two. And that's not necessarily reflective of some sort of underlying truth. It's more a reflection of the bias inherent in certain metrics. So what I would highly recommend you do is, when you're doing training and testing splits, do a stratified split, where you make sure that you have a very similar prevalence of the minority class in each split.

Okay, that was a lot of stuff. Let's talk about how to solve this problem. There's a lot of different things you can do. The first thing you can do is kind of like eating your vegetables. Everyone knows it's a good idea to eat your vegetables. Like, I was eating tamales, you know, and I know that I need to eat more spinach. I know that it's good for me; it's going to make me last longer in life. And that's kind of what this is: gathering more data is the eat-your-vegetables option. It's going to help you, it's good, but it kind of sucks. So if you can't do that, or if you don't want to do that, there's a bunch of other techniques that will actually be the bulk of this discussion.

This is a taxonomy described in this really good survey paper by Branco, Torgo, and Ribeiro, where they talk about three different ways that the research has thought about addressing this problem: preprocessing, special purpose learners, and prediction post-processing. So we'll go through each of those in this little chunk here. First, preprocessing.
When we talk about preprocessing, we are basically talking about taking our dataset and either making more points or making fewer points: changing the distribution of these classes versus each other.

So we're going to talk about oversampling first. In oversampling, you take the data that you have and you make more minority class points. You can do that in a number of ways. The first way that I have up here is random: you just take some points and duplicate them in your dataset. It's difficult to see that that's happening, because these points are all on top of each other. This is in two dimensions, and we don't have our fancy 3D glasses to, you know, zoom into this. But in the other examples, you can see more clearly what's going on, where in SMOTE, for instance, they're creating new minority class points.

So SMOTE is a technique that is used for oversampling the minority class. And this is sort of the algorithm for how you do that. You take some minority class point, and then you find its k nearest neighbors (k is a hyperparameter, so you can tune it to figure out what the optimal value is). Then you pick a point that is some percentage of the way between those points. So you're interpolating between points to make new points; that's the idea. You keep doing this until you reach the level of balance that you want.
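The interpolation step can be sketched in a few lines of NumPy. This is a simplified illustration rather than the reference implementation: the function name `smote_sketch` and its parameters are hypothetical, neighbors are searched among minority points only, and the neighbor is chosen uniformly at random as in the original paper.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority points."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)  # cannot have more neighbors than other points

    # Pairwise distances between minority points only.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]  # k nearest neighbors per point

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(n)                  # pick a minority point
        neighbor = rng.choice(neighbors[base])  # pick one of its neighbors at random
        gap = rng.random()                      # fraction of the way, in [0, 1)
        synthetic[i] = X_min[base] + gap * (X_min[neighbor] - X_min[base])
    return synthetic

# Illustrative usage: 20 minority points in two dimensions, 80 new ones.
X_min = np.random.default_rng(1).normal(size=(20, 2))
new_points = smote_sketch(X_min, n_new=80)
print(new_points.shape)
```

Because every synthetic point lies on a segment between two real minority points, the new points never leave the region the minority class already occupies. In practice, the `SMOTE` class in imbalanced-learn (`imblearn.over_sampling`) is the better choice; the sketch is only meant to show the mechanics.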

The way that you choose which of your neighbors to interpolate between matters, and that's an area that has been researched further. The original paper just did it randomly, and that's what this SMOTE diagram is showing here. But there have been updates and further research on this, where people tend to find that ADASYN is a really good alternative: instead of just picking randomly, you try to pick points that are closer to the decision boundary. You can kind of see that there's a lot of points on this SMOTE diagram toward the bottom. And we already know that the bottom is blue on this diagram. Where we might need more help is when you're getting closer to those orange points up toward the top, and that's what ADASYN tries to do.

You can also go the other way. So that's the oversampling idea. Did you have a question? [Audience question, partly inaudible: the new points all belong to the minority class, so how does that really affect the decision boundary?] Yeah, I have some diagrams that I can use to explain this in a sec. The gist is that, with that thing we talked about earlier, the separating line getting pushed toward the minority class: when you have more of those minority class points, they are able to sort of fight back and push that separating boundary away. And there are some diagrams I can show you in a second.

So undersampling is the other way you can go.

Coming back to our idea of noise here, I want to remind you that if we zoom in on this little section, we're going to see these two points next to each other. The insight of this particular technique is to say that if we have two points right next to each other that are different classes, we probably measured one with noise. And this, I'm going to be honest with you, is a total heuristic; it's not a guaranteed thing. It's just, like, probably. You know, probably. So with Tomek links, when you're using them as an undersampling technique, you just take the one from the majority class and throw it away. You're just like, "Yeah, that one's probably not legit, let's just not care about it." So that's one thing you can do.

The other thing you can do is way simpler: just randomly throw points out of the majority class until you get to some closer approximation of balance. And that's actually what the Wallace paper argues for. They find that these green lines here are what happens when you undersample to get to a balance point. You can do that in a number of different ways, and you can see that you get a lot of different planes: depending on which points you happen to throw out, you could end up drawing your plane in a different spot. That can lead to different amounts of error for each plane, but the nice thing is that all of these planes are less biased than the original one that we inferred. So this purple line that I had here was our original biased estimator, and all these green ones are at least less biased than that. Now it does suck that the error metrics are going to be different for all of them. And in the face of this variance, the authors just suck it up. They're like, "All right, here's what we're going to do: we're going to bag things together," because bagging is a great way to trade off bias and variance.
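Both undersampling ideas fit in a short sketch: the mutual-nearest-neighbor cleanup (known as Tomek links) and random undersampling combined with bagging. The function names, the synthetic Gaussian data, and the use of logistic regression as the base model are illustrative assumptions; labels are assumed to be 0 for the majority class and 1 for the minority class.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def tomek_link_majority_indices(X, y, majority_label=0):
    """Indices of majority points whose mutual nearest neighbor has another label."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    nearest = dists.argmin(axis=1)
    idx = np.arange(len(X))
    mutual = nearest[nearest] == idx   # i and nearest[i] pick each other
    different = y != y[nearest]        # and they belong to different classes
    return idx[mutual & different & (y == majority_label)]

def undersample_and_bag(X, y, n_models=10, majority_label=0, seed=0):
    """Fit one model per balanced random undersample of the majority class."""
    rng = np.random.default_rng(seed)
    majority = np.flatnonzero(y == majority_label)
    minority = np.flatnonzero(y != majority_label)
    models = []
    for _ in range(n_models):
        keep = rng.choice(majority, size=len(minority), replace=False)
        sample = np.concatenate([keep, minority])
        models.append(LogisticRegression(max_iter=1000).fit(X[sample], y[sample]))
    return models

def bagged_proba(models, X):
    """Average the minority-class probability over all bagged models."""
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)

# Tiny hand-made check: points 0 and 1 are mutual neighbors with different labels.
X_small = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
y_small = np.array([0, 1, 0, 0])
small_links = tomek_link_majority_indices(X_small, y_small)

# Illustrative imbalanced data: 500 majority points, 25 minority points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (500, 2)), rng.normal(2.0, 1.0, (25, 2))])
y = np.concatenate([np.zeros(500, dtype=int), np.ones(25, dtype=int)])

links = tomek_link_majority_indices(X, y)
models = undersample_and_bag(X, y)
proba = bagged_proba(models, X)
print("majority points in Tomek links:", len(links))
```

Each bagged model sees a different balanced subset, so each draws a slightly different boundary; averaging their probabilities smooths out that variation. imbalanced-learn ships maintained versions of these ideas (for example `TomekLinks` and `RandomUnderSampler` in `imblearn.under_sampling`).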
So we're just going to, you know, give up a little on bias and gain a little bit on the variance side. You can take your dataset and undersample it in a bunch of different ways, and then bag those classifiers together and get a more performant model. If you're going to do any of this, I highly recommend it. And if you are using Python, then use imbalanced-learn, because they implement a lot of this stuff for you.

Okay, now, deep breath. Preprocessing is great, and it's not so great. So let's talk about when it's good and when it's bad. It's great because the libraries already exist for a lot of this stuff, which is nice. It also kind of gets the model closer to what you're looking for: it undoes some of that bias that we see from the Wallace paper.

Now, this isn't really a good or bad thing, but it does change the cost of training your model. If you imagine that you have so much data that you can barely fit it on your computer, and you start oversampling that data, you're going to have too much data and you won't be able to train your model, or at least it's going to make your model train longer. By contrast, if you are undersampling data, you're probably going to have faster model training times because you have less data. So that's not really good or bad; it's just something to be aware of when you're using this technique.

Now, it can be kind of difficult to apply this, because you don't always know what level of balance you want to get to. Should I go to where instead of 1% it's [inaudible] percent, or should I try to go straight to balance? It's not always clear. You kind of have to explore and experiment to figure out what the right thing is. There's a second way this can be difficult to apply.
If you think back to that SMOTE example, that was dealing with real-valued or floating point numbers. You can imagine that if we had categorical data, it would be kind of difficult to do that. If you're doing word count vectors or something, what does it even mean to have 0.7 of the word "apple" in a document? So there are adaptations of the algorithm to deal with categorical data, but they're less [inaudible], I'll say.

Okay, next up in terms of solutions to this problem: special purpose learners. You've probably already seen this if you have looked at the documentation for any of your libraries. I just went through and found some of my favorite algorithms, copied the docstring or the documentation, and highlighted it: a bunch of these have a way to specify a class weight. What this does depends on the model; various models do different things. In tree models, the first thing this is going to affect is the impurity calculations. When you're going through and trying to figure out what feature to split on, the impurity calculations get weighted based on the weights you give. That's fine. The other thing it'll affect is voting at prediction time. If, for instance, you say that our minority class should count for twice the majority class, then if you get to the bottom of your tree, and you get to a leaf, and there is one majority class point and one minority class point, the minority class wins, because it has two votes and the majority class point only has one vote.

It's different in different places. For SVMs, what this does is push the hyperplane that it learns away from the minority class. This is cool because it does the same thing of undoing that bias that the Wallace paper talks about. And they find, actually, that doing this weight-based minimization is very similar to doing an oversampling technique. It does the same thing in logistic regression, where it just sort of wins back some points for the minority class. In k nearest neighbors, it changes the distance metric. And then whatever else changes whatever else: in every different algorithm, weighting the minority class is going to do something different. That's cool because there's all sorts of interesting things you can go off and learn about, but it also sucks because it means there's all sorts of interesting things that you could go off and learn about.

So instead of talking about every single different thing that class weighting does, let's talk about when this works and when it doesn't. That way, if you run into this situation, you can think, "Okay, should I use class weights?" The research that I have read finds that when you have a really high degree of imbalance, weighting is less effective. This is from that Wallace paper, and they basically just say that when we have a lot more imbalance, this weighting technique is less likely to work, and it's going to be more effective with more data. This is where the whole eat-your-spinach thing comes back in, because more data is always going to make this better.
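As a rough illustration of the class weight option, here is a sketch using the `class_weight` parameter that scikit-learn's `LogisticRegression` accepts. The synthetic data and the 10-to-1 weighting are arbitrary assumptions chosen to make the effect visible; they are not recommendations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Illustrative data: 2000 majority points (label 0), 100 minority points (label 1).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (2000, 2)), rng.normal(1.5, 1.0, (100, 2))])
y = np.concatenate([np.zeros(2000, dtype=int), np.ones(100, dtype=int)])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0
)

# Unweighted model: every training error costs the same.
plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Weighted model: an error on a minority point counts ten times as much.
weighted = LogisticRegression(max_iter=1000, class_weight={0: 1, 1: 10})
weighted.fit(X_train, y_train)

plain_recall = recall_score(y_test, plain.predict(X_test))
weighted_recall = recall_score(y_test, weighted.predict(X_test))
print(f"minority recall, unweighted: {plain_recall:.2f}")
print(f"minority recall, weighted:   {weighted_recall:.2f}")
```

The weighted model recovers more of the minority class because its decision boundary moves back toward the majority side, usually at some cost in precision. Other scikit-learn estimators such as `DecisionTreeClassifier` and `SVC` accept the same parameter, including the shortcut `class_weight="balanced"`.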
And they have a really good theory-backed, like, equation-based theory of imbalance that they draw this reasoning from. If you're going to read one of these papers, read the Wallace one, in my opinion, because they have really interesting stuff to say.

So, good things and bad things on special purpose learners. We're sort of two thirds of the way through these right now. Did you have a question? Yes. [Audience question, partly inaudible, about whether weighting is effective when you have more data but the data is still imbalanced.] Right. Yeah. So the question is sort of: if I have a lot of data and I have imbalance, is it going to be good or not? And I think Wallace doesn't really make an argument for what happens when both vary. They say that for a fixed value of imbalance, so let's say you have a dataset where only 1% is in the minority class, if you just get more points, class weighting will start to become more effective. And then they also say that for a given dataset size, if one dataset happens to have 10% versus 1% in the minority class, the 10% dataset is going to have better results from using class weighting.

Okay, good things, bad things. Good things: it directly addresses the issue. Bad things: you still have to know what your cost-benefit is here. Which points do you care about more, and how much more do you care about certain points than other points? It's nice that this works when you have a lot of data or when you are a little bit closer to balance. But if you don't already have this class weighting idea in the algorithm, it's going to suck. You're going to have to really get a deep understanding of the algorithm, maybe write your own implementation of it, and figure out, "Okay, what part of this algorithm is suffering from bias, and how can I undo that?" And that's difficult to do sometimes.

Okay, so the last group of techniques that they talk about is prediction post-processing. There are two things I want to talk about here. First is threshold selection. Oftentimes, when I run into a class imbalance problem, my first thought is: can I just turn this into a ranking problem and not have to worry about this so much? You might think of something like: I want to send an email that has 10 jobs to a user, the 10 jobs I think they are most likely to like. In that case, the user probably only likes 0.1% of the jobs that you know about, because they work in one industry. But you don't really need to worry about that too much if you can rank the jobs against each other. If one has a, you know, 0.05% chance of them liking it, that will still rank above a lot of the other stuff. And so you can pick the top 10 in terms of whatever criteria you have.
If you can't do that, though, it is foolish to just use the default threshold that it gives you. You should very specifically choose the threshold that you want to use to optimize your metrics. So if you have to pick that number, like, what percentage chance do I need to choose where, above that point, I'm going to call someone back for their cancer screening, you need to pick that number really carefully. The gist of how you would do this is you would get the probability output of your model, and then figure out what metrics you care about. I put precision and recall on the screen because we talked about them earlier, but you can use whatever metric you happen to care about. You want to vary that threshold and measure the various metrics that you care about. And you can then look at this and make an informed human decision on what you want to do.

So we imagine, in this cancer case, we are probably willing to sacrifice precision for recall. We are willing to accidentally call some people back that are just fine, because we would rather do that than let someone who has cancer go unnoticed. But there's other cases where it's exactly the opposite. Like, if I am trying to fingerprint into my phone, the cost of me missing it is I get a little annoyed, but the cost of someone getting into my phone who is not actually me is a lot, because they're going to get in there and they're going to read my memes and tell me they're not funny, and that's gonna hurt my feelings. I'm gonna have to go to therapy.

One last note on this: don't use your test set to do this. Do some sort of cross-validation thingy over your training set, but don't use your test set, or you'll overfit your test set. And that'll be really sad.
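The threshold-selection procedure described above could be sketched roughly as follows with scikit-learn. The data is synthetic, the 0.90 recall target is a hypothetical requirement chosen for illustration, and the out-of-fold probabilities come from cross-validation over the training set only, in line with the advice not to touch the test set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import cross_val_predict

# Made-up training data with a 10% minority class.
X_train, y_train = make_classification(
    n_samples=2000, n_features=5, weights=[0.9, 0.1], random_state=0
)

# Out-of-fold predicted probabilities for the positive class (training set only).
proba = cross_val_predict(
    LogisticRegression(max_iter=1000), X_train, y_train, cv=5, method="predict_proba"
)[:, 1]

# Precision and recall at every candidate threshold.
precision, recall, thresholds = precision_recall_curve(y_train, proba)
# The returned precision/recall arrays have one extra trailing entry
# that has no matching threshold, so drop it.
precision, recall = precision[:-1], recall[:-1]

# Hypothetical requirement: keep recall at or above 0.90, then take the
# threshold with the best precision among those that qualify.
target_recall = 0.90
qualifies = recall >= target_recall
best = np.argmax(np.where(qualifies, precision, -1.0))
chosen_threshold = thresholds[best]
print(f"threshold={chosen_threshold:.3f} "
      f"precision={precision[best]:.2f} recall={recall[best]:.2f}")
```

In practice you would likely plot precision and recall against the threshold and make the call by eye, as the talk suggests, rather than hard-coding a rule like this one.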
And the other post-processing technique that people talk about is the idea of cost-based classification. This more directly gets at this idea of there being direct costs for a false positive and a false negative. There's a couple papers here; we're just going to talk about the Sinha and May paper, where what they essentially find is that when you have an ROC curve, each point on it will correspond with some threshold. Like, that's the threshold for 0.64: if there's a 64% chance of this person having cancer, we'll call them back, right? And we have a true positive rate and a false positive rate. We can use that to calculate cost with this formula. It's not as scary as it looks. The gist is we take the probability of a point being in the negative class, multiply that by the cost of a false positive, and then multiply that by a number we can read off of our ROC curve. And then basically we do the same thing, but for the positive class: we figure out how big the positive class is and what the cost of making a false negative is, and then read this number off of the ROC curve. So then we can add all this stuff together. These orange numbers here I have put in to signify that we have a case where the minority class is 10% of the dataset. And I have these blue numbers in here to say that a false positive is five times as bad as a false negative; I happen to know that a priori. I can plug these numbers in, calculate these values of cost for different thresholds, and then choose the one that is the best one, and in this case, that's the 3.6.

[Audience question] So going back to that equation really quick, am I correct in understanding that the costs can be expressed in relative terms, that any set of integers or floats that have that relationship works? So you don't necessarily need to express the costs in real-life terms. You can just say a false positive versus a false negative is x times more or less costly?

Yes, you're exactly correct. These don't need to be real-life costs. They could be: if you happen to know that a false positive costs your company $7 and a false negative costs it $2, you could use that. But you can also just come up with something: I'm gonna say this is five times as bad, or whatever. Those can be whatever numbers you want. Anyway, so then you pick the one with the lowest cost, and that's the threshold to use.

I want to point out this is different from the idea of special purpose learners. The first couple times I gave this talk, this was a little confusing, I realized, because I didn't talk about it. So, they're different.
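To make the cost calculation concrete, here is a small sketch in plain Python. It assumes the formula reads as cost = P(negative) * cost_FP * FPR + P(positive) * cost_FN * (1 - TPR), which is one reading of the spoken description rather than a transcription of the slide. The 10% minority share and the 5-to-1 cost ratio come from the talk; the three ROC points are hypothetical numbers invented for this example.

```python
# Class balance and relative costs taken from the talk's example.
p_positive = 0.10            # the minority (positive) class is 10% of the data
p_negative = 1 - p_positive
cost_fp = 5.0                # a false positive is five times as bad...
cost_fn = 1.0                # ...as a false negative (only the ratio matters)

# Hypothetical ROC points: (threshold, false positive rate, true positive rate).
roc_points = [
    (0.2, 0.30, 0.95),
    (0.5, 0.10, 0.80),
    (0.8, 0.02, 0.50),
]

def expected_cost(fpr, tpr):
    """Expected cost per prediction at one point on the ROC curve."""
    false_positive_part = p_negative * cost_fp * fpr
    false_negative_part = p_positive * cost_fn * (1 - tpr)  # 1 - TPR is the miss rate
    return false_positive_part + false_negative_part

# Cost at each candidate threshold, then pick the cheapest one.
costs = {threshold: expected_cost(fpr, tpr) for threshold, fpr, tpr in roc_points}
best_threshold = min(costs, key=costs.get)
print(costs)
print("lowest-cost threshold:", best_threshold)
```

Because only the ratio of the two costs matters for which threshold wins, scaling both costs by the same factor leaves the choice unchanged, which matches the answer to the audience question.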
The idea of special purpose learners is that you're modifying the algorithm to have this idea of weighting built into it. In cost-based classification, when we're trying to choose the right threshold, we're doing this after we've already trained the model. And because of that, we can use it out of the box with almost any model, as long as it provides some sort of probability output or some sort of decision function, which is very nice and means you don't have to go fiddling about with the interior of every single model that doesn't have this baked in.

So, good things, bad things for prediction post-processing. It's pretty straightforward: just pick the threshold that's the best one. It's a simple idea to explain, and it kind of gets at what we're looking for, which is to try to optimize our metrics for some value. And it is also nice that you can use it with almost anything, because most models provide some sort of decision function. A problem is that this is not really studied a whole lot in specifically imbalanced domains. The survey paper that I am referencing for a lot of this, they only found two papers that even talked about this, and neither of those was specifically about class imbalance issues.

So, a lot of things, right? Let's talk about just some highlights here, hit some recommendations. Based on what I read and what I have done in my life, here's some things I think about. I kind of think about this like Maslow's hierarchy of needs: you need to have clean air and water before you start worrying about food. And then once you're not worrying about food, you can worry about shelter, and eventually you get a few rungs up and you start caring about, you know, your therapist calming you down from having bad memes.
So the first level here, I would say, is just establish some sort of baseline. Train that dummy classifier, train something stupid, and compare your actual metrics against it. If you don't know what metric to use, unless you have a good reason otherwise, which you very well might, I would recommend using the area under the ROC curve, because it is unbiased in these situations, for a very specific meaning of the word biased.

The next thing up from that is, if you can, try using class weights. Just try saying that your minority class is a certain amount more important, if your algorithm supports that. And then from there, I would recommend picking your threshold smartly. Don't just use 0.5. That's probably not the right answer. It might be, but it probably isn't.

Once you get to that, if you're trying to eke more performance out, or you're trying to address something else, I would say at this point, start using a random sampling technique. We talked about fancy methods, we talked about SMOTE, we talked about Tomek links.

The research did not bear those out as being super great.
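The first rung of that hierarchy, a baseline, could look something like the sketch below. It uses scikit-learn's `DummyClassifier` and area under the ROC curve; the synthetic data and the choice of logistic regression as the "real" model are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Made-up data with roughly a 5% minority class.
X, y = make_classification(
    n_samples=2000, n_features=5, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# "Something stupid": always predict the majority class.
dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Accuracy flatters the dummy model on imbalanced data...
dummy_acc = accuracy_score(y_test, dummy.predict(X_test))
# ...while area under the ROC curve shows it carries no ranking information.
dummy_auc = roc_auc_score(y_test, dummy.predict_proba(X_test)[:, 1])
model_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

print(f"dummy accuracy: {dummy_acc:.3f}")
print(f"dummy AUC:      {dummy_auc:.3f}")
print(f"model AUC:      {model_auc:.3f}")
```

The dummy model's high accuracy next to its chance-level AUC is a quick way to see why accuracy alone is a misleading yardstick here.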

So the Wallace paper, actually, they make a really strong recommendation that in almost all imbalanced scenarios, practitioners should bag classifiers induced over balanced bootstrap samples. And then there's another paper by Batista, Prati, and Monard, where they find that random oversampling is really competitive with these more complex oversampling techniques like SMOTE. These kind of seem to disagree, because one says always undersample and the other says always oversample. My way of justifying this to myself is that the Batista paper doesn't take into account this bagging element; they just do undersampling. I think there's a lot of variance in what they're seeing, because they're not doing this bagging. But either way, just try the random methods first, because they're probably good enough. And only once you've tried doing that do I think it makes any sense to start worrying about SMOTE or any of these really complicated techniques. Because really, at that point, you probably have a bigger problem that can't be solved by just some fancy algorithm. You need to go find more data. You need to figure something else out.
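One way to read "bag classifiers induced over balanced bootstrap samples" is sketched below: draw many bootstrap samples that each contain equal numbers of minority and majority points, fit one model per sample, and average the predictions. This is an interpretation written for illustration, not the procedure from the Wallace paper itself; the data, the shallow decision trees, and the count of 25 models are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Made-up data with roughly a 5% minority class (label 1).
X, y = make_classification(
    n_samples=2000, n_features=5, weights=[0.95, 0.05], random_state=0
)
minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == 0)

def balanced_bootstrap(rng):
    """Row indices for a bootstrap sample with equal class counts."""
    n = len(minority_idx)
    chosen_minority = rng.choice(minority_idx, size=n, replace=True)
    # Random undersampling: draw only as many majority points as minority points.
    chosen_majority = rng.choice(majority_idx, size=n, replace=True)
    return np.concatenate([chosen_minority, chosen_majority])

# Fit one small tree per balanced bootstrap sample.
models = []
for _ in range(25):
    idx = balanced_bootstrap(rng)
    models.append(DecisionTreeClassifier(max_depth=3, random_state=0).fit(X[idx], y[idx]))

# Bagging step: average the predicted minority-class probabilities.
bagged_proba = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
print(bagged_proba[:5])
```

Averaging over many undersampled models is what lets each individual model throw away most of the majority class without the ensemble as a whole losing that information.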

Thinking in Bets for Data Scientists

Samuel Taylor — Sun, 20 Oct 2019 14:03:36 GMT

Data scientists are uniquely positioned to provide leadership on their teams around risk and uncertainty. We are trusted by our coworkers to have an understanding of experimentation and data-driven decision making. This trust can be leveraged to improve processes, decisions, and (ultimately) the output of our teams.

In Thinking in Bets, Annie Duke describes how to make good decisions, informed by her time as a professional poker player. Life (she argues) and data science (I argue) are like a game of poker. In some games, like chess, each player has perfect information. They know where all the pieces are, what moves those pieces can make, and the conditions for winning. But poker is a game of imperfect information. Each player knows only the cards in their own hand. While they can intuit things from the body language or play style of other players, that intuition is not perfect. Some players are really good at bluffing. Some behaviors are easy to misread.

Poker, thus, requires players to make decisions in a system with imperfect information (often with high dollar amounts on the line). Doesn't this sound like life? The stakes are high, and we don't know what the future holds, but we have to make some decision. By Duke's definition, that's what a bet is: "a decision about an uncertain future".

In the face of uncertainty, we data scientists resort to experimentation. We can determine the best color for a button or the best copy on a page or any number of other things by running a well-designed experiment. While this is valuable work, it only scratches the surface in terms of where we can apply good decision making.

Running experiments is only useful insofar as they help us unlock real value (i.e. moving our OKR's or KPI's). Think about it like driving a car. When you press the accelerator, the tachometer shows an increase in the RPM at which your engine is turning. This turning is then put through a system of gears and eventually spins the wheels [0]. The ability to run experiments quickly is akin to being able to turn the engine quickly. Without a good system for choosing the right experiments and leveraging their results, we limit our ability to impact the business. When we view decision making as a process of choosing the right bet, we can choose strategies that help us make better decisions and have greater impact.

Red Teams

The strategy from Duke's book that I found most directly applicable to my work as a data scientist is Red Teaming. Established after 9/11, these teams have as their express goal "arguing against the intelligence community's conventional wisdom" [1]. By "spotting flaws in logic and analysis," red teams help drive intelligence agencies closer to both the truth and a proper understanding of the uncertainty in analyses [2].

Within weeks of reading this book, my team and I happened to be working on understanding our KPI's better, which involved some new analysis. This seemed like a perfect time to apply a "red team" strategy -- as someone would posit a result, I would see if I could disprove it. Whether I could or couldn't, I reported on both. And then other members of the team would try to prove or disprove my result! By this collaborative process, we came to understand the truth of the situation where we could have easily misled ourselves.

If you want to try this out, here's a few techniques I've found useful:

Explicitly try to disprove an analysis. If you have sufficient time, working to show how an analysis is wrong can be really valuable, even if it withstands scrutiny. You will likely find some small issues in the way something is calculated, some ambiguous terms or metrics that could be misunderstood, or an invalid assumption. These findings can lead to further quantification of their impact on the original analysis. In this way, we can gain a better understanding of how confident we should be in said analysis.
Try to reproduce an analysis. Avoiding looking at the code for the original analysis, try to get the same result via a slightly different pathway. If the original author used raw log data, see if you can answer the question using the data warehouse. Come up with new metrics that should move in the same direction as those used in the first analysis. If two people come to the same conclusion independently, we gain confidence in that conclusion.
You've got to be careful with this one! Knowing the hypothesis that is being tested can skew the analysis you do (even unconsciously) [3].
Answer an adjacent question. Sometimes there just isn't enough time to fully reproduce or disprove an analysis. In these cases, we can test an upstream cause or a downstream effect instead.
For instance, if our analysis finds that sales of trucks decrease when gas prices are high, we could look up years in which gas prices were high and see if truck sales were down.

While these techniques are most effective when applied by a separate person who hasn't been influenced by the same process/data as the original author, I have found value in explicitly shifting my perspective to "red team" myself. Working specifically to disprove my own analysis, I end up understanding the results in greater depth.

Be humble

Humility is a key element of truth-seeking. We must remember that the point of our work is not to prove to our teammates that we are geniuses; we're trying to produce some positive result for our employer. We are more likely to find the truth when we seek it rather than pursuing our own glory.

I believe that sometimes our drive to compete gets in the way of our humility. I am not a particularly competitive person by nature, but if you're reading this and you are competitive, Duke has some advice for you:

Keep the reward of feeling like we are doing well compared to our peers, but change the features by which we compare ourselves: be a better credit-giver than your peers, more willing than others to admit mistakes, more willing to explore possible reasons for an outcome with an open mind, even, and especially, if that might cast you in a bad light or shine a good light on someone else. In this way we can feel that we are doing well by comparison because we are doing something unusual and hard that most people don’t do. That makes us feel exceptional.

When red-teaming my own analysis, I sometimes find things that cast doubt on it. Sharing my results, I'm tempted to leave these observations out. The desire to present my findings in the best light possible is (I believe) a natural one, yet one I must work against. Duke writes that "if we have an urge to leave out a detail because it makes us uncomfortable… [it is] exactly the detail we must share."

Bringing reasons to doubt to the table along with the analysis itself helps us know what information we need to get. Sometimes a little bit of additional analysis can alleviate the doubt. Sometimes the concern will turn out to reflect a small enough risk that we don't need to address it. And sometimes the only way to get more information is through an experiment. The important thing is bringing uncertainty to the table so we can address it directly.

Communicating uncertainty

We communicate about uncertainty all the time. When asked if we're going to an after-work social event, for instance, we say that we "might go" (which typically means we are definitely not going) or that we will "probably go" (it's a bit of a tossup). These phrases are examples of words of estimative probability, or WEP's. In colloquial usage, these casual WEP's are just fine, but they are less helpful when we're trying to make a good decision.

For one, different words mean different things to different people. Andrew Mauboussin's research shows that words like "maybe", "probably", and "usually" are interpreted to correspond with wide ranges of probabilities depending on the audience. For instance, when someone says that an event "might happen", her or his audience could interpret that as an event with probability between 25% and 55%. That's a huge range!

By using WEP's in communication, we run the risk that our audience will misinterpret the likelihood we think a certain event has. But there are ways to overcome this. Mauboussin advocates for explicitly giving a percentage alongside words of estimative probability. This approach is used in the medical research field, where institutional review boards require researchers to inform people of the risks in treatments using WEP's [4]. These words should be accompanied by a percentage; for instance, a researcher might inform a participant using language like, "This side effect is rare (will happen to less than 1% of subjects)".

Parting words

Certainty is alluring. But as data scientists, we should know better! The world is filled with uncertainty, and only by defining and quantifying it can we drive toward an accurate understanding of reality. This understanding, then, enables us to make higher quality decisions. And this improvement in decision-making doesn't have to stop at the individual; we can bring this idea to our teams, departments, and companies.

Footnotes:

[0] I think this is how it works; I'm really not much of a car person.
[1] Neal K. Katyal. 1 July 2016. "Washington Needs More Dissent Channels", The New York Times.
[2] Ibid.
[3] Duke here references Richard Feynman, but I can't find a direct citation. Still, this seems to jibe with my own experience.
[4] University of Tennessee, Chattanooga.

Using Open Source Tools for Machine Learning

Samuel Taylor — Sun, 13 Oct 2019 21:07:55 GMT

Delivered at:

All Things Open 2019. Slides available here.
PyData Austin 2019. Recording available here.

Find me on Twitter @SamuelDataT.

Transcript

Have you ever applied for a credit card? I know I have. A few weeks ago, I was up late at night. And I suppose that my idea of a fun time is to try to get some credit card rewards. So all my friends are out partying, and I was like, "Man, I'm going to get these airline miles!" I start applying for this card. And I get to a certain point, and it has me fill out all this really personal information. And then I click a button that says submit, and the page loads, and within a split second it's telling me whether or not I got this credit card, which, like, blows my mind. First, I'm sure they don't have anybody reading this application at 2am. And secondly, even if they did, that person clearly can't make a good decision about this within a split second.

So I'm like, how did they do this? And the secret is that they are using machine learning. In other words, what they're doing is looking at past information and using that to come up with some mathematical formula for determining whether they should extend me a line of credit. Because math is really fast, that page can load really fast.

I work for a company called Indeed, as a data scientist. This is a slide that is in just about every presentation that is ever given at Indeed. People are really serious about this mission that we have as a company: we help people get jobs. So I feel very obligated to have this in the slide deck, and it would not be a true Indeed presentation if I did not mention that we help people get jobs. If you're interested in any of this stuff, please come talk to me afterward. I would love to hear what you are into.

For those of you in this room, hopefully you have some idea of what this talk is, but just to clear the air a little bit: we're going to talk about machine learning today. This is an introductory-level talk, and it is extremely friendly to newcomers. So if you know nothing about machine learning, you are welcome here and I'm incredibly glad you're here. You're going to learn a lot today, and it's going to be really fun. At the same time, if you do know a little bit more about machine learning, I hope that this is helpful to you. I know that in my own experience, I find it really helpful to see what other people are doing and the different ways that they're applying machine learning techniques. So I hope that by this sort of use case approach that we're going to be taking today, you'll be able to see some problems that I've run into and maybe think differently about your own problems. We will be going through machine learning in an applications kind of way. I have found that I learn best by doing, and while we can't necessarily all do in this room, I think stepping through real-world problems can help us understand why we need certain things in machine learning a lot better than a strictly theoretical approach. At the same time, I respect the theory of machine learning quite a bit. There has been a lot of really good science and research that has gone into the theory of machine learning, which can help us do this job better and apply machine learning a lot better. So we want to make sure that even though we focus on application, we respect the theory.

There's a lot of things this talk isn't. First of all, I don't have a PhD, so you don't need a PhD to do this. There's not a specific credential that's going to make you great at machine learning. And because we only have 45 minutes today, this is going to have some code examples in it for sure, but it's not a tutorial-style, hands-on thing that's happening in here. So this is not an end-all-be-all reference. My goal is that by the end of this, you'll have some idea of what machine learning is, have your appetite whetted, and want to go learn more about this. Come up to me and talk to me afterward. I'm happy to send more resources along or talk to you about what good next steps are.

Here is the way that we're going to be doing this today. We will start off with some stuff that we just need to know, some groundwork on what machine learning is. Then we're going to be walking through a set of use cases, and in each of these use cases, we will discover something about machine learning, maybe some different techniques that we have to apply.

Let us start off with just: what even is machine learning? Okay, if you're in this room, and you have heard the phrase machine learning before, can I get you to raise your hand? Okay, it looks like we don't have any liars in here, which is great. A lot of times when I've asked this question, there will be people who won't raise their hand. I'm like, you're lying. I've said it already in this talk. Come on. If you feel like you've used machine learning, like definitely in production, or if you've used it in some way that you found interesting or fun, could I get you to raise your hand? Okay, awesome. We've got a great mix of people in here. So if you are one of those intro people and you're a little bit shy about coming up to me, one of those people who just raised their hands, I'm sure, would also be very happy to help you. Cool.

So let's talk about machine learning. If you're new to this, I have found that getting into a new topic can kind of feel like riding the subway in a foreign city, where you'll walk up to someone and say, hey, I'm trying to get to this place, how do I get there? And they'll say, oh, just get on the red line to this stop, and then take that to the green line, and then you're there, it's fine. But if you don't know what the subway system looks like, it's going to be difficult for you to put that into your own mental map and remember it and get through. So, to provide a little bit of a map: this is a sort of hierarchy or taxonomy of machine learning that a lot of people use, where we talk about supervised problems and unsupervised problems. And then there's a lot of other stuff in the field that we won't really talk a whole lot about today. We'll start with supervised machine learning, and that's what the bulk of this talk is about. Supervised machine learning is machine learning where you have defined inputs and defined outputs. We break this further down into classification problems and regression problems.

And we'll talk about classification problems first. So this is that example that I gave at the beginning, of whether or not we're going to give somebody a credit card. The input data is the stuff I have highlighted in yellow here. So someone might come to us and say, hey, I am 50 years old, and my net worth is $250,000. And from that we make a decision on whether or not we want to give them credit. Obviously, this is a simplified example. And as you can tell by this hand-drawn illustration, I made this data up. I don't know, there probably is a super rich 12-year-old out there who just has $500,000, but that just ended up happening because of the way I arranged these things to be easy to explain.

So what do we do in classification problems? For all of the hype and excitement and joy that there is around machine learning, the dirty secret is that all we're doing is drawing a line. That's the whole thing that we're going to be talking about today. It's just fancy line drawing. And the real magic of machine learning, which is really just math, is figuring out good ways to draw those lines that end up being helpful to us in the real world. So we draw this line, and then that's our classifier. That's one thing you'd call it; you could also call it a model. And when we then want to understand, for a new point, is this person someone we should give credit to, we can just put the point on the graph and look up, okay, is this on the approve or deny side? In this case it happens to be on the approve side, so we say that we would approve this person for a credit card.

Regression is another kind of supervised machine learning. It is very similar to classification in that we have input data and we have output data. In this example, the input is on the x axis, someone's net worth, and the output is on the y axis, the size of loan we are willing to give this person. In this case, we only have one input value, whereas in the last example we had two different input values. You can have as many as you want, or as few as you want, as long as you have at least one, because if you have no input information, you're just rolling the dice. Anyway, again, what we're doing here is we're going to draw a beautiful, fantastic line. This is the line that we end up drawing; it seems like it kind of gets close to a lot of where these little x's are on our graph. And we say that this is our model. And then when we want to use it, say a new customer comes in and says, I have $500,000. We will draw a line from where they are on the x axis up to the line, and then over to the y axis, and we can determine that this is the line of credit we're willing to extend to this person.
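The "draw a line, then read a value off it" idea could be sketched like this with scikit-learn, the Python library the talk turns to later. The net worth and loan figures are invented for this example, much like the hand-drawn data on the slide.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data: one input (net worth) and one output (loan size).
net_worth = np.array([[50_000], [120_000], [250_000], [400_000], [750_000]])
loan_size = np.array([5_000, 11_000, 26_000, 41_000, 74_000])

# Fitting the model is the "draw a line through the x's" step.
model = LinearRegression().fit(net_worth, loan_size)

# Using the model: go up from $500,000 on the x axis to the line,
# then across to the y axis to read off the loan size.
predicted = model.predict([[500_000]])[0]
print(f"slope: {model.coef_[0]:.4f}, intercept: {model.intercept_:.0f}")
print(f"predicted loan for $500,000 net worth: ${predicted:,.0f}")
```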

So that is all supervised machine learning, where we have a defined input and a defined output. Unsupervised machine learning, as you might guess from the name, is different from supervised machine learning. A key algorithm here that you will probably run into is called clustering. And it's kind of weird, because in the last examples, we had inputs and outputs. In this example, we just have data: we have some shapes with colors. And we're like, cool, I have this data, I want to understand something better about it. One thing we might do is walk up to a computer and say, hey, turn this into three groups for me. And it might say, here, look, I made these great groups for you, these are amazing. Or it might say, here's these groups, they're grouped by shape, look at how great this is. And you can say, thank you, computer, this is very kind of you to help me understand the underlying structure of this data. But we don't have a defined input and output that we're trying to figure out.
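Asking a computer to "turn this into three groups" might look like the following sketch, which uses k-means from scikit-learn as one example of a clustering algorithm (the talk does not name a specific one here). The three blobs of points are synthetic and deliberately well separated so the grouping is easy to see.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Made-up data: 20 points scattered around each of three centers.
# Note there are no labels here, only the points themselves.
centers = np.array([[0, 0], [10, 10], [0, 10]])
points = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

# "Hey, turn this into three groups for me."
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
labels = kmeans.labels_

# Each point now has a group number; which number goes with which blob is arbitrary.
print(labels)
```

The algorithm decides the grouping from the structure of the data alone, which is why the same data could just as well come back grouped by shape instead of by color.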

There is a lot of other stuff in the field of machine learning, and there's a lot of really active research going on. This includes things like reinforcement learning and active learning techniques. I will say this: this looks like it's a presentation slide. For me, this is actually an insurance policy against people being mad at me on Twitter. So if your favorite algorithm isn't mentioned, it is. Look, it's right on this slide. Yay.

So to summarize, in machine learning, we are trying to use data to approximate some function that we care about. We have some f(x) that takes an input of x. In our classification example, that would be someone's age and their net worth, and we want to predict whether or not we should give them credit. The problem is that we don't necessarily know what that function is. For instance, if we're trying to determine whether somebody will default on a loan, they might have some weird random medical expense that causes them to default on the loan, and there's no way we could have known that. The thing that gives us hope is that we can gather some data that is measured from this f(x), but it has some noise associated with it. So there are going to be those situations that we don't know about and can't figure out, and these machine learning techniques all try to get around that noise and understand what the truth underlying the noise is.

So in summary, the way I think about machine learning is that it is a set of algorithms which attempt to find a g of x, which is a good approximation for f of x.
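To make the f(x) and g(x) idea concrete, here is a small sketch in plain Python. Everything in it is an assumption made for illustration: the "true" function, the noise level, and the choice of a straight line fitted by least squares as the family of g(x).

```python
import random

def f(x):
    # The made-up "true" function. In real life we never get to see this.
    return 2.0 * x + 1.0

rng = random.Random(0)
xs = [i * 0.2 for i in range(50)]
ys = [f(x) + rng.gauss(0.0, 0.5) for x in xs]  # measurements = truth + noise

# Fit g(x) = slope * x + intercept by ordinary least squares.
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = (
    sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    / sum((x - mean_x) ** 2 for x in xs)
)
intercept = mean_y - slope * mean_x

def g(x):
    # Our approximation of f, learned only from the noisy data.
    return slope * x + intercept

print(f"g(x) = {slope:.2f} * x + {intercept:.2f}")
```

The learned slope and intercept land near 2 and 1 without the fitting code ever looking inside `f`; it only ever sees the noisy measurements.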

With all that said, let us begin with our first use case of the day. And this will be the credit card application stuff. On each of these, we will be going through these five questions. So if you get this flow, that's what we're going to be doing all day. The first thing we'll do is talk about what the problem is. And in this case, we're wondering if we as the bank should issue a certain consumer a credit card. The data looks something like this. This is that same data that I showed earlier, it's just a pretty good table because I figured out how to use Google Slides better later on in this. And what we see is we have the input of age and net worth, and the output is whether or not to give them credit.

The most fun part of this entire presentation is going to be this question. This is the audience participation part. If you have some pent up anger, if you feel like you really need to have input into this discussion, this is your time. Shout at me. What kind of machine learning problem is this? Classification? Anyone else think it's something else? Wow. Very, very hive mind, but very correct. Good job. Yes, this is classification.

Okay, and the next thing we'll talk about is the solution. There are so many good machine learning libraries out there. No matter what language you're using, I am pretty sure that it has a great open source library for machine learning. In Python, there's one called scikit-learn that is incredible. R has just a whole set of things that you can do. Java has a library called Weka. Today, we're going to be using Python because it's basically executable pseudocode, and I believe that everyone in the audience will be able to follow along. So let's do that. This will get you familiar with the way of using scikit-learn specifically, but other libraries have similar thoughts and similar patterns.

The first thing we do, obviously, is just import the class that we need to use. We can then set up the data, and I've drawn that graphic on the right here, so you can see that it's that same data. Then we instantiate the object. And now we get to the critical part for scikit-learn. There are always these two methods when you're building a model. You always fit first, and that will actually do the work of figuring out where this line is. And then, when we have a question about a new point, we call predict, and that will tell us whether this point is supposed to be an approve credit or a reject credit. Fit, and then predict.
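The slide's code is not in the transcript, so the following is only a sketch of the fit-then-predict pattern described here. It uses `LinearSVC`, which the talk names later as the model shown earlier; the applicant data and the units are invented for the example.

```python
from sklearn.svm import LinearSVC

# Made-up applicants: [age in years, net worth in thousands of dollars].
X = [
    [22, 5], [25, 10], [30, 15], [28, 8],           # rejected
    [45, 200], [50, 250], [60, 300], [55, 220],     # approved
]
y = [0, 0, 0, 0, 1, 1, 1, 1]  # 0 = reject credit, 1 = approve credit

# Instantiate the model, then fit: this is where the line gets figured out.
model = LinearSVC()
model.fit(X, y)

# Predict: ask about applicants the model has never seen.
new_applicants = [[24, 7], [58, 280]]
predictions = model.predict(new_applicants)
print(predictions)
```

Most scikit-learn estimators follow this same shape, so swapping in a different model usually means changing only the import and the constructor line.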

So we might now wonder, how accurate is this? Is this a good model? And in this case, because I drew this data to be purposely easy, this is an incredible model on this data. It is 100% accurate, it is beautiful. But in the general case, it's not always going to be that easy. You might have a situation like this. Let me tell you how I made these graphs. What I did was I made up an f of x, I just invented a function f of x. And then I drew a bunch of points from f of x and added some noise. So that's why you see this sort of scattershot thing around where the true function probably is, right? By show of hands, who thinks that Model A here, this blue line that's sort of sloping down and to the right, is doing better in terms of error than Model B does? Does anyone think that Model A is closer to the true function than Model B? Right, zero hands up. This is not a trick question. You are right: Model B is much closer to the true function, which is this green line going up and to the right. And this is easy for humans to figure out.

But we need to come up with a way of helping computers to clear this up. The way that we generally do this is by randomly splitting our data into testing data and training data. This can be done with a function in scikit-learn that's called train_test_split. But the essence of what we do here is take our data and just randomly assign 20% of it to be testing data. And when we do that, because that assignment has been made completely at random, when we train our model on the training data, we can then call predict for each of the testing data points and figure out whether our model was right or not. So then, when we have that done, we see these little colored regions on this graph. And you'll notice that on each graph, the colored regions are the same, but the points are different. The reason the colored regions are the same is that you should only train your model on the training data; that's why it's called the training data. And then when we test it on the testing data, we can get some estimate for how good our model is going to be at real world data. So you can see that we made some errors here. Even in the training data, there are errors. And then we can see that there are some errors in the testing data. And we might see that out of the, you know, 20 points we have here, two are wrong. So we might expect that, for new points that come in, we're going to be wrong on roughly 10% of them. And that might be acceptable and might not be acceptable, depending on your problem.
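Here is a hedged sketch of that random 80/20 split with scikit-learn's `train_test_split`. The applicant data and the labelling rule are invented so the example can run on its own; only the split-fit-score pattern is the point.

```python
import random

from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Invent 100 applicants: [age, net worth in thousands], with a noisy,
# made-up rule deciding who was approved.
rng = random.Random(0)
X, y = [], []
for _ in range(100):
    age = rng.uniform(18, 70)
    net_worth = rng.uniform(0, 300)
    approved = net_worth + 2 * age + rng.gauss(0, 30) > 200
    X.append([age, net_worth])
    y.append(int(approved))

# Randomly hold out 20% of the rows as testing data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Train only on the training data...
model = LinearSVC()
model.fit(X_train, y_train)

# ...and estimate real-world performance only on the testing data.
accuracy = model.score(X_test, y_test)
print(f"test accuracy: {accuracy:.2f}, test error: {1 - accuracy:.2f}")
```

The test error printed at the end plays the role of the "two wrong out of 20" estimate from the slide.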

Ideally, at this point, you would just calculate the real cost of each of these kinds of errors. So what I mean by that is, if we're trying to predict from radar data whether there's a warhead coming at the United States, and we say that there's a warhead coming at the United States, and there actually isn't one, that is a huge mistake, and like a bunch of people are going to die, probably everybody, which is really bad. By contrast, if somebody is able to put their finger on my phone and get into my phone, they're going to go to my gallery and they're going to read all of my memes that aren't funny, and they're going to make fun of me, and that's going to hurt my feelings very deeply. Which is, you know, that's going to be hours in therapy after that, for sure. And that's a much higher cost than if I have to tap the back of my phone with my finger again; it's like, I'm going to get a little annoyed. And in real problems, you kind of see this often happen, where different errors are different levels of costly. So ideally, at this point, you would be able to figure out what the cost of your model is, and then figure out which one is the best.

In real life, we don't always know what the real cost is. So we use these error functions to help us figure out what a good model is when we don't know what the real cost is. Mean squared error is a really common function that we use to determine whether a regression model is doing well. I have a graph here of the true values and the predicted values for a certain data set. And we take the predicted value and the true value, subtract them, and then square that difference to get a positive number. And then we add them all together and divide by the number of points to get a mean. And now we say this model has an error of 18. And then if we were comparing it to another model, and the other model had an error of 17, we can know that the other model is better than this one. For classification problems, we could say that if we have some points, and we have determined that in real life half of them are blue and half of them are orange, but we predicted that three of them are blue and one of them is orange, we can see that we've made an error in that one case, and then say that our classification error is 25%: roughly 25% of the time, we get it wrong.
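Both error functions are short enough to write by hand. This is a sketch in plain Python; the regression numbers are invented, and the classification example uses four points so it mirrors the blue and orange case above.

```python
def mean_squared_error(true_values, predicted_values):
    # Subtract, square to make it positive, then average.
    squared_diffs = [(p - t) ** 2 for t, p in zip(true_values, predicted_values)]
    return sum(squared_diffs) / len(squared_diffs)

def classification_error(true_labels, predicted_labels):
    # Fraction of points where the prediction does not match reality.
    wrong = sum(1 for t, p in zip(true_labels, predicted_labels) if t != p)
    return wrong / len(true_labels)

# Regression: made-up true and predicted values.
mse = mean_squared_error([10.0, 20.0, 30.0], [12.0, 18.0, 33.0])
print(mse)  # (4 + 4 + 9) / 3

# Classification: half blue and half orange in real life,
# but we predicted three blue and one orange.
actual = ["blue", "blue", "orange", "orange"]
predicted = ["blue", "blue", "blue", "orange"]
error = classification_error(actual, predicted)
print(error)  # one mistake out of four points
```

scikit-learn ships ready-made versions of both ideas in `sklearn.metrics` (`mean_squared_error` and `accuracy_score`, where error is one minus accuracy), so in practice you would rarely write these yourself.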

So lessons learned is the last part of each use case. In this case, this stuff is pretty neat. Like, it's not hard to do this: you can pip install something and get up and running in a matter of minutes. And it's not as intimidating as it might sound. Another important lesson that we learned is that when we split out our data into training data and testing data, we can get an estimate for how good our model is.

Let us move on to our next use case. This will be talking about teaching a computer sign language. So what's the problem? The problem is, I don't know sign language. But there are deaf people who only communicate in sign language that I would love to communicate with, and I don't have a way to do that. And I was trying to come up with a way that I could solve this problem. And I had recently gotten this little kind of toy that sits on your desk, and it has a little set of infrared LEDs in it, and you can plug it into your computer. It looks up at your hand, it has a little camera in it, and it's shining IR at your hand, and it can figure out what positions your hands are in. So this is an example of someone who's sort of holding their hands up like this, kind of over the little sensor. And you can see that each of these little balls on the screen is a point in three dimensional space. And they're generally joints on your hand. If you look at where these line up, there's the end of your fingertips and there's one in the middle of the wrist; it gives you certain points on the hand. So we were kind of trying to figure out, can we use this thing to teach a computer sign language, have a computer translate sign language for somebody? That'd be really cool. And we were going to this hackathon that was happening at Texas A&M. And we thought, okay, sign language is actually really hard. I don't think we're gonna be able to do all of that in 24 hours. So we figured, what if we just do American Sign Language? And then even further, what if we just do the alphabet? And we thought, okay, maybe we can get something that will be able to tell us what letter of the alphabet we're holding over this sensor.

So now we move into what the data looks like. We plugged the thing in, and we held our hand over the sensor. And we said, hey, this is an A, and we just clicked A on our keyboard a bunch of times and moved our hand around and got some training data. Then we made a B and held it over the thing and hit B on our keyboard a bunch of times and moved our hand around to get some training data. So what you see here is we have X, Y and Z points for each joint in the human hand; it gives you 20 points. And then we have as our output the value of the sign, so that's an A, a B, a C, etc. So there are 26 of those in total.

Here we go. Are y'all ready? What kind of machine learning problem is this? Supervised learning? Yes. Classification? Both correct. Classification is a certain kind of supervised machine learning. Great job. Yeah. And it's classification because there are only 26 values. Great work.

So when we started to solve this, the first thing we had to do was pick a model. And there are a lot of different algorithms out there. The one that we showed earlier was called a LinearSVC. But there are a lot of machine learning algorithms, and we didn't know which one was going to work best. So we started off by splitting our data into training data and testing data. Then we just got a bunch of models together, and we trained them all on the training data. And then we evaluated them all on the testing data. And we picked the one that did best on the testing data. And we didn't really know what we were doing at this point. But as I discovered later in life, this is not the worst way to do this, as long as you don't repeatedly do it, and you will end up having a pretty good model by doing this. You can run into a problem where, if you do this over and over and over again, what you end up selecting for is models that are really good at your testing data. And that testing data might in some ways differ from real life. So as long as you're not doing this too much, you'll be okay.
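A sketch of that "train a bunch of models, keep the one that does best on the testing data" loop follows. The hackathon's hand-tracking data and the exact list of models tried are not in the transcript, so this uses synthetic data from `make_blobs` as a stand-in and four common scikit-learn classifiers chosen for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the sensor data: 5 classes, 6 numeric features.
X, y = make_blobs(n_samples=300, centers=5, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# An illustrative set of candidate models (not the hackathon's actual list).
candidates = {
    "LinearSVC": LinearSVC(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
}

# Train every candidate on the training data, score it on the testing data.
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)

best_name = max(scores, key=scores.get)
print(scores)
print("best on the testing data:", best_name)
```

As the talk warns, rerunning this loop many times against the same testing data starts to select for models that happen to suit that particular test set.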

So once we had this model, we figured, okay, it's cool, we have a model, but we can't just tell people it's a model for sign language; we have to build some sort of application. And the first thing we tried to do was make a keyboard. And that did not work very well at all. We could not figure out a good way to tell when the hand was changing between signs, and the accuracy of the model was actually not as good as we wanted it to be. So sometimes we would make, like, a J, and J is actually a sign that moves. And we were doing everything static, so there are certain signs that we just didn't have a good way to characterize and we weren't doing super well on. So we tried making a keyboard and it did not work. But the interesting thing was, it was good enough to make a little Rosetta Stone for sign language kind of thing. And I think this was one of the things we learned that was really important: it's not just about the model.

So here's a little demo that I'll show you, assuming the Wi-Fi works. It does. Yeah. So this is what the game looked like; you can see my messy desk. It would sort of show you a sign and say, hey, make this letter, it's a B or a J or whatever, and give you some amount of time. And you would go through and make the letter. And once you've got it right, it would, you know, give you points and reward you with some place in the leaderboard at the end. So I made a little Rosetta Stone kind of thing that we called Sign Language Tutor.

The code for this is available. If you just want to see how this stuff kind of works, feel free to go look at it. And there's a link to the slides at the end of this also, so if you want to just go there, you can click on this link and it'll load. Leveraging a lot of open source tooling was a really helpful way to get through this in 24 hours. scikit-learn was obviously a big plus, and then we also used Redis and Flask as ways to make this possible.

So let's talk about some lessons here. It's really important that you come up with a good way to define the problem that you're working on. We originally started with saying sign language, this is what we want to tackle. But what we realized was you have to scope it down much smaller than that. Oftentimes, the best way to solve a big problem is to break it into smaller problems, and then solve each of those individual smaller problems. This is something that I didn't realize how practical it was going to be in real life. This is something that I run into all the time at work: we want to solve this really large problem, and we don't quite know how to do that until we can find a smaller subset of it that we can solve. And that's the important thing about limiting scope: figuring out what we can actually achieve in a reasonable amount of time to prove that this is a valuable thing to do. We also talked a little bit about how to select models; it's important to do that, and this is a reasonable way of doing that. Critically, though, and this is another thing that I wasn't expecting would be a lifelong lesson out of the hackathon, the model isn't the only thing that matters. You could have a model that's not good enough to be a keyboard, but is good enough to be a language learning game. And this is something where, if you're working in a corporate setting, you'll want to work with your product people, and you'll want to go talk to your customers and really understand what they need out of this, and figure out, well, even if it doesn't solve this use case entirely, maybe we can reduce their workload by 50% or something like that. And that can still be a really valuable way to apply machine learning.

Next use case: let us talk about forecasting energy load. We're going to go through those same five questions. So the first thing, what is the problem? The problem here is that we need to know when to schedule energy production. By which I mean, if we pretend for a second that we operate an energy grid and we're trying to deliver power to a lot of residential and commercial customers, we need to know when they're going to want energy, when they're going to use energy, because we don't have good ways of storing it for very long. I'm not a hardware person at all. I know nothing about engineering, like real engineering. But I don't think batteries are very good right now. Like, that's the sense I have. And so we often have to schedule when we spin up our power plants in order to get that to be close to the time when people need the energy, so that it can get to them, or something like that. This is not an entirely hypothetical problem. There's this agency called the Electric Reliability Council of Texas, ERCOT, and I live in Texas, so this is why this is a relevant example for me. But for those of you who are not familiar, because you would have no reason to be familiar with the energy system in Texas, we have this deregulated energy market where you can buy power from, like, whoever you want, and those people selling you power are then turning around and buying it from other people, and it's ERCOT's job to manage that grid and make sure that things are happening at the right times. They sort of divide it into these zones that you see up here, by weather. Because in Texas, as I'm sure is also true here in North Carolina, we love our air conditioners in the summer; we like to not sweat. So the weather is the driving factor for energy in most of Texas.

So let's talk about what the data looks like here. We have for each of these weather zones, some amount of power being used on an hourly basis. And what we kind of have is the input data is the day and the hour. So just like the time that energy is being used, and then the output data is we could pick any one of these weather zones and say, okay, we want to build a model that can predict, you know, overall usage, or we want to build a model that can predict usage just in the south region or whatever. I just think this graph is also kind of fun. So this is a graph of like energy usage over time. And you can see the seasonality in here you can see when it's summer because there's these big spikes where people are using their air conditioners a lot more. And then you can also kind of see where winters were colder because you'll see people suddenly using their heaters more, which is interesting. So even at this point, this is kind of cool.

But at this point, we are now to the question of what kind of machine learning problem this is. We're trying to predict how much energy is going to be used at a certain hour of the day. Regression? I don't think it's anything else. Regression. Okay. Yes, it is regression. Thank you all for your participation, I really appreciate that. So, it is a regression problem.

This is kind of an interesting, different regression problem than the earlier example we had of trying to predict how much of a credit line you should extend somebody. And the reason is that time series data exhibits seasonality. So this is looking at the overall system load by week, and you can see that we have these ups and downs. And these correspond with the seasons of the year, because human behavior often maps pretty well to the seasons. And you see this a lot in time series data. And by time series data I simply mean data where you have some time component to it.

If you're using time series data, and you're doing what I already told you to do, which is randomly split training data from testing data, you're going to leak information and that will hurt you bad. So let's sort of look closely at this orange point here. You'll see that it's kind of surrounded on both sides, both earlier and later, by blue points. And if this orange point is a testing data point, and we try to predict what the energy value is going to be for that specific day, the blue points around it give us a lot of information about what that orange point's value is. And so what you see happening is that, if we split the data randomly, our model ends up effectively knowing the future with respect to some of our testing data points, which is not a good thing, and it won't happen in real life. So you can trick yourself into thinking you have a really good model when you actually don't.

When you're using time series data, it's important that you split based on the time. So instead of doing that random thing, you do kind of what I've drawn up here, where I've done six different splits of the data. And what you would do ideally is split it up this way, where you have, okay, up to a certain day, this is training data, and after that is testing data. And that more closely mimics what will happen in real life, where we have everything that we've seen in the past, and that's our training data, and what we're going to be testing on is everything that's happening in the future. Now, another critical thing to know about time series data is that some models don't do a good job of picking up on the different seasonal trends; they're not able to figure that out.
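Splitting by time can be sketched in a few lines of plain Python. The helper name `expanding_time_splits` is made up for this example, and the expanding-window layout (each split trains on everything before its test block) is an assumption about what the six splits on the slide looked like.

```python
def expanding_time_splits(n_points, n_splits):
    """Yield (train_indices, test_indices) pairs for time-ordered data.

    Assumes the rows are already sorted from oldest to newest. Every test
    block comes strictly after its training block, so the model never
    sees the future.
    """
    fold_size = n_points // (n_splits + 1)
    for k in range(1, n_splits + 1):
        train_end = k * fold_size
        # The last split takes whatever rows are left over.
        test_end = n_points if k == n_splits else train_end + fold_size
        yield list(range(train_end)), list(range(train_end, test_end))

# Example: 14 hourly readings, six splits.
splits = list(expanding_time_splits(n_points=14, n_splits=6))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

scikit-learn offers a class along these lines, `sklearn.model_selection.TimeSeriesSplit`, which is likely the more practical choice in real code.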

So our savior here is this open source library called Prophet, which has integrated a lot of the learnings about time series data into a nice, easy to use package. And we'll sort of walk through how, on each of those training and testing splits, it figures out what the seasonality pattern is. You can see that with only a year of data, it doesn't really figure out what exactly is happening; it can't figure out that this is sort of a sine, just a little wave. But then as you get more and more data, it starts to become more and more confident and know better and better what the seasonal trends are.

So we have learned some things today. The first thing that you should take away is that if you run into something that has a time component to it, you need to be extra careful, because there are a lot of things about it that are special, and you can lead yourself astray. Seasonality is the biggest one. And if you do a random train test split, you will know the future when you're doing training, which will mess you up.

Alright, last one: we're going to talk about using machine learning to find your next job. The problem here was that a few years ago, I was not actively job hunting, but I was just interested in seeing what's out there. You know, I was just sort of passively looking around. And when I signed up for job newsletters, they were way too noisy. I got way more jobs than I ever wanted to look at, and I was wasting time reading through these emails. I was like, I don't want to do this; I want to get an email that has like three jobs in it that might be cool, right? So what I started to do was I would go look at job search listings, and I would copy and paste the title and the company and then a link to the job description. I did this in a Google Sheet. And then when I was bored, like when I was at the bus station or waiting in line for something, I would go and read the job descriptions and come back to my spreadsheet and mark whether or not it sounded cool to me. And I will have to admit to you here that I definitely spent way more time reading job descriptions for this than I would have if I had just bitten the bullet and dealt with the noisy emails. But, I mean, you know, we're at an open source software conference today. So if this isn't a safe place where I can be among over-engineering nerds, then there's no good place for me. So cut me a little slack on the over-engineering here.

So the next question is, what kind of machine learning problem is this? We're trying to figure out whether or not a job sounds cool. Classification? Anyone think it's anything other than classification? Okay, shaking heads. Yes. Great job. Yeah, classification. Wonderful.

So at this point, we want to build a model. And we've seen in the past ways to build models. But the ways that we've seen in the past are all numerically based. Someone's age can be represented as a number, their net worth can be represented as a number, we can represent the day of the year as a number, we can represent the hour of the day as a number. But how do we represent a job title as a number? This is kind of confusing. And so when I first ran into this problem, I did what I highly recommend any of you do when you don't know how to do something: Google it. And so I searched something like text representations in machine learning. What I got back was this thing, which I'll explain to you. This is what is often called a word count vector, or some people will call it a vector space model of language. It's not perfect, but it's a good start. Let me explain how this works. You put all of your job titles down the rows of the matrix, and you have all of the words that occur in any job title along the columns of the matrix. And then what you do is you have a zero or a one in each little spot, or you can have more than that if the same word occurs multiple times. So for the first example, we have Senior Web Applications Developer, Data Analytics. And so we look through each of our columns and say, does engineer occur in this job title? And it does not, so we say it's zero. Web does occur in the job title, so we'll put a one there. Then applications occurs, so we'll put a one there, etc., etc.
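Done by hand, the word count matrix described above looks roughly like this in plain Python. The three job titles are invented for the example, and lowercasing plus splitting on whitespace is a simplifying assumption about how words get separated.

```python
# Made-up job titles; each one becomes a row of the matrix.
titles = [
    "Senior Web Applications Developer",
    "Data Scientist",
    "Senior Data Engineer",
]

# Columns: every distinct word that occurs in any title, in sorted order.
vocabulary = sorted({word for title in titles for word in title.lower().split()})

def count_vector(title):
    # One count per vocabulary word: 0 if absent, 1 or more if present.
    words = title.lower().split()
    return [words.count(term) for term in vocabulary]

matrix = [count_vector(title) for title in titles]

print(vocabulary)
for title, row in zip(titles, matrix):
    print(row, title)
```

Each row has one slot per vocabulary word, so titles of any length turn into equal-length lists of numbers that a model can consume.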

And this is really boring to do by hand, so we should use a scikit-learn tool that we'll talk about in just a second. But what we're able to do with this is turn this data scientist job title into this set of numbers. And we can take that sounds-cool thing and turn it into a number too, just a one or a zero. Now we have numbers, and we can just use the models that we already know about to learn something from these numbers.

This is the way that we can kind of do that. It's just some example code using scikit-learn. The first thing we do is gather together our data. So we take our titles and whether or not each one sounds cool and turn them into a matrix. And then we use this thing called a CountVectorizer, which is the thing that does this word count vector creation. We call this method that's named fit_transform. What this does is it causes the CountVectorizer to figure out what all the words are, and then make the word count vectors. We then turn it into an array, just because it's more convenient to do that. Then we have another model that we haven't spoken about today, but there's a model called logistic regression that we can use, and we fit it on this data that has now been transformed into vectors, as well as the ratings. Then we can take some new jobs and predict whether or not they sound cool. And then we get this array at the bottom here that's sort of commented out, which says the first job did not sound cool, the second job did not sound cool, but then that fourth job there sounds cool.
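The slide's code is not reproduced in the transcript, so this is a reconstruction of the steps described, under assumptions: the job titles and ratings are invented, and the variable names are placeholders. The calls used (`CountVectorizer.fit_transform`, `.toarray()`, `LogisticRegression.fit` and `.predict`) are the ones the talk names.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up training data: job titles and whether each one sounded cool.
titles = [
    "Senior Data Scientist",
    "Machine Learning Engineer",
    "Data Scientist",
    "Account Manager",
    "Sales Representative",
    "Regional Sales Manager",
]
sounds_cool = [1, 1, 1, 0, 0, 0]  # 1 = cool, 0 = not cool

# Learn the vocabulary and build the word count vectors in one step.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(titles).toarray()

# Fit a logistic regression on the vectors and the ratings.
model = LogisticRegression()
model.fit(X, sounds_cool)

# New jobs must be transformed with the SAME fitted vectorizer
# (transform, not fit_transform) so the columns line up.
new_titles = ["Sales Manager", "Data Engineer"]
X_new = vectorizer.transform(new_titles).toarray()
predictions = model.predict(X_new)
print(predictions)
```

Words in a new title that never appeared in the training titles are simply ignored by `transform`, which is one reason this representation is "not perfect, but a good start".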

So I did this, I ran this, and I used that error metric that I told you earlier was a good error metric, called the classification error. It came out and it said I had a 0.197 classification error, which means I am right about 80% of the time. And I'm thinking, this is amazing. I am the greatest data scientist in the world. This is awesome. But it turns out, I realized what was happening was, in all of the job listings that I'd read, only about 20% of them sounded cool. And it turned out that what my model had figured out was, if it just said nothing sounded cool, it would only get 20% error, and it's like, that's pretty good, let me just do that, this is way easier.
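The trap here can be reproduced in a few lines of plain Python: always predicting the most common label already gets about 20% error when 80% of the jobs are not cool. The 80/20 label counts below are round numbers chosen to match the proportions in the talk, not the real spreadsheet.

```python
from collections import Counter

# 100 made-up ratings: 80 not cool (0) and 20 cool (1).
labels = [0] * 80 + [1] * 20

# The "do the simplest possible thing" baseline: always predict the
# most common label and never look at the job title at all.
most_common_label, _ = Counter(labels).most_common(1)[0]
baseline_predictions = [most_common_label] * len(labels)

baseline_error = (
    sum(1 for p, t in zip(baseline_predictions, labels) if p != t) / len(labels)
)
print(most_common_label, baseline_error)
```

A model is only interesting to the extent that it beats this number, so a 0.197 error against a 0.20 baseline is not the triumph it first looks like.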

So this is a self portrait I drew after I realized what had happened and was very disappointed and very, very sad about the fact that my model had realized that it could exploit this sort of imbalance in the data. One way that we can combat this is by using another tool for evaluating error, called a confusion matrix. In this, we put the predicted labels on one axis and the actual labels on another. And then we count up the number of examples. For instance, this top left has the number of jobs which I said weren't cool, where they actually weren't cool. And then to the right of that, we have the number of jobs that actually were cool that I said weren't cool. And from this we can see, oh, my model is just predicting zero for everything, and we can catch the error a lot more easily.
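A confusion matrix is just a table of counts, so it can be sketched in plain Python. The helper name and the ten ratings are made up; the layout (rows are predicted labels, columns are actual labels) is chosen to match the description above. Note that scikit-learn's `sklearn.metrics.confusion_matrix` uses the opposite orientation, with actual labels on the rows.

```python
def confusion_matrix_counts(actual, predicted, labels=(0, 1)):
    """Count examples for each (predicted, actual) pair.

    Rows are predicted labels and columns are actual labels.
    """
    matrix = [[0 for _ in labels] for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[labels.index(p)][labels.index(a)] += 1
    return matrix

# Ten made-up jobs: 8 not cool (0), 2 cool (1).
actual = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# A lazy model that says "not cool" for everything.
predicted = [0] * 10

matrix = confusion_matrix_counts(actual, predicted)
for row in matrix:
    print(row)
```

The second row is all zeros, meaning the model never predicted "cool" even once. That is the pattern the plain error number hides.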

Another way we can do this is through metrics like precision and recall. Precision gives us a number which quantifies, for all of the jobs that my model says are cool, how many of them actually turned out being cool. And then recall tells me, for all of the jobs that are actually cool, how many of them am I bringing back to the surface? How many of them am I recalling and saying are cool? If I was using these error metrics, I would see that I had a recall of zero, because I wasn't bringing back any jobs that were cool.
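Here is a small plain-Python sketch of both metrics. The function name and example labels are invented, and returning `None` when a metric is undefined (for instance, precision when the model never predicts "cool") is a design choice for this example rather than a standard.

```python
def precision_and_recall(actual, predicted, positive=1):
    """Return (precision, recall) for the positive label.

    A value is None when it is undefined, e.g. precision when the model
    never predicts the positive label at all.
    """
    true_pos = sum(
        1 for a, p in zip(actual, predicted) if a == positive and p == positive
    )
    predicted_pos = sum(1 for p in predicted if p == positive)
    actual_pos = sum(1 for a in actual if a == positive)
    precision = true_pos / predicted_pos if predicted_pos else None
    recall = true_pos / actual_pos if actual_pos else None
    return precision, recall

# The lazy model: never says anything is cool.
lazy = precision_and_recall([0, 0, 0, 0, 1], [0, 0, 0, 0, 0])
print(lazy)  # recall is zero: no cool job is ever surfaced

# A model that finds one of the two cool jobs, with one false alarm.
mixed = precision_and_recall([0, 0, 0, 1, 1], [0, 1, 0, 1, 0])
print(mixed)
```

scikit-learn provides `precision_score` and `recall_score` in `sklearn.metrics` for real use.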

Some other techniques that are common when you're dealing with unbalanced data are oversampling and undersampling. In undersampling, what you do is you just take that majority class, so in our case 80% of our data sounded not cool, and you throw away a bunch of it until you get to an even split. And then at that point, we would train a model on that synthetic data set, on that data set that had half cool and half not cool. And our model would do a little bit better than just always saying not cool.

Another way we can approach this is through a technique called oversampling, where we just duplicate points in the data set that are cool, and we can get back up to an even split that way. The reason we have to do this is that some models kind of assume that you have an even split between classes, and that's not always the case. And there's a lot of detail that goes on with this imbalance stuff, and you can do a lot more research about this, and I would be happy to talk with you more about it. But just so you have an idea of some ways to approach it, oversampling and undersampling are both good ways of addressing this problem.
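Both resampling ideas can be sketched with the standard library alone. The function names are made up for this example and the rows are placeholder strings; real code would carry the feature vectors instead.

```python
import random
from collections import defaultdict

def group_by_label(rows, labels):
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)
    return groups

def undersample(rows, labels, rng):
    # Throw away rows from the bigger classes until every class
    # is as small as the smallest one.
    groups = group_by_label(rows, labels)
    target = min(len(group) for group in groups.values())
    balanced = [
        (row, label)
        for label, group in groups.items()
        for row in rng.sample(group, target)
    ]
    rng.shuffle(balanced)
    return balanced

def oversample(rows, labels, rng):
    # Duplicate randomly chosen rows from the smaller classes until
    # every class is as big as the biggest one.
    groups = group_by_label(rows, labels)
    target = max(len(group) for group in groups.values())
    balanced = []
    for label, group in groups.items():
        extras = [rng.choice(group) for _ in range(target - len(group))]
        balanced.extend((row, label) for row in group + extras)
    rng.shuffle(balanced)
    return balanced

# 80 not-cool jobs and 20 cool jobs, as placeholder strings.
rows = [f"job-{i}" for i in range(100)]
labels = [0] * 80 + [1] * 20

under = undersample(rows, labels, random.Random(0))
over = oversample(rows, labels, random.Random(0))
print(len(under), len(over))
```

Undersampling ends with 20 rows of each class and oversampling with 80 of each, so either way the model trains on an even split.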

What I ended up doing here was using undersampling. And then what I did was I made myself a little email thing that will automatically go look at my spreadsheet and find the 10 jobs that sounded the very coolest, and then I get this much shorter email. And my beautiful over-engineered result would come back and I would see this. There are some things that we have learned, so let's talk about that. The first thing is to understand the base rate. To know if your model is actually doing well or not, you need to know what would happen if you just did the stupidest thing possible: if I just predicted the most common case, what would happen? Another thing is that simple doesn't mean ineffective. I started this hoping that I would end up using deep learning or something. And then it was good enough with, like, the simplest model that you learn on day two of class. And that kind of made me sad, because I really wanted to try out something cooler, but it ended up working and that was great.

The reason for this is something that's called the approximation-generalization trade-off. And I will explain what that is right now. What this means, as you might guess from the name, is that there's a trade-off between the level of approximation that we can do and the level of generalization we can do. Approximation means, for that training data that we have, how good is our model at representing that training data? So we can see here that our red model, which is a nearest neighbor model, which is capable of memorizing data sets, is doing an incredible job at memorizing the data set. It approximates the training data perfectly; it has the exact right answer for everything. On the other hand, our green model is not really approximating the input data set very well; it's kind of a little bit far away in certain places. And that could be problematic. Here's where the simple model can do really well, though. When we start to wonder about the generalization, our simple model does a lot better, because it has given up on some approximation to get itself closer to the true function. You can see that this red model is really far away from a lot of the points in the testing data set, whereas the green model is doing a lot better on those points; it's a lot closer to those points. So the question is, do we want to get really good at knowing the training data, or do we want to generalize and learn some more broad pattern? And when we use simple models, we usually win in terms of the approximation-generalization trade-off.
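The trade-off can be seen in a small plain-Python experiment. Everything here is an assumption made for illustration: the "true" function, the noise level, a one-nearest-neighbor model standing in for the red model, and a least-squares line standing in for the green one.

```python
import random

def f(x):
    return 2.0 * x + 1.0  # made-up "true" function

rng = random.Random(0)
train_x = [float(i) for i in range(30)]
test_x = [i + 0.5 for i in range(30)]
train_y = [f(x) + rng.gauss(0.0, 2.0) for x in train_x]
test_y = [f(x) + rng.gauss(0.0, 2.0) for x in test_x]

def nearest_neighbor(x):
    # Memorizer: return the y of the closest training point.
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

# Simple model: a straight line fitted by least squares.
n = len(train_x)
mean_x = sum(train_x) / n
mean_y = sum(train_y) / n
slope = (
    sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y))
    / sum((x - mean_x) ** 2 for x in train_x)
)
intercept = mean_y - slope * mean_x

def line(x):
    return slope * x + intercept

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

nn_train_error = mse(nearest_neighbor, train_x, train_y)
line_train_error = mse(line, train_x, train_y)
print("train error  nearest neighbor:", nn_train_error, " line:", line_train_error)
print("test error   nearest neighbor:", mse(nearest_neighbor, test_x, test_y),
      " line:", mse(line, test_x, test_y))
```

The nearest neighbor model reproduces the training data exactly, while the line does not. Comparing the two test errors that get printed shows how much of that perfect approximation carries over to points neither model has seen.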

The other thing that's nice about simple models is that they're just easier. It's been a little while since I've tried to set up TensorFlow and PyTorch on my laptop, but it was not very fun last time. And scikit-learn is really easy to set up. So I would highly recommend just trying something simple and starting out with that.

Okay, y'all, take a deep breath. I'm going to take some water. There's a lot of stuff there. We're going to summarize real quick after this, but just everyone take a deep breath.

Alright. We talked about supervised learning, where we use past examples to predict a continuous value in the case of regression, or a discrete value in the case of classification. We also said that to measure the performance of our models, it's really smart. It's a really good idea to split the data into subsets of training and testing data. Another thing that we mentioned was that it's important to keep it simple, stupid, the simplest thing that could possibly work might actually work. And then you don't have to do anything more complicated. And you will probably learn something valuable along the way. The last takeaway I have for you is to test and iterate, build a model. Try it out, see if it's good. If it's not, you can always build another model. And if it is, you're done. That's great.

Thank you guys so much for coming. And I want to give a quick shout out to my employer I work for Indeed, and we like data and we like helping people get jobs. If you're interested in any of that or you just want to talk about machine learning, I would love to chat with you about anything data related. My Twitter handle is up there -- I talk about data online. Or if you have another session to get to and you do have a question but you don't want to come up after feel free to email me. I love getting email from people who care about data stuff, that's my personal email and it will get to me. But yeah, I hope that from this you're able to take and do some really cool machine learning stuff. Thank you. And if you do have questions, feel free to come up. I'd love to talk to you.

Don't try to sound smart when giving a presentation

Samuel Taylor — Thu, 29 Aug 2019 14:15:15 GMT

Think about the last time you gave a presentation. Were you nervous? Excited? Scared to look stupid in front of your boss? Confident in your ability to wow the others in the room? These emotions reveal that at our core, our biggest desire when presenting is to make ourselves look good. This desire is antithetical to a good presentation and harmful if not addressed directly.

Our first priority must be the audience. They're giving us their attention, and we have the responsibility to use that precious resource wisely. If we waste it, they'll be less likely to give us this attention in the future. While we may think we're being clever with our sales pitch, people are smart enough to see right through us (even if what we're selling is our own self image).

Prioritizing the audience requires us to adapt our message to them. This doesn't mean dumbing it down. As an example, consider presenting some interesting result to two audiences: a coworker and the CEO. The CEO of your company will require more context than your coworker. This does not mean that the CEO is somehow dumb; it means that she isn't as familiar with your work (for obvious reasons).

Instead of prioritizing the audience, we often prioritize ourselves (typically without even realizing it). We want to appear smart, competent, and confident in front of others. Subconsciously, this leads us to present in ways that obfuscate the truth in order to make ourselves appear intelligent. If our audience leaves a presentation thinking that we are a genius, we have probably failed to explain our idea in a way that they can understand.

We should go into presentations hoping our audience finds what we say to be "common sense". If they agree with what we've said without much questioning, we have likely guided them through our ideas in a way that is easy to understand.

If you've never given thought to the specific people that are in your room, try it out. Do research to understand what they know and how they best digest information. If you already do this, consider doubling the amount of time you spend here.

When we go into presentations seeking to look like a genius, we end up confusing rather than enlightening. Those who try to look impressive ultimately fail to do so, but (paradoxically) those who give up on impressing do.

A five minute meeting shouldn't ruin your productivity

Samuel Taylor — Thu, 22 Aug 2019 11:53:02 GMT

As developers, we deplore being interrupted. A stream of Twitter posts lament that a five minute meeting can harm productivity for hours. But it doesn't have to be like this: we can dramatically decrease the cost of being interrupted with a simple technique. What's more, implementing this technique makes us more effective and confident in our work.

Problems with this narrative

I can't talk about this without first critiquing it. The framing of these complaints is often adversarial.

That five minute meeting with a developer... pic.twitter.com/xPgYwDzCnY
— Richard Campbell (@richcampbell) June 21, 2019

When we create a "developers vs. others" mentality, we decrease the level of trust and psychological safety on our teams. Instead, we should seek to understand the "other" in our midst; why does our PM have to ask a dev a question so often? Is our documentation bad? Are we changing direction too quickly?

A good working relationship allows both sides to understand each other better. If we want to decrease the amount of interruptions we experience, posting divisive tweets is a strategy that may actually have the opposite effect in the long term.

Working alongside every member of our team is a much better strategy for creating change than spreading insular, us vs them ideas.

Taking matters into our hands

We aren't helpless here; we can change our own behavior to address this issue. Here's what's been most effective for me: taking notes. It's a simple but powerful technique. A solid set of notes enables us to change contexts with confidence, knowing that returning to a previous context will be easy.

Because of the time/effort it takes to gain, well, context on a certain task, context switching is hard. Human brains have limited capacity for remembering things like "What line was it that was causing that exception?" or "Which features were most important in the last training of this model?". Conveniently, computers are great at this kind of thing.

While I started taking notes to avoid keeping too much state in my brain, I have realized a couple other benefits:

Record of results: as a Data Scientist, I do a lot of experiments. For new problems, I'll often try several different things. Seldom do they all work! My notes here function as a sort of "lab notebook", enabling me to review the experiments I've run and reflect on their implications (either by myself or with colleagues).
Multiple projects: when I'm blocked on something, it's become much easier to switch to work on another project. Having a few projects at once is much easier to manage when the current state of each project is written down rather than in my head (or on JIRA).

Writing good notes

So how do we take good notes? Here's what works for me. Each project gets a document. While working on the project, I append things:

Surprising discoveries (for example textual descriptions, drawings, data visualizations)
Helpful bits of documentation (StackOverflow questions, company wiki)
Descriptions of what I'm trying to do and how I'm trying to do that. Even a few words or a file name and line number are enough to jog my memory.
Results of things I've tried. Did a certain test fail? What was the MSE when I include this feature?

This is a fractal process, so I find it useful to use bulleted lists with multiple levels.

The particular medium we use for this isn't important. I use OneNote (because its free tier offers everything I need), but there are a slew of other products you can use. Digital note apps are convenient because they sync across devices, but you could also use plaintext documents, paper notebooks, or even a whiteboard if you prefer.

You can do this!

You may think that creating this set of notes is more trouble than it's worth. What surprised me when I started doing this was that writing down my progress ended up being fairly easy. All in all, it beats the heck out of losing your place because someone interrupts you.

Of course, this isn't a panacea. Being interrupted still can cause us to lose track of things we haven't quite written down yet. And we do have to build the habit of making good notes. But the costs are worth it.

You should do this

A written log of work gives us confidence that we can execute on our projects. Because we track state for each project, we can stop worrying about forgetting something valuable. Every time we step up to a task, we have a helpful record of what our goal is and the ways we've tried to achieve it. We free up space in our brain for more useful tasks.

A five minute meeting doesn't have to ruin your productivity. By letting computers do what they're good at, we free our brains to do the difficult, valuable intellectual work that most fulfills us. Take notes as you work–it's your new superpower.

Recognize Class Imbalance with Baselines and Better Metrics

Samuel Taylor — Fri, 02 Aug 2019 05:00:00 GMT

Model-agnostic feature importance through ablation

Samuel Taylor — Tue, 23 Jul 2019 20:50:22 GMT

Feature importances are, well, important. We can use them to provide a rudimentary level of interpretability; if a feature has higher importance, it has greater impact on the target variable. Some machine learning models have an innate way of calculating feature importance (decision trees, for instance). Others don't have a way of doing this (for example, support vector machines using an RBF kernel). Further, some models result in a set of coefficients (like linear regression) that are easy to misinterpret (e.g. if you have two features with dramatically different scales).

Feature ablation is a technique for calculating feature importances that works for all machine learning models. Given a dataset of n rows and m features, the procedure goes like this:

Train the model on your train set and calculate a score on the test set. You can pick whatever scoring metric you like.
For each of the m features, remove it from the training data and train the model. Then, calculate the score on the test set.
Rank the features by the difference between the original score (from the model with all features) and the score for the model using all features but one.

Example code

Here's an example of how we could actually perform this procedure in Python using scikit-learn.

First, we import some things we'll need: load_digits will load in the digits dataset. SVC is the model we'll use. train_test_split is a utility method that splits the dataset into training and testing portions. sklearn.metrics has a lot of pre-defined metrics in it.

from sklearn.datasets import load_digits
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
import sklearn.metrics as mx

We load and split the data.

data = load_digits()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=42)

Now we define a function which will train and score a model for us. Given the data, it creates and trains a support vector machine, then returns the accuracy. Finally, we store the score of our model with all features into base_score.

def score_model(X_train, X_test, y_train, y_test):
    clf = SVC(gamma='scale', kernel='rbf')
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    return mx.accuracy_score(y_test, y_pred)

base_score = score_model(X_train, X_test, y_train, y_test)

Then, we iterate through all features, creating an array use_column which we use to select all columns except for the one which we're currently scoring. We store the score of a given model in the list scores.

scores = []

for i in range(X_train.shape[1]):
    use_column = [ndx != i for ndx in range(X_train.shape[1])]
    scores.append(score_model(X_train[:, use_column],
                              X_test[:, use_column],
                              y_train,
                              y_test))

Finally, we get the top 10 features.

sorted(enumerate([base_score - s for s in scores]),
       key=lambda ndx_score: ndx_score[1],
       reverse=True)[:10]

"""
[(12, 0.005555555555555647),
 (21, 0.005555555555555647),
 (5, 0.002777777777777879),
 (10, 0.002777777777777879),
 (17, 0.002777777777777879),
 (18, 0.002777777777777879),
 (20, 0.002777777777777879),
 (34, 0.002777777777777879),
 (37, 0.002777777777777879),
 (46, 0.002777777777777879)]
"""

Relation to stepwise regression

You may recognize this idea as being similar to backward stepwise regression. Wasserman (2005) describes this technique for model selection as "we start with the biggest model and drop one variable at a time" (p. 221). We drop variables until the score has decreased beyond some acceptable level or until we have reached the desired number of features. He notes that this is a greedy search and is not "guaranteed to find the model with the best score." If we were to use scikit's recursive feature elimination in combination with this feature ablation technique, we would be using backward stepwise regression.

If you do decide to apply stepwise regression, be careful with the test set used to evaluate the features. If you choose features that optimize the score on the test set, you are overfitting to the test set (and any metrics calculated for the test set will be incorrect). If performing stepwise regression, I would recommend splitting the training set into 5 folds and performing cross validation to select features. After that process, metrics calculated on the test set remain valid (because it was not used during training).
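As a rough sketch of that recommendation, the ablation loop can score each candidate with 5-fold cross validation on the training set only, leaving the test set untouched. The helper name cv_score and the short candidate_features list are assumptions made for illustration; only three arbitrarily chosen features are ablated here to keep the example fast.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

data = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42)

def cv_score(X, y):
    # Mean accuracy over 5 folds of the *training* data; the test set is never used.
    return cross_val_score(SVC(gamma='scale', kernel='rbf'), X, y, cv=5).mean()

base_cv_score = cv_score(X_train, y_train)

# Hypothetical subset of features to ablate (a full run would loop over all columns).
candidate_features = [12, 21, 5]
cv_drops = {}
for i in candidate_features:
    use_column = [ndx != i for ndx in range(X_train.shape[1])]
    cv_drops[i] = base_cv_score - cv_score(X_train[:, use_column], y_train)

print(cv_drops)
```

Once features have been chosen this way, a final model can be scored on X_test a single time.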

Conclusion

This technique provides a general way to calculate feature importances for any classification or regression model (even those that don't natively support them). It's also an element of a feature selection technique called stepwise regression.

Comments? Questions? Concerns? Please tweet me @SamuelDataT or email me (sgt at this domain). Thanks!

References

Wasserman, L. (2005). All of statistics: A concise course in statistical inference. New York: Springer.
Grande, E. Browsing record store shelves. Unsplash.

Linear interpolation in Postgres using generate_series

Samuel Taylor — Fri, 07 Sep 2018 20:21:04 GMT

I like to keep track of how many miles I'm driving in my car. One conceivable way of doing this is to create a table in a Postgres database in which I can track this information.

CREATE TABLE mileage (
    observed_date DATE,
    observed_mileage INTEGER
);

Unfortunately, I'm not always the most regular data collector. I often collect this data with gaps of days or months between each reading.

INSERT INTO mileage(observed_date, observed_mileage) VALUES
    ('2018-05-21', 84088),
    ('2018-05-26', 84201),
    ('2018-06-13', 84910);

I want to get some sense for how much I'm driving each day, and one reasonable way I might do that is to linearly interpolate the mileage between readings. For instance, if I see a reading of 10,000 on August 1st and a reading of 11,000 on August 11th, I want to see that on average I drove 100 miles each day 1-10 August.

How can we do this in Postgres? First, we pair the data up:

SELECT LAG(observed_date) OVER (ORDER BY observed_date) AS lag_date
  , LAG(observed_mileage) OVER (ORDER BY observed_date) AS lag_mi
  , observed_date AS obs_date
  , observed_mileage AS obs_mi
FROM mileage;

which yields result:

  lag_date  | lag_mi |  obs_date  | obs_mi
------------+--------+------------+--------
            |        | 2018-05-21 |  84088
 2018-05-21 |  84088 | 2018-05-26 |  84201
 2018-05-26 |  84201 | 2018-06-13 |  84910

Then, we generate a series between each pair of dates:

WITH paired_dates AS (
  SELECT LAG(observed_date) OVER (ORDER BY observed_date) AS lag_date
    , LAG(observed_mileage) OVER (ORDER BY observed_date) AS lag_mi
    , observed_date AS obs_date
    , observed_mileage AS obs_mi
  FROM mileage
)
SELECT *
FROM paired_dates
  , generate_series(lag_date, obs_date, INTERVAL '1 day') days(driven_date)
LIMIT 10;

which yields result:

  lag_date  | lag_mi |  obs_date  | obs_mi |      driven_date
------------+--------+------------+--------+------------------------
 2018-05-21 |  84088 | 2018-05-26 |  84201 | 2018-05-21 00:00:00+00
 2018-05-21 |  84088 | 2018-05-26 |  84201 | 2018-05-22 00:00:00+00
 2018-05-21 |  84088 | 2018-05-26 |  84201 | 2018-05-23 00:00:00+00
 2018-05-21 |  84088 | 2018-05-26 |  84201 | 2018-05-24 00:00:00+00
 2018-05-21 |  84088 | 2018-05-26 |  84201 | 2018-05-25 00:00:00+00
 2018-05-21 |  84088 | 2018-05-26 |  84201 | 2018-05-26 00:00:00+00
 2018-05-26 |  84201 | 2018-06-13 |  84910 | 2018-05-26 00:00:00+00
 2018-05-26 |  84201 | 2018-06-13 |  84910 | 2018-05-27 00:00:00+00
 2018-05-26 |  84201 | 2018-06-13 |  84910 | 2018-05-28 00:00:00+00
 2018-05-26 |  84201 | 2018-06-13 |  84910 | 2018-05-29 00:00:00+00

Note that 2018-05-26 occurs twice in the driven_date column. We can fix that by stopping our series just before getting to the later date:

WITH paired_dates AS (
  SELECT LAG(observed_date) OVER (ORDER BY observed_date) AS lag_date
    , LAG(observed_mileage) OVER (ORDER BY observed_date) AS lag_mi
    , observed_date AS obs_date
    , observed_mileage AS obs_mi
  FROM mileage
)
SELECT *
FROM paired_dates
  , generate_series(lag_date, obs_date - INTERVAL '1 minute', INTERVAL '1 day') days(driven_date)
LIMIT 10;

Anyway, now we need to calculate the actual number of miles driven on the driven_date.

WITH paired_dates AS (
  SELECT LAG(observed_date) OVER (ORDER BY observed_date) AS lag_date
    , LAG(observed_mileage) OVER (ORDER BY observed_date) AS lag_mi
    , observed_date AS obs_date
    , observed_mileage AS obs_mi
  FROM mileage
)
SELECT driven_date
  , (obs_mi - lag_mi)::NUMERIC / (obs_date - lag_date) AS miles_driven
FROM paired_dates
  , generate_series(lag_date, obs_date - INTERVAL '1 minute', INTERVAL '1 day') days(driven_date)
LIMIT 10;

which yields result:

     driven_date     |    miles_driven
---------------------+---------------------
 2018-05-21 00:00:00 | 22.6000000000000000
 2018-05-22 00:00:00 | 22.6000000000000000
 2018-05-23 00:00:00 | 22.6000000000000000
 2018-05-24 00:00:00 | 22.6000000000000000
 2018-05-25 00:00:00 | 22.6000000000000000
 2018-05-26 00:00:00 | 39.3888888888888889
 2018-05-27 00:00:00 | 39.3888888888888889
 2018-05-28 00:00:00 | 39.3888888888888889
 2018-05-29 00:00:00 | 39.3888888888888889
 2018-05-30 00:00:00 | 39.3888888888888889

And thus I have achieved the desired result.
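For readers who want to sanity-check the arithmetic outside the database, here is a small Python sketch of the same interpolation. The function name interpolate_daily is made up for illustration; the readings are the three rows inserted above.

```python
from datetime import date, timedelta

readings = [
    (date(2018, 5, 21), 84088),
    (date(2018, 5, 26), 84201),
    (date(2018, 6, 13), 84910),
]

def interpolate_daily(readings):
    """Yield (driven_date, miles_driven) for each day between consecutive readings."""
    readings = sorted(readings)
    # Pair each reading with the previous one, like LAG(...) OVER (ORDER BY observed_date).
    for (lag_date, lag_mi), (obs_date, obs_mi) in zip(readings, readings[1:]):
        days = (obs_date - lag_date).days
        per_day = (obs_mi - lag_mi) / days
        # Stop just before obs_date so no date is emitted twice.
        for offset in range(days):
            yield lag_date + timedelta(days=offset), per_day

daily = list(interpolate_daily(readings))
for driven_date, miles_driven in daily[:10]:
    print(driven_date, round(miles_driven, 4))
```

The first five days come out at 22.6 miles each and the following days at roughly 39.39, matching the query output.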

Machine Learning Crash Course

Samuel Taylor — Tue, 10 Apr 2018 17:54:24 GMT

Delivered at:

AnacondaCON on 10 Apr 2018. Slides available here. Video embedded above.
Windy City DevFest on 1 Feb 2019. Slides available here. Video available here.

Transcript

Have you ever applied for a credit card? A few weeks ago I was up late at night and apparently my idea of a good time is to try to churn some credit card rewards. So I was on a bank's website entering information about myself, and I got to a point where I had to hit a submit button to submit my application. I clicked submit, and within a split second the page loaded and they told me whether I got the credit card. My mind was just blown by this because I thought, "Surely they don't have a squadron of people sitting around reviewing applications. It's 2:00 a.m. And even if they did, those people couldn't possibly make a decision that quickly." And I was just very confused by this. But the secret here to how they're able to do this lies in machine learning. In other words, people can take past examples from history and use them to create a formula that allows them to predict future outcomes. I am Samuel Taylor, as introduced, and today we are going to be going through a crash course in machine learning. There are a lot of things that this talk is and aspires to be. And chief among them is to introduce y'all to machine learning concepts. If you don't know anything about any of this stuff I hope that you are able to come in here and feel like you took something away.

But if you do have a little bit more experience I hope that this is going to be helpful for you still because you'll be able to see some different approaches that you might not have considered before and hopefully that will be beneficial to you. I definitely want to orient this toward application because I find that sometimes when we talk about machine learning especially at an introductory level it becomes a really weird and hypothetical scenario where you're saying, "Oh well if you have this situation then you might want to do this" and it just gets weird to talk about and it's a lot more interesting to talk about an application level. That said it is still very respectful of the theory in this field because there's a lot of really talented people who've put a lot of really good effort into understanding the way that these models work. And if we are able to gain some of that understanding, we'll do better at the application. However, you're not going to be a data scientist after this. We have about 45 minutes. This is not going to be an entirely comprehensive event here. There is a lot of stuff that I won't be able to cover and a lot of things that we're going to have to gloss over because again there's only 45 minutes here. I think that tutorials and detailed code examples are really interesting and there's going to be a few code snippets but those are, I think, less applicable in a forum like this.

I'm happy to talk to any of you afterward about specifics and give you links to GitHub repos and stuff if you're interested. But for the purposes of this talk we will try to stay a little higher level and talk about what the models used were and some of the techniques and try to focus less on the specifics. This is how we're going to be doing all of this stuff. We'll start off with a few minutes just on what even is machine learning and then we have sort of a warm up use case that is this credit card application example I'm talking about. And then we'll dive into 3 use cases. Basically three problems that I ran into and the ways I ended up solving them including what I learned from those. Finally we'll just wrap up everything and hopefully tie a nice little intellectually tasty bow on this package. So for those of you here in this room if you've heard the phrase machine learning before raise your hand. OK, some of y'all are liars because I have said it at least three times already, but now I know which ones of you aren't telling the truth. If you feel like you've done something with machine learning that you found significant or just interesting, I'd love to see a hand. Awesome. OK. For all of you, I hope you get something out of this. For the rest of you, we're so glad you're here, and I think that by the end of this you'll be able to try to do something really cool.

All right. There is a lot of stuff that is machine learning really and a lot of times people break it up into supervised and unsupervised problems. Again this isn't comprehensive. Supervised learning is what we're going to talk about for most of today and what we'll talk about first. In supervised learning we have a set of input data and a set of output data, and the input data maps to the output data. So we could have, for instance, credit card application data that maps to whether or not someone should get the credit card. So let's talk about classification first. In classification we have data that looks something like this. This is again for credit card applications, and this is entirely hypothetical data that was generated by me drawing some stuff on a graph and then reading the graph. So it's completely fake but it gives you an idea of what's happening here. In supervised machine learning we have our features and our output, and here the output is whether or not someone is given credit. That's the thing we're trying to predict, and the input (or the features) are the age and the net worth.

So we have some input and we're trying to produce some output. You could take all this data and put it on a graph and we say we have seen these six credit card applications in the past from people of various ages of various net worths and we want to know whether to give them a credit card, right? And for all of the hype and excitement around machine learning this is pretty much just drawing lines. That's all we're going to end up actually doing here. So we're going to do some really fancy line drawing today. And we can draw a line through here that separates these data points and then when a new data point comes in, we can ask the model, "Hey, should we approve this credit card?" and it can look and say, "Well, it's on the 'approve' side of the line so we'll approve this person."

Regression is very similar to classification in that it still has input data and output data. But instead of predicting what kind of thing something is (trying to predict a discrete value with a certain number of outputs), it's trying to predict a continuous value. So we're trying to answer the question "how much" of something. To continue in the bank, let's say we are trying to give people loans, and someone comes in and they say "My net worth is this, I want a loan. How big of a loan are you willing to give me?" And you could have a bunch of data that looks something like this. And we could see, well, we gave this person with a million dollars a very large loan and we gave a person with less money less money. And then again all this is is just fancy line drawing. So we draw this line and that's what we call "fitting" the model and then we'll want to predict for a new customer that walks into our bank and says, "I have $500,000, how much of a loan are you willing to give me?" And you just jog up to the line and then jog over to the Y-axis and say we're going to give them sixty thousand dollars. So pretty simple stuff with supervised learning.
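A minimal sketch of that "fancy line drawing" with scikit-learn. The net worth and loan figures are invented so that they fall exactly on a line; they are not the data from the slide.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up past loans: net worth in dollars -> loan size in dollars.
net_worth = np.array([[100_000], [250_000], [750_000], [1_000_000]])
loan = np.array([12_000, 30_000, 90_000, 120_000])

# "Fitting" the model is drawing the line through the past examples.
model = LinearRegression()
model.fit(net_worth, loan)

# A new customer walks in with $500,000: jog up to the line, then over to the Y-axis.
predicted_loan = model.predict(np.array([[500_000]]))[0]
print(round(predicted_loan))
```

Because the invented points are perfectly linear, the prediction lands at about sixty thousand dollars, the figure used in the talk.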

Unsupervised learning: also interesting. One of the biggest areas in it is called clustering. It's probably one of the key algorithms to know in this space. With clustering, we don't have any outputs. We just have inputs pretty much. We see we have nine data points and here each data point is a different shape and we might have a table that has the number of sides and the color that this shape is but we don't really know what we're looking for here. With clustering we're just trying to uncover some underlying structure in the data. So we might hand this to a computer and say hey split this into three groups for me. And it seems like you can split this into three groups a number of ways but the computer might just say here's three groups of these things. And it will provide you with those 3 clusters.
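A small sketch of asking a computer to "split this into three groups", assuming scikit-learn's KMeans. The nine shapes and the numeric encoding of color are made up for illustration; treating a color code as a number is a simplification, not a recommendation.

```python
import numpy as np
from sklearn.cluster import KMeans

# Nine made-up shapes: (number of sides, color code), no outputs at all.
shapes = np.array([[sides, color] for sides in (3, 4, 6) for color in (0, 1, 2)])

# Ask for three groups; there is no "right answer" to check against.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(shapes)

for shape, label in zip(shapes, labels):
    print(shape, "-> cluster", label)
```

Which three groups come back depends on the data and the algorithm, which is the point made above: the shapes can be split several ways.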

There's a lot of other stuff and this goes back to the non comprehensive aspect of this talk. This looks like a slide. It's actually secretly an insurance policy against angry people on Twitter saying that I didn't cover their favorite subject.

So all in all the way that I sort of think about machine learning is that we want to find some function in the world. There is a mathematical function that describes whether someone should be approved for a loan whether they are going to default on a loan that exists out in the world. The problem is that we can't possibly know that because there are so many factors going into it. I mean, if someone comes into an unexpected medical expense they might default on a loan and there's all sorts of different factors that affect this. And so it's impossible for us to know all of these things. The interesting thing and the thing that gives us hope is that we're able to measure some points from this. We can look at the past and see: well, we know that these points aren't a perfect representation of the actual function but we can see kind of some area around it and then build something from there.

So the goal in machine learning especially supervised learning is that we are trying to find some algorithm which will give us a G of X to approximate F of X. So we're not going to be able to find the true function, but we want to find something close to it. So all that said we're going to be walking through three use cases today and I wanted to just have a warm up one so that we can all get used to this. We're going to be walking through five questions for each of these cases: What's the problem? What does the data look like? What kind of machine learning problem is this? And then we'll dive into some details on the solution, and we'll talk about lessons learned. So in this case for our credit card application data the problem is we're trying to decide as a bank whether we should give a consumer a credit card and the data we've already seen today. It kind of looks like this. We have some input features we have output. So here is the time for all of your beautiful faces to shout at me if you're angry. Get all of your aggression out and tell me, "What kind of machine learning problem is this?" Supervised, yes. Yes. Any more specific than supervised? You're going to be like a little. Yeah. Classification yes. Y'all are brilliant, this is amazing. Yes, so this is a classification problem, and we could solve this by drawing a line because that is what we're doing today.

There's a library in Python called scikit-learn that implements a lot of these algorithms for you so you don't have to waste the time doing that yourself. And basically what you can do is import a model and say, "fit this model to the data that I have," and that will do the step of drawing a line and then you say, "predict" on a new data point, and you give it the input data, and it will give you the output that it thinks is true. That's sort of what you do. But then we kind of ask ourselves, what did we do here? Did we actually accomplish anything? Is this any good? So we want to know how accurate this thing is. In this case, obviously this is very contrived data and it's deliberately drawn to be easy. And obviously it's going to get 100 percent accuracy on this. But what does that mean, and how can we think about this more generally?
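The fit and predict steps might look roughly like this. The six applications are invented stand-ins for the hypothetical data on the slide, and logistic regression is just one assumed choice of line-drawing model, not necessarily the one used in the talk.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up applications: [age in years, net worth in thousands of dollars].
X = np.array([[25, 10], [35, 20], [45, 30], [30, 200], [40, 300], [50, 400]])
y = np.array([0, 0, 0, 1, 1, 1])  # 1 = approved, 0 = denied

# "Fit this model to the data that I have": this draws the line.
model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# "Predict" on new data points: which side of the line are they on?
new_applicants = np.array([[40, 350], [40, 15]])
decisions = model.predict(new_applicants)
print(decisions)
```

On data this contrived, the first applicant lands on the "approve" side of the line and the second on the "deny" side.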

To give you an example I have taken a true function that I made up, and I put it on a graph. You can't see it yet. But then I drew some observations from that function to sort of model what happens in the real world and that's all these little blue points that you see. And then I fit two different models to them. You can see they do different things. Basically one of these is better than the other. I think you might be able to tell. Does anyone think Model B is better than Model A at approximating the true function underlying this? OK. OK, is anyone going to be brave and say that Model A is doing a better job here?

OK. OK. Some brave souls out there. So in this case the true function is actually definitely modeled better by Model B. In usual situations we don't actually know what this green line that I've drawn here which represents a true function is we just have observations. And so for a human being we can look at this and say, "OK yeah clearly Model B is better" but there's caveats of being able to tell a computer how to do this. So what we end up doing is holding out some data for testing and it looks like this is a little bit small on the screen up there. So I will describe that about 20 percent of these points are red and the rest of them are blue. And what we've done here is we had some data that came in to us and we said, "Some of this we'll call training data and some of this we'll call testing data." And what we can do is basically train our model on the training data and then predict the values for the testing data and compare the two. Then after that, ideally what we could do is calculate the actual cost of making an error. So if I was designing a system that could tell from radar data whether a nuclear warhead was headed toward the United States and I said there was one and there actually wasn't one that would be a huge mistake. Probably a lot of people would die. Maybe everybody. However if someone is able to fingerprint into my phone that is a much lower cost than literally the destruction of the human race.
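A sketch of holding out about 20 percent of the data for testing, using scikit-learn's train_test_split. The synthetic observations and the linear model are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Made-up observations drawn from an assumed true function plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X[:, 0] + 1 + rng.normal(0, 1, size=100)

# Call 80% of the points training data and 20% testing data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Train on the training data only, then predict the held-out points.
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Compare predictions with the true held-out values.
for true_value, predicted in list(zip(y_test, y_pred))[:5]:
    print(round(true_value, 2), round(predicted, 2))
```

The model never sees the testing points while fitting, so the comparison at the end says something about how it handles new data.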

It's basically going to be some dank memes are going to get out there and that's not really going to change the world that much. So ideally you'd be able to determine what kind of error you're making and how expensive it is and optimize your model for that. In the real world it's not always as clear. And so there are some error metrics that you can use to try to help you understand how your model is performing. One common metric for regression problems is mean squared error. I've drawn a little example up here with some true data and then our predictions. And basically what you do here is you subtract the predicted value from the true value, and then you square that number, and you sum all of that up and then divide it by the number of points you are predicting for. And we can say, "The metric here is 18.35," and you try to minimize that error when you're comparing models to each other.
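As a rough sketch of the two ideas above (holding out some data for testing, and scoring predictions with mean squared error), here is a small plain-Python example. The function names, the 20 percent split, and all of the numbers are made up for illustration; they are not the values from the slide.

```python
import random


def hold_out_split(points, test_fraction=0.2, seed=0):
    """Shuffle the points and hold out a fraction of them for testing."""
    shuffled = list(points)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    # The first n_test shuffled points become testing data, the rest training.
    return shuffled[n_test:], shuffled[:n_test]


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between truth and prediction."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared) / len(squared)


# Hypothetical true values and predictions for four test points.
y_true = [10.0, 12.0, 15.0, 20.0]
y_pred = [11.0, 12.0, 13.0, 24.0]
mse = mean_squared_error(y_true, y_pred)  # (1 + 0 + 4 + 16) / 4

train, test = hold_out_split(range(10))
print(mse, len(train), len(test))
```

In practice, scikit-learn ships its own `train_test_split` and `mean_squared_error`, so you would rarely write these by hand; the point here is only that both are a few lines of arithmetic.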

In other cases like classification you can't use that because it doesn't make as much sense, but one common metric that makes a lot of intuitive sense is classification error, and basically you're just trying to see, "Did I do the right thing?" So if on four of these points I correctly classified that they were a certain class then I would say my error was twenty five percent on those points. So in this use case we've learned a few things. The first thing I would say is that this stuff is just pretty neat. There's cool things we can do. It's less intimidating than we thought it was; it's just drawing lines. And then also that it's important to withhold testing data so that way we can evaluate how our models are doing. So in our next use case we are going to talk about teaching a computer sign language.
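Classification error, as described above, is just the fraction of points the model got wrong. A minimal sketch, with made-up labels and a hypothetical function name:

```python
def classification_error(y_true, y_pred):
    """Fraction of predictions that do not match the true class."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


# Hypothetical labels: three of the four points are classified correctly.
y_true = ["A", "B", "A", "C"]
y_pred = ["A", "B", "A", "B"]
error = classification_error(y_true, y_pred)
print(error)
```

With one mistake out of four points, the error comes out to one quarter.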

So the problem here is that I don't know sign language, and I want to communicate with deaf people because they have valuable things to communicate. And I was trying to think about this problem and a friend and I were going to a hackathon when we were in university and we were thinking of what we could do with this little toy that we had. This is a thing called the Leap Motion. It's a little sensor that has an IR LED in it, and you can see that in that picture here because cameras are weird. I guess I don't know all the weird camera stuff that makes infrared visible to this camera but not to the human eye. When you're actually looking at it in person, you can't see that. But basically what it does is it flashes infrared light up on the human hand and can tell basically where hands are in space. So for this example someone has a little sensor sitting on their desk or potentially mounted to like a VR headset and they're holding their hands kind of like this up to it and you can see what the computer is seeing here. Each of these little dots that is connected by the little pipes I guess is a place on the human hand. By and large just joints in the hand and then also fingertips. And we were looking at this and thinking, OK, what if we could do something with this data to try to help with this problem with sign language?

And we only had 24 hours, so we thought, "What if we limit the scope here, and we just try to do something for American sign language?" (because there are a lot of dialects of sign language). And also we just tried to do the alphabet. Let's start small and see if we can get anything working. So we gathered some data. We took our little device and plugged it into the computer and then we made an "A" above it and we said "This is an a, this is an a, this is an a", a bunch of times we gave it a bunch of examples of what an A looks like. And then we made a B above it and we said, "This is a B, this is a B, This is a B" and we taught it. We gathered data about what these things looked like. You can see here that we have x, y, and z coordinates for each of 20 positions in the human hand. And so you end up getting 60 features for one output and the outputs here are the letters of the English Alphabet A through Z. So I think y'all know what time it is; it's time to shout again. What kind of machine learning problem is this?

Yes, definitely supervised. I also heard someone say classification over there I forget who it was. But you did a great job. So yes this is a classification problem because there are a discrete number of things we're trying to predict. There's just 26 things and we know it's one of those 26 things.

So let's talk about how we went about solving this problem. The first thing we needed to do was choose a model and that's something that you'll often need to do when you're doing machine learning stuff. And we were somewhat early in our time of doing this kind of stuff so we weren't really sure what we were doing. We just took a bunch of models from scikit-learn and just said try all of them and then we will evaluate all of them on the test data and pick the one that did the best. And as I learned a little bit more, that's actually not a bad thing to try. Just try a bunch of different things and see what works best. Then once we did that it's not enough to just have a model; we had to build some sort of interesting application around it because it's not very cool to walk up to someone to say, "Hey, I made this thing that can tell you from hand data if you're making a sign for a certain English letter." That doesn't mean anything to anybody. And so we thought we would try to make a keyboard, and we were working on it and it would read out whether we were making an a, a b or a c or whatever and put those on the screen and we could see, oh cool, we're doing this.

It turned out that it wasn't quite accurate enough. And we even tried to do some Markov Chain stuff which basically takes into account the fact that certain letters are more common after other letters. So for instance "U" is way more common after "Q" than after a lot of other letters. Anyway we weren't able to get that to work very well and so we decided, what if the answer here is to just try to make a different application around the same model? And we found that we could make a little educational game around this and we basically tried to market it as Rosetta Stone but for learning sign language. And this is a little demo that I will play. So it's a little game and it'll tell you to make a certain letter with your hand and then you try to make your hand look like that letter and then once you get it it gives you some points. At the end you can put your name in on a scoreboard because everybody likes to compete. Anyway, that's sort of what we built. And there were a lot of things we learned in this process.

The first thing that I didn't expect to be useful broadly because I thought oh we're just doing this for a hackathon. It's not a big deal. Limiting scope is huge. There are a lot of really huge problems in the world and if you try to tackle one you're going to just get lost. So it's really important to find some chunk of a problem that you can actually solve. Selecting a model is something that you're going to probably have to do if you're doing something like this. And this approach isn't the worst one. There's a lot worse you could do than to just try out a bunch of things and see what works best. The final thing I learned is that it's more than just the model. You can have a really interesting model that's good enough for a language learning game. And if you were trying to make a keyboard with it it wouldn't work very well. This is something that if you're working in a company you'll probably want to work with the product people in your company to decide what your users actually need, what they want, and what would be helpful to them. Try to gear your model toward that. Because at the end of the day we're all trying to make software for human beings.
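The "try a bunch of models and keep the one that does best on the test data" approach described above might look roughly like the sketch below. This is not the hackathon code. The data is synthetic (generated with scikit-learn's `make_classification` as a stand-in for the 60 hand-position features), only four classes are used instead of 26, and the particular list of candidate models is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the hand data: 60 features per example
# (20 positions x 3 coordinates) and a handful of classes.
X, y = make_classification(
    n_samples=600,
    n_features=60,
    n_informative=10,
    n_classes=4,
    n_clusters_per_class=1,
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hypothetical shortlist of models to try.
candidates = {
    "k-nearest neighbors": KNeighborsClassifier(),
    "logistic regression": LogisticRegression(max_iter=1000),
    "support vector machine": SVC(),
    "decision tree": DecisionTreeClassifier(random_state=0),
}

scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # accuracy on held-out data

best_name = max(scores, key=scores.get)
print(scores)
print("best:", best_name)
```

Which model comes out on top depends entirely on the data, so the winner on this synthetic set says nothing about which one would win on real hand data.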

Alright, let us move on to our second use case of the day. This is about forecasting energy load in the state of Texas. So the problem here is if we pretend that I am operating a power grid I have to know the demand at various places in order to be able to schedule the production of energy. We don't have excellent ways of storing energy for long periods of time so you kind of have to get things scheduled to where they'll be used shortly after they're created.

This isn't an entirely hypothetical problem. There is an organization that is known as the Electric Reliability Council of Texas (ERCOT) and it is their job to manage the energy market in this state. For those of you who are here from out of town, Texas has (in most places) a deregulated energy market where power is generated and then sold on a market. And power companies buy it up and then they resell it to consumers. I'm not here to debate the advantages or disadvantages of regulation in the energy market. I think that would be a much longer talk if we were here for that. But the gist is that in a lot of places (Austin being a notable exception) the energy market is deregulated and you have to know when the demand is going to happen.

So ERCOT publishes a lot of data. They published the last at least 10 years of data I think it's 14 on energy load on an hourly basis in the various weather zones. So if we look at this, these colored regions are different weather zones because it turns out that weather is the biggest factor affecting energy usage because air conditioners are a wonderful blessing in my life but are also very expensive to operate. So they break this down on a weather zone by weather zone basis and on an hourly basis and then they also provide a sum (but I didn't care about that as much). And you could plot this on a graph and see over the last 14 years how much energy has been used on a daily basis in each of these different weather zones. It's kind of interesting honestly even at this point just looking at this graph I mean like, oh people are using more energy that's kind of an interesting thing to see and you can definitely see the seasonality of when summer rolls around in Texas. People use a lot more energy that's kinda interesting.

I think you are all prepared for what is next. What kind of machine learning problem is this? Regression! Yes, wonderful! This side of the room is killing it--y'all need to work on your game.

So a simple approach here to solve this regression problem is to just find the five nearest days to you and say that we'll take the average of those. And that's basically a k-nearest neighbors model where you find the five data points closest to a certain data point and average them together and then that is the output. And it turns out that because this is a time series we can set the data point value (basically the input) to be the number of the day in the year it is. So for instance January 1st would be the first day of the year. Today is the hundredth day of the year. Happy 100 days of 2018 everybody. And you can set that to be the input value and then the output value being whatever the energy load should be. This is sort of a time series data problem and being able to turn that into a regression problem is interesting. There's actually been a lot of study around that and this is a simple approach that is reasonable to go with (I think). When you're evaluating time series data there's a lot of things to consider. We obviously can still look at the error rates, like a simple error rate: just take our predictions versus what actually happened and see the absolute value of that and divide it by some reasonable-- like divide it by the actual number, and then we can see we're off by 3 percent on average, which is fine. I mean it depends on how accurate you're trying to get.
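To make the k-nearest neighbors idea above concrete, here is a minimal sketch using scikit-learn's `KNeighborsRegressor`. The load data is invented (a seasonal curve plus noise, not ERCOT's numbers), and the only input feature is the day of the year, which is a simplification of what the talk describes.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

# Made-up data: three "years" of daily load with a summer peak plus noise.
day_of_year = np.tile(np.arange(1, 366), 3)
seasonal = 15 * np.sin(2 * np.pi * (day_of_year - 100) / 365)
load = 50 + seasonal + rng.normal(0, 2, day_of_year.size)

X = day_of_year.reshape(-1, 1)  # one input feature: the day number
X_train, X_test, y_train, y_test = train_test_split(
    X, load, test_size=0.2, random_state=0
)

# Average the five nearest days to make a prediction.
model = KNeighborsRegressor(n_neighbors=5)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Simple error rate: absolute error divided by the actual value, averaged.
mean_pct_error = float(np.mean(np.abs(predictions - y_test) / np.abs(y_test)))

# Estimate for the hundredth day of the year.
day_100_estimate = float(model.predict(np.array([[100]]))[0])
print(mean_pct_error, day_100_estimate)
```

The error rate computed here is the "divide the absolute difference by the actual number" metric from the paragraph above, averaged over the held-out days.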

But another specific thing you'll want to do when you're working with time series data is look at what's called the residuals, which is each of those individual data points where you take the predicted and the actual and you subtract them. And the goal is that there isn't a pattern in there. If there's any sort of seasonality in them as you look at it, you haven't quite fit the data as well as you could. The other thing you want to see is if your residuals resemble a normal distribution. If they're skewed one way or the other then you may have made a mistake somewhere. There's a lot of things I learned about this. The first thing I learned is that I did this in a very wrong way. You should really do a lot of research about this stuff beforehand. There's actually a lot of research on how to do stuff with time series data and the approach that I chose is actually not the most unreasonable way to do it but it isn't the best. There's a lot of really good tools out there, like Facebook has a tool called Prophet that is built specifically for predicting time series data. It's used at Facebook, we use it at my company, and there's a lot of places where it's used and works really well. These libraries do a really good job of taking into account common things that happen in time series data. For instance holidays happen and energy usage is going to be way different on a holiday. So that's something to keep in mind.
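Going back to the residual check described above, a very rough version needs nothing beyond the standard library. The numbers and function names below are invented, and the skewness formula used is the plain third standardized moment, which is only one of several ways to measure skew.

```python
import statistics


def residuals(actual, predicted):
    """Actual minus predicted, one value per data point."""
    return [a - p for a, p in zip(actual, predicted)]


def skewness(values):
    """Third standardized moment; roughly 0 for a symmetric distribution."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)


# Hypothetical actual and predicted loads for six days.
actual = [50.0, 52.0, 47.0, 55.0, 60.0, 58.0]
predicted = [51.0, 50.0, 48.0, 56.0, 58.0, 59.0]

res = residuals(actual, predicted)
print(res)
print("mean residual:", statistics.fmean(res))
print("skewness:", skewness(res))
```

Here the residuals average out to zero but are skewed to the positive side, which is the kind of hint that would be worth a closer look. With real data you would also plot the residuals over time to look for leftover seasonality.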

The other thing I'll say is scaling the features is important. So for this problem my features were the number of the day and the year that it was in. There were two input features, and then I had my output. But the number of the day is on a different scale than the year. The year runs from 2014 to 2018 or whatever. And that's a different range of values than 1 to 366 potentially. That can cause problems when you're doing k-nearest neighbors stuff. Specifically, this is a similar example. This is actually the credit card application data we were looking at earlier. But just to demonstrate the problem when you don't scale your features. If we were trying to predict this yellow point on the left here you can see it far to the left and then a little bit above the red X. We're going to be looking at the features around us to try to find what point is closest to me that I can say that I'm going to be like that point. When we look at this with our human eyes we say obviously the red X underneath it is the closest point. But it turns out that the way that the features are scaled, the net worth is such a larger number, like just a bigger number than the age.

Age only runs from 1 to 100-ish. Net worth can be a much larger range and so that has a huge impact on the distance. These numbers plotted next to each point on the left are the distance from the yellow point to the point associated there. So you see that actually the closest point is this one that's 50000 units away. And so it gets classified-- you can see on the right it gets classified as an "approve this application" even though it doesn't quite look like we should. What you can do though, there's a thing in scikit-learn called a standard scaler (StandardScaler) and it will take these things and scale them to where the mean is zero and the standard deviation is one (which is really helpful in a lot of circumstances). So when you look at this visually they look the same because they kind of are; it's just the actual values have been scaled to a different range. And then when you look at the difference in how that ends up classifying, it looks more like what we would expect to happen. So scaling your features is an important thing to do (especially when you're doing something like k-nearest neighbors) but is also helpful when you're using other models.
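Here is a small sketch of the scaling problem described above, using scikit-learn's `StandardScaler` and a one-nearest-neighbor classifier. The four applicants and the query point are invented for illustration and are not the points on the slide; label 1 means "approve" and 0 means "deny".

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical applicants: [age, net worth].
X_train = np.array(
    [
        [24, 150_000],  # close in age, further away in net worth
        [70, 101_000],  # far away in age, very close in net worth
        [40, 300_000],
        [55, 20_000],
    ],
    dtype=float,
)
y_train = np.array([0, 1, 1, 0])  # 1 = approve, 0 = deny

query = np.array([[25, 100_000]], dtype=float)

# Without scaling, net worth dominates the distance calculation.
unscaled = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

# With scaling, both features contribute on a comparable footing.
scaled = make_pipeline(
    StandardScaler(), KNeighborsClassifier(n_neighbors=1)
).fit(X_train, y_train)

unscaled_prediction = int(unscaled.predict(query)[0])
scaled_prediction = int(scaled.predict(query)[0])
print(unscaled_prediction, scaled_prediction)
```

On this toy data the unscaled model latches onto the 70-year-old (only 1,000 units away in net worth) and says "approve", while the scaled model picks the 24-year-old and says "deny". Putting the scaler in a pipeline means the same scaling gets applied to new points at prediction time.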

All right. This is a recent project that I've been working on to use machine learning to find your next job. The problem that I ran into about a year ago was that I was passively job hunting.

Basically, I wasn't out there actively knocking on doors and handing out resumes or anything. And I was reasonably satisfied with where I was. But I was interested in hearing if there was a particularly excellent job out there that I might want better. I couldn't find something that did exactly what I wanted, and it seemed like I was getting a lot of noise coming through from just reading job listings. There were so many things that obviously I didn't want to look into. So I was wondering if I could make this a machine learning problem. What I ended up doing was scraping a bunch of job listings. I would get the title and the company and then a link and then for a long time whenever I was bored I would just go to this spreadsheet on my phone, click on the link, read the job description and then come back and say whether or not I thought it sounded cool. I gathered a bunch of data like this and if I'm being honest with you I probably spent more time reading job descriptions this way than I would have if I didn't build this. But because all of you are here I don't think any of you have room to talk about over engineering something. So if I can't talk about my love for this here I don't think there's any safe place for me.

Anyway, you should be familiar with this question by now, what kind of machine learning problem is this?

Classification! Yes, wonderful. Thank y'all. I heard someone say clustering. It's not quite clustering because we do have a specific output variable that we're trying to predict. We want to know whether a given job sounds cool or does not.

The way that I ended up solving this is kind of tricky. So if we look at the other problems we've seen today they're all numerical data, right? We have a day number and a year. Those are obviously numbers. We have a net worth and a loan size; those are numbers. Age is a number; net worth is a number. How do we turn a job title into a number? Computers can't deal with text. You can't just throw text at a computer and have it know what to do. You have to find some way to turn it into a number. And when I ran into this I was thinking, what am I going to do here? I have all this data about text but I don't know how to fix that. So I turned to our trusty friend Google and I searched "text representations for machine learning", which pretty much is the way that you should learn a lot of stuff: just search it up. And I found this. This is an idea that is a good thing to try first. It's not state of the art by any means, but it's a good first pass. It's called a word count vector, or people will call it a bag of words.

And basically what you do here is you take all of your job titles, you find every word that occurs in any of the job titles, and place those along the columns of a matrix. And then you place each job title along the rows of the matrix, and then to fill in each slot in the matrix you look at the job title and the column and say, "Does the word in the column appear in the job title on the row?" So for instance this first one: engineer does not occur in that title but web does and applications does and senior does. And I'm not going to walk you through filling out this matrix because even after 4 I'm a little bit bored of it. And one thing to note though is that while in job titles usually they don't repeat words you theoretically could. So I put that last one. I don't know who's going to post a job titled "Data Data Data Data," but I'd be interested to hear about it. And so you can see where there are multiple occurrences of a word it isn't just a one; it can be, you know, four or whatever. So that's basically how we can turn the text and the boolean value into numbers. So that's this highlighted green part; it becomes this series of numbers here, and the highlighted blue part becomes a number there. And then it's really surprisingly simple to do this stuff because a lot of this functionality has already been built for us because there are giants upon whose shoulders we can stand and see much further.

So this is a really simple example where we take our rated jobs, pull out the titles, and then pull out whether or not it sounded cool, and then scikit-learn has this tool called a count vectorizer (CountVectorizer) which will take text data and turn it into those word count vectors I was talking about. And then we can take that data and put it into a model and fit it with the preexisting "sounds cool" or "not" data. All we have to do then is just predict on the data, and we get out this array that I've highlighted at the bottom that says, "OK the first job in the list you gave me doesn't sound interesting but the fourth one does." So that's the model I ended up building. And originally I was just doing this in a Jupyter notebook, and I was just running through it and I got a classification error of 19-- you know, around 20 percent, and I was like, "Heck yes, I'm God's gift to data science. This is going to be amazing." And then what I realized was it was just saying that everything didn't sound cool. And what I realized is most of the job listings I was reading didn't actually sound that interesting. And so I would just rate them as they didn't sound cool and the model picked up on that and it's like, "I can do super well if I just say nothing sounds cool."

So I committed what's called the base rate fallacy. And this is something that's important to understand when you're approaching a problem like this is to understand what would happen-- what are the underlying rates in these problems. Because I wasn't actually improving on anything. I was just doing as well as literally just guessing zero every time. So this is a self-portrait I drew after I discovered that I made this problem. Dealing with imbalanced classes like this is fairly common. And so I wanted to provide a little bit of insight into good ways that people do this and the way that I ended up doing this. The first thing that you can do is use better error metrics. The only way I realized that I was having this problem is because I knew to look for the base rate (and now all of you know to do that). But these are metrics that will help you understand your data in a little different way. There's no one metric that's going to be perfect for every situation, but having a family of them can help you understand what's going on much better.

Precision and recall are related concepts. In our case precision means: of the job titles that I said sounded cool, how many actually are cool? And recall means: of all the job titles that do sound cool, how many am I saying sound cool? These give you a better understanding of how you're doing in terms of like false negatives and false positives and stuff.

The other thing that is useful is to use what's called the confusion matrix which you can see at the bottom here and what you do there is you put the predicted values on one axis and the actual values on the other axis. And if I were to do something like this I might see well I'm predicting zero-- and, actually, I filled this out wrong-- but I'm predicting zero for everything, and I would notice that error much more quickly.
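To see why these metrics catch the "predict zero for everything" problem, here is a plain-Python sketch. The labels are made up (2 cool jobs out of 10), and the helper names are hypothetical; scikit-learn has ready-made versions of all three metrics.

```python
def confusion_matrix(y_true, y_pred):
    """Counts keyed by (actual, predicted) for binary labels 0 and 1."""
    counts = {(a, p): 0 for a in (0, 1) for p in (0, 1)}
    for a, p in zip(y_true, y_pred):
        counts[(a, p)] += 1
    return counts


def precision_recall(y_true, y_pred):
    """Precision and recall for the positive ("sounds cool") class."""
    cm = confusion_matrix(y_true, y_pred)
    tp, fp, fn = cm[(1, 1)], cm[(0, 1)], cm[(1, 0)]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical ratings: only 2 of 10 jobs sound cool.
y_true = [0, 0, 0, 0, 1, 0, 0, 0, 1, 0]

# A "model" that says nothing ever sounds cool.
always_no = [0] * len(y_true)
accuracy = sum(t == p for t, p in zip(y_true, always_no)) / len(y_true)

print("accuracy:", accuracy)  # looks respectable on its own
print("precision, recall:", precision_recall(y_true, always_no))
print("confusion matrix:", confusion_matrix(y_true, always_no))
```

The accuracy is 80 percent, which is exactly the base rate of "not cool" jobs, while precision and recall are both zero and the confusion matrix has an empty "predicted cool" column.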

Other than using better error metrics, one thing you can do is called "under sampling" and this is what I actually ended up doing. I had 500 job titles (let's say) and only 100 of them sounded cool. I took all of those and then I took 100 randomly selected not cool job postings and I made a new dataset out of just those 200 and I trained a model on that and that got me a much better accuracy rate for the jobs that did sound cool. Another technique that people do use is called oversampling which is kind of the opposite. So if I have those 100 cool job postings I would take those and have four copies of each of them, so that would give me 400 cool postings and then 400 not cool postings and I could just train my model on all of that. I've never actually used that because I feel weird about doing that but it's something you can do if you want to.
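Both resampling techniques described above fit in a few lines. This sketch uses the same made-up numbers as the talk (500 titles, 100 of them cool); the function names and the `(title, label)` tuple layout are assumptions, and both functions assume the "not cool" class is the larger one.

```python
import random


def undersample(rows, seed=0):
    """Keep every positive row plus an equal-sized random sample of negatives."""
    positives = [r for r in rows if r[1] == 1]
    negatives = [r for r in rows if r[1] == 0]
    rng = random.Random(seed)
    return positives + rng.sample(negatives, len(positives))


def oversample(rows):
    """Repeat the positive rows until they roughly match the negatives."""
    positives = [r for r in rows if r[1] == 1]
    negatives = [r for r in rows if r[1] == 0]
    copies = len(negatives) // len(positives)
    return positives * copies + negatives


# Hypothetical dataset: 500 (title, sounds_cool) pairs, 100 of them cool.
rows = [(f"job {i}", 1 if i < 100 else 0) for i in range(500)]

balanced_small = undersample(rows)  # 100 cool + 100 not cool
balanced_large = oversample(rows)   # 400 cool (4 copies each) + 400 not cool
print(len(balanced_small), len(balanced_large))
```

Undersampling throws data away and oversampling repeats it, so either one changes what the model sees; whichever you pick, it is worth evaluating on data that was not resampled.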

So in the end what I ended up doing was getting this all running in essentially a cron job on a remote computer and it will every week email me just a list of the top 10 jobs that sound the most interesting. So this is another thing where we talk about how do we want to use this model. If I were to just have it spam me all the jobs it thought sounded cool, that would be more than I want to look at. But I chose a model that can predict the probability of something. Logistic regression is able to tell you how probable it is that a certain job sounds cool, and I could just pick the 10 that had the highest probability of sounding cool. That gives me a much shorter list to look over each week.
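The "rank by probability and keep the top few" idea above could be sketched like this, with a bag-of-words step feeding a logistic regression. All of the job titles and ratings are invented, the top 2 are kept instead of the top 10 to keep the example small, and this is not the actual code behind the weekly email.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical rated titles: 1 = sounded cool, 0 = did not.
rated_titles = [
    "Data Scientist",
    "Machine Learning Engineer",
    "Senior Data Engineer",
    "Sales Associate",
    "Account Manager",
    "Sales Manager",
    "Retail Associate",
]
sounds_cool = [1, 1, 1, 0, 0, 0, 0]

# Hypothetical new listings to rank.
new_titles = [
    "Data Analyst",
    "Sales Representative",
    "Machine Learning Researcher",
    "Store Manager",
]

model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(rated_titles, sounds_cool)

# Column 1 of predict_proba is the probability of the "cool" class.
cool_probability = model.predict_proba(new_titles)[:, 1]

ranked = sorted(
    zip(new_titles, cool_probability), key=lambda pair: pair[1], reverse=True
)
top_jobs = ranked[:2]
for title, probability in top_jobs:
    print(f"{probability:.2f}  {title}")
```

Because the model returns a probability rather than just a yes or no, the size of the list is a choice you make afterward instead of something the model dictates.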

So some lessons that I learned from this. Obviously the first one is understand the base rate because that can really make you sad. The second thing is that doing something simple doesn't mean that it's going to be ineffective. Do any of you watch The Office, or have any of you watched the Office? OK, so there's a scene in there where Dwight is talking about Michael and Michael is his boss and his boss comes to him and he says, "K I S S keep it simple stupid. It's great advice and it hurts my feelings every time."

And that's kind of how I felt about it. I'm like I want to do something cool, I want to do deep learning, man! But it ended up just being good enough, and using a very simple model worked.

So the approximation generalization tradeoff is a theoretical concept from machine learning that can help us understand why this works. And as you might guess from the name it means that if you have more approximation you're going to have less generalization; if you have more generalization you're going to have less approximation. Those words don't really mean anything so I drew a graph that will help us understand. Again, here I've made up some data. The blue line is the truth. And then I sampled some points from it with a little bit of random noise in there to again model the real world. What I did was I fit two different models to it. One of them is a simple model, linear regression, which you probably learned in a high school algebra or precalc class. And then one of them being a more complicated model which can effectively memorize any data set that it wants to. What you see here is that for the points that I'm showing right now, the red model is killin' it. It knows every single spot; it has zero error on those. It is approximating the data set extremely well. It knows the training data by heart.

However, what you may notice is that when we add more data to this it does not do as well on those points. What you see is the green model doesn't do as well on the input data as the red model does, but it does much better on the out of sample data (on the testing data). So we have this tradeoff where more simple models are generally better at generalizing even though they're worse at approximating. So that's sort of why it's a good idea to start out with something really simple and basic and work up from there. The other good reason to do this is that it's easier. scikit-learn is a conda install away versus, I mean in my experience, setting up TensorFlow is hard. And even once you get it set up training stuff can be sad and hard and long, and it's just a lot easier to start out with something simple that you can iterate on quickly and learn a lot about your problem space before you go into something more complicated.
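The tradeoff described above can be reproduced with a few lines of NumPy. This is a sketch under assumptions: the true function here is a made-up straight line, the simple model is a least-squares line, and the "memorizing" model is a one-nearest-neighbor lookup standing in for whatever more complicated model was on the slide.

```python
import numpy as np

rng = np.random.default_rng(0)


def truth(x):
    """Made-up true function."""
    return 2.0 * x + 1.0


# A few noisy training points and a larger batch of noisy testing points.
x_train = np.linspace(0.0, 10.0, 15)
y_train = truth(x_train) + rng.normal(0.0, 2.0, size=x_train.size)
x_test = rng.uniform(0.0, 10.0, size=500)
y_test = truth(x_test) + rng.normal(0.0, 2.0, size=x_test.size)

# Simple model: a straight line fit by least squares.
slope, intercept = np.polyfit(x_train, y_train, deg=1)


def simple_model(x):
    return slope * x + intercept


def memorizing_model(x):
    """Return the y value of the closest training point (1-nearest neighbor)."""
    nearest = np.abs(x[:, None] - x_train[None, :]).argmin(axis=1)
    return y_train[nearest]


def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))


train_error_simple = mse(y_train, simple_model(x_train))
train_error_memorizing = mse(y_train, memorizing_model(x_train))
test_error_simple = mse(y_test, simple_model(x_test))
test_error_memorizing = mse(y_test, memorizing_model(x_test))

print("train:", train_error_simple, train_error_memorizing)
print("test: ", test_error_simple, test_error_memorizing)
```

The memorizing model has exactly zero error on the training points because it simply looks them up, but it also memorizes their noise, so the straight line should come out ahead on the held-out points.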

So we've just talked about a good amount of things. This is sort of a summary that we can view some key concepts from each of these use cases we've talked about. For the teaching a computer sign language, what we ended up doing was support vector machines (which is a model that is useful). It's built into scikit-learn. In the forecasting energy load in Texas data, it was time series data and what we found was using k-nearest neighbors worked really well.

However if you're doing time series data you should probably do some more research and probably use something like Prophet that's specifically built for time series data. Then the last use case we just talked about. If you run into text data, it's at least worth trying Bag of Words. It has its caveats; it has its downsides, but it's a good first step. And I ended up using logistic regression and that works really well and I get the email every week and I'm happy with it. So it works pretty well.

So basically there are some takeaways I have here. And then some recommended tools that we'll talk through. The big takeaways being (from the very beginning) in supervised learning, we want to use past examples to predict a continuous value in the case of regression or a discrete value in the case of classification. And those two correspond with questions like "how much of this thing?" or "what kind is this thing?". And then another huge takeaway is to try the simplest thing that could possibly work. This is something that my machine learning professor tried to beat into our heads and has proven to be very effective in my experience. Once you have that simple thing that is kind of working you can always test it out and iterate and maybe try a different model maybe try a different set of features and work from there.

We've been kind of light on recommendations about specific tooling but just if you want a jumping off point Jupyter notebook is a great tool that lets you interactively run models and train them on various datasets and see how they look kind of. There are some plotting tools like matplotlib and Bokeh that will let you see into what the data sort of looks like and can really help you get a better intuitive understanding for what's happening under the hood. Pandas is a great library for manipulating tabular data which, actually, all the data we saw today was all tabular data in that it had a set of rows and a set of columns. Pandas does a really good job of handling that kind of data. It can do things like read from Excel spreadsheets and read from HTML tables and read from CSVs and whatnot. Obviously I recommend scikit-learn. I used it for all of this stuff, and it's nice to not have to reimplement this stuff yourself.

There's a lot more resources available if you're interested in this stuff. If you're interested in more of the theoretical side I highly recommend a book called Learning from Data. It does a really good job of treating machine learning theory with respect. A lot of times when we talk about machine learning it feels like we're just pulling stuff out of a bag or pulling out a bag of tricks, and it's not really fair to think about it that way and there's a lot more to it than that. This does a good job of helping you understand how that works.

On the opposite side there is a blog called Practical Business Python that talks a lot about how to use these specific tools, if you're hungering for more after this talk about how to specifically do stuff. He has a lot of great resources about "how do I graph something? How do I read an Excel file?". It's really interesting, really good, solid, extremely practical, detailed read there. Then the biggest thing I would say as far as gaining extra experience from others is reading the Kaggle blog. They call it No Free Hunch (which is an adorable name) and they have a specific section on it for winners' interviews which is where-- there's all these people who compete in machine learning competitions and then whoever wins they'll do an interview of them and say what did you do. Reading through those is a huge amazing resource that I don't think is being taken advantage of enough because you can learn from some of the best data scientists in the world about how they do their job and then apply those in your own work. If you're interested to hear a little bit more detail on the sign language or machine learning to find your next job part, these are links, and I'll tweet out the slides in just a little bit.

I have more information on my website, samueltaylor.org, if you're curious about those things. I'm also happy to talk to you afterward if you have any things you want to talk about. I do work for Indeed, and I would be remiss not to thank them for their support of me doing this kind of work and talking about this stuff in front of people. If you are looking for a job please come talk to me. We like data stuff. Beyond that, again I'm Samuel Taylor. I prefer communicating over email over pretty much anything else so if you have a question you're obviously welcome to come talk to me right now but if maybe you're a little shy or just don't want to talk feel free to email me. I love reading email. I might be the only person who loves email like that. And then also I am happy to read people's tweets-- if you have questions I'm happy to take those via Twitter as well. I'm @SamuelDataT. Would love to hear from you.

Thank you so much for letting me talk to you and take this time out of your day. I appreciate it so much. I really hope you're able to get something out of this. If you have any other questions, we have about 5 minutes that I can take them, or I'm also happy to talk about it after this. But thank you.

Work Queues in Software and Productivity

Samuel Taylor — Wed, 28 Feb 2018 00:37:51 GMT

My first job at a real software company was after my sophomore year in college. As you might imagine, I learned a ton that summer. I learned about how companies organize themselves. I learned how software teams organize around different pieces of the product. I learned specific technical things. I learned how to read code. I learned about navigating large codebases. But the thing that stuck with me most was a specific architectural decision that they had made: to use what is called a work queue.

In this article, we take a look at work queues in software and productivity. First, we'll examine what they are and look into a real world use case. After that, we'll talk about how to know when to use them. Finally, we'll think beyond computers and apply them to improve our own personal productivity.

Explanation

To explain what a work queue is, let me give you a nontechnical example. Suppose that you are a high powered executive named Alice, and your company has decided that you need an administrative assistant to handle writing emails, scheduling calendar events, and other administrative tasks. They assign you an executive assistant named Bob. You and Bob set up a system for communicating what Bob needs to work on. As you're going about your day, when you run into something that Bob can work on, you go to a large whiteboard. At the top of it you write, "Schedule meeting with Carol." And then you go back to your desk and continue with your work. When something else comes up, you go back to the whiteboard and write, "Email Dan to follow up on his project".

As you're working at your desk, all Bob has to do in order to figure out what he should do next is look at the top of the white board and find the next thing he hasn't completed. While you're writing an important document, Bob will go ahead and schedule that meeting with Carol. Once he's done with that, he'll send that email to Dan. If at any point the white board is empty when he tries to find a new task, he'll sit around twiddling his thumbs, not doing anything.

What we've described here is the core of what a work queue implementation could look like. There would be a few differences, obviously. Instead of an executive named Alice, we have a process A. This could be an Android application, a Ruby on Rails web app, or what have you. This application hums along serving content to the user, and then occasionally it sends out some work to another process. This is just like the way you would write on the white board for Bob, but instead of a human person named Bob, we have a "worker process" that we'll call Process B. Process B can handle stuff in the background as process A is interacting with the user.

The whiteboard in our example essentially is the work queue itself. The idea is that process A can offload tasks to process B.
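To make the analogy concrete, here is a minimal sketch using only Python's standard library. The names (whiteboard, bob, STOP) are made up for illustration, and a thread stands in for what would be a separate worker process in a real system.

```python
import queue
import threading

# The "whiteboard": a thread-safe first-in, first-out queue.
whiteboard = queue.Queue()
completed = []   # what Bob has finished, in the order he finished it

STOP = object()  # hypothetical sentinel that tells the worker to stop


def bob():
    """Worker: repeatedly take the oldest unfinished task off the board."""
    while True:
        task = whiteboard.get()  # blocks (twiddles thumbs) while the board is empty
        if task is STOP:
            break
        completed.append(f"done: {task}")


worker = threading.Thread(target=bob)
worker.start()

# Alice writes tasks down and goes straight back to her own work.
whiteboard.put("Schedule meeting with Carol")
whiteboard.put("Email Dan to follow up on his project")

# Shut the worker down cleanly so the script can exit.
whiteboard.put(STOP)
worker.join()
print(completed)
```

With a single worker, tasks finish in the order they were written down, just as Bob works through the board one item at a time.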

Use case

Here's a real world example. A lot of times when you sign up for an account on a website, you're interacting with some web application that they have running. When you register for an account, you enter a username, a password, and your email address and hit a button that says "create account". One of the things that has to happen when you sign up for an account is they have to confirm that that's actually your real email address by sending you a confirmation email. One way they could do this is to have the application that is handling that sign up form just go ahead and send the email. The problem with doing that is that sending email can take a while. If the web application itself sends the email, it's not going to be able to send content to your screen while it's doing that. So you'll be sitting there wondering, "Why is this page taking so long to load?". This is a terrible user experience! If we take a step back and think, there's nothing essential about sending that email right then as the page loaded. We can really offload that to some other process.

To improve the user experience, we can use a work queue. Rather than sending that confirmation email during the page rendering process, we could put a piece of work in the queue to send that email. Of course, we'll keep the essential application logic of creating a user in the database, but that happens pretty quickly, especially compared with the time it takes to send an email.

How would we implement such a system? Well, a common way is to have our main process produce messages to some sort of message queue. There's lots of options here, but two common choices I've personally used are RabbitMQ and Apache Kafka. The only thing we're missing now is what we've called "Process B", or our "worker node". For this, we create a process that reads messages off the queue and does work based on those messages. As an aside, if you're using Python, definitely check out Celery.

Benefits

So why do I love work queues so much? The first reason is that they give us a chance to decouple things. Because we're creating an interface wherein Process A just has to say, "Hey, I would like this thing to happen", that same process doesn't have to know anything about the other process. Let's say Process A is a Python application. We probably start off with workers written in Python, too. If we're smart about using language-agnostic serialization, we give ourselves more flexibility in the future. If we need some library that's only available in Java or have some task that's really best suited to Haskell, we can create worker processes in those languages. This gives us a lot of flexibility to choose the right tool for the job.
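As a rough sketch of what language-agnostic serialization could look like, the example below describes a task as JSON text. The message shape ("task" plus "params"), the handler registry, and the function names are assumptions made for illustration, not a standard format. Any language with a JSON parser could play the worker role.

```python
import json


def make_message(task, **params):
    """Producer side: describe the work as plain JSON text."""
    return json.dumps({"task": task, "params": params})


def send_confirmation_email(address):
    # Stand-in for real email delivery; it only reports what it would do.
    return f"confirmation email sent to {address}"


# Hypothetical registry mapping task names to worker functions.
HANDLERS = {"send_confirmation_email": send_confirmation_email}


def handle_message(raw):
    """Worker side: decode the JSON and dispatch on the task name."""
    message = json.loads(raw)
    handler = HANDLERS[message["task"]]
    return handler(**message["params"])


raw = make_message("send_confirmation_email", address="alice@example.com")
result = handle_message(raw)
print(raw)
print(result)
```

Because the producer only emits text, it never needs to import (or even know about) the worker's code.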

The other nice thing about using a work queue is that it can make scaling easy. While we've been calling it process B, it doesn't have to be a single process, or even a single server. Say one day a ton of people start signing up for our website. Our work queue starts getting longer and longer as Process B isn't able to keep up with all the work it needs to do. An easy way to handle this issue is to start up 10 more instances of Process B. One nifty thing we can do is dynamically scale the number of workers we have based on how many things are still in the queue. If our workers start to fall behind, spin up a few more instances. If the queue is frequently empty, spin a few down.
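One simple way to express that scaling rule is a function from queue length to worker count. This is only a sketch: the function name, the target of 100 queued tasks per worker, and the limits of 1 and 10 workers are all made-up numbers you would tune for your own system.

```python
import math


def desired_workers(queue_length, tasks_per_worker=100,
                    min_workers=1, max_workers=10):
    """Return how many workers to run for the current backlog."""
    needed = math.ceil(queue_length / tasks_per_worker)
    # Clamp so we never scale to zero or beyond what we can afford.
    return max(min_workers, min(max_workers, needed))


for backlog in (0, 250, 5000):
    print(backlog, desired_workers(backlog))
```

Something outside the workers (a scheduler or monitoring job) would call a function like this periodically and start or stop instances to match.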

When to use

Let's talk about when to implement a work queue. The key insight to note is that when we find a piece of work that is easily parallelizable, that's a good candidate for this kind of system. In other words, if we encounter a problem where we can break apart a large task into a number of similar subtasks, we could likely put those tasks into a queue. For example, we might want to scrape a bunch of webpages. To do this, we could create a message that includes the URL of the page we want to scrape and says, "Hey, scrape this thing". Then, we have one process spit URLs into the queue, and a number of processes reading from that queue, scraping pages, and storing results in the database.
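Here is a small sketch of that fan-out pattern. To keep it self-contained, threads stand in for worker processes, a dictionary stands in for the database, and scrape is a placeholder that makes no network request; the example.com URLs are invented.

```python
import queue
import threading

N_WORKERS = 3
url_queue = queue.Queue()
results = {}                      # stand-in for the database
results_lock = threading.Lock()


def scrape(url):
    # Placeholder for a real HTTP request plus parsing.
    return f"contents of {url}"


def worker():
    while True:
        url = url_queue.get()
        if url is None:           # sentinel: no more work
            break
        page = scrape(url)
        with results_lock:
            results[url] = page


threads = [threading.Thread(target=worker) for _ in range(N_WORKERS)]
for t in threads:
    t.start()

# One producer puts URLs into the queue...
urls = [f"https://example.com/page/{i}" for i in range(10)]
for url in urls:
    url_queue.put(url)

# ...then one sentinel per worker, so each of them shuts down.
for _ in threads:
    url_queue.put(None)
for t in threads:
    t.join()

print(len(results))
```

Which worker handles which URL is not deterministic, and that is fine: the subtasks are independent, which is exactly what makes this problem a good fit for a queue.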

Productivity technique

Beyond being a nifty technical tool, I've been able to find applications for this in my working life. That example at the beginning (where Alice was farming out work to Bob) is actually pretty similar to how I operate day-in and day-out. Except instead of farming out work to an administrative assistant, I'm farming out work to future me. Basically, when I encounter something that is a little chunk of work that I know I can do later and that is going to knock me off task right now, I write it down in a list. I set a specific time each day to go look at the list and knock out all the things I need to do. This technique has helped me be more productive, because batching little tasks like this all together means that during the course of my day, I make fewer costly context switches between deep, analytical tasks and more administrative tasks.

Further, I have found that the amount of context in those analytical tasks is usually much greater than the context for an administrative task. That means that if we group the administrative tasks together, switching between them may still result in the same number of context switches, but they are each less costly.

I will forever be grateful to the people I worked with during that summer four years ago. Getting exposure to common patterns and concepts has been immensely helpful in my work as a software engineer, and I hope that hearing about this idea will help you solve a problem some day.

Using DISTINCT ON in Postgres

Samuel Taylor — Mon, 12 Feb 2018 00:39:08 GMT

Every once in a while, I'll have a need to do a one-to-many join, but keep only a certain row in the "many" table. For instance, say we have a system to track inventory in retail stores:

CREATE TABLE store (
    id SERIAL PRIMARY KEY,
    name TEXT
);

CREATE TABLE inventory (
    store_id INTEGER REFERENCES store,
    quantity INTEGER,
    item_name TEXT,
    PRIMARY KEY (store_id, item_name)
);

INSERT INTO store(name) VALUES
    ('School Supplies R Us'),
    ('Grocery Mart');

INSERT INTO inventory VALUES
    (1, 1, 'Backpack'),
    (1, 12, 'Pencil'),
    (1, 4, 'Pen'),
    (2, 12, 'Egg'),
    (2, 1, 'Flour (lb.)');

We can get the inventory for all stores easily enough.

SELECT name, quantity, item_name
FROM inventory
JOIN store ON inventory.store_id = store.id;

         name         | quantity |  item_name
----------------------+----------+-------------
 School Supplies R Us |        1 | Backpack
 School Supplies R Us |       12 | Pencil
 School Supplies R Us |        4 | Pen
 Grocery Mart         |       12 | Egg
 Grocery Mart         |        1 | Flour (lb.)

But what if we only want to get the item with highest quantity from each store? Fortunately, Postgres has a syntax that makes this easy.

SELECT DISTINCT ON(store_id) name, quantity, item_name
FROM inventory
JOIN store ON inventory.store_id = store.id
ORDER BY store_id, quantity DESC;

         name         | quantity | item_name
----------------------+----------+-----------
 School Supplies R Us |       12 | Pencil
 Grocery Mart         |       12 | Egg

What does DISTINCT ON do? Well, it selects the first row out of each set of rows whose values match for the given columns. Which row comes first is unpredictable unless we pass along an ORDER BY clause. Note that our ORDER BY has to start with the columns from the ON() clause. If it doesn't, we get a helpful error message:

SELECT DISTINCT ON (store_id) name, quantity, item_name
FROM inventory
JOIN store ON inventory.store_id = store.id
ORDER BY quantity DESC;

ERROR:  SELECT DISTINCT ON expressions must match initial ORDER BY expressions
LINE 1: SELECT DISTINCT ON (store_id) name, quantity, item_name
                            ^
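If it helps to see the semantics outside of SQL, here is a rough Python sketch of what the working query computes. It illustrates the idea (sort, then keep the first row per group) and says nothing about how Postgres implements it; the tuple layout is an assumption chosen to mirror the joined rows.

```python
from itertools import groupby

# (store_id, name, quantity, item_name), mirroring the joined rows.
rows = [
    (1, "School Supplies R Us", 1, "Backpack"),
    (1, "School Supplies R Us", 12, "Pencil"),
    (1, "School Supplies R Us", 4, "Pen"),
    (2, "Grocery Mart", 12, "Egg"),
    (2, "Grocery Mart", 1, "Flour (lb.)"),
]

# ORDER BY store_id, quantity DESC
ordered = sorted(rows, key=lambda r: (r[0], -r[2]))

# DISTINCT ON (store_id): keep the first row of each store_id group.
top_items = [next(group) for _, group in groupby(ordered, key=lambda r: r[0])]

for _, name, quantity, item_name in top_items:
    print(name, quantity, item_name)
```

Note that groupby only groups adjacent rows, so the sort has to lead with store_id. That is the same constraint the error message is enforcing.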

If you run into a situation wherein you need to choose a specific row in a group based on some rules, try using DISTINCT ON. For more detail, check out the Postgres documentation.

Work-Self Balance

Samuel Taylor — Wed, 03 Jan 2018 12:58:11 GMT

It's a few months ago. I'm enjoying my job and trying to bring value to my team. A high-visibility project comes up. Despite being unfamiliar with the framework we plan to use, I'm excited to work on this project. I set out to "learn by doing", implementing a chunk of the project with this framework. After reading some code, writing some new code by trial and error, and asking a lot of questions, I'm able to get this chunk finished. More exciting than that, my implementation serves as a model for much of the rest of the project. Being that I know the framework well at this point, I'm teaching the rest of the team how to use it. I love knowing that I'm bringing so much value to the team!

Then, I submit a piece of code for review. One of my team members notices a flaw in it. Immediately, my thoughts rush to how I can justify the flaw. Underlying this reaction is the belief that this flaw reveals my own incompetence. I start to type a response. But as I write, I realize that my coworker is right. I thought I knew everything there was to know about this framework, but clearly I don't. I delete my response and fix the problem instead.

Since then, I've found a healthier way to think of my work. Let's examine the origins of and problems with my initial belief so that we can find a healthier alternative.

The core belief I identify as an issue in the above story is a lack of separation between my work and my self. Creating software can be a deeply personal enterprise. When I'm in the zone, it can feel like the code I'm writing somehow emanates from my being rather than that I am actively writing it. Given this understanding, criticism of my work is also criticism of myself, my character, and my abilities.

This belief (though not one I consciously chose) is harmful. If criticism is painful, human nature says to avoid it. Unfortunately, avoiding criticism means avoiding learning because the best learning can come from making mistakes and fixing them.

We can choose a healthier relationship to our work. Specifically, I find it helpful to mentally separate my code from my sense of self. In other words, I avoid tying my ego up in my work outputs.

This mental model is more true to the realities of software development. On a daily basis, I am faced with countless constraints. Perhaps I must complete a task within a given timeframe. Maybe I have to use a specific tool. These constraints mean that the work I produce cannot be considered to be solely a reflection of my character or abilities. In some way, the work also embodies the constraints I was under while creating it. Given an infinite amount of time and resources, I'm sure all of us would create impeccable and beautiful software. However, in a world of constraints, our work is less likely to be perfect.

How can we rein in our ego? I've noticed that as soon as I am aware that my ego is flaring up, it's easy to see how silly I'm being. To this end, I've found two activities helpful: journaling and meditation. Journaling consists of sitting down in the evening a few times per week and writing about what has happened in the preceding few days¹. By intentionally recalling and reviewing my actions, I'm able to be more objective in my view of myself.

Meditation also helps foster the lens of an unbiased observer. When I meditate, I get deliberate, intense practice at noticing my feelings. Bringing this awareness to the rest of my day becomes easier as I continue this practice. And this awareness allows me to notice when my ego acts up so that I can respond accordingly².

When I'm able to separate my work from my self, I can respond more productively to criticism of my work. I can see that the criticism isn't aimed at me. Instead, something I produced was found to be lacking in some way. Previously, I might waste energy either beating myself up or trying to justify a mistake. But now I can focus on making the necessary improvement.

I still love bringing value to my team. But I've realized how crucial it is to stay humble and how valuable it is to understand that my work is separate from my self. Making a mistake reveals not that I am incompetent, but that I'm human. And really, aren't we all?

1: If you're interested in journaling, I highly recommend Timothy Wilson's Redirect, which explores a bunch of interesting research in self narrative.

2: I'm sure there are other ways of increasing this kind of self-awareness. Let me know if you have any tips of what has worked for you in the past.

Poetry for Software Engineers

Samuel Taylor — Sun, 10 Dec 2017 22:33:25 GMT

Photo by Álvaro Serrano

I am not a poet. But when I was barely three years old, I was sitting at the kitchen table with my mother and older sister. They were working on rhyming words, because that is apparently the stage of childhood development my sister was in. My sister was struggling a little. My mom would ask her, "What's a word that rhymes with ball?" My sister replies, "bend." Mom says, "No, we're trying to make the end of the word sound the same. So a word that rhymes with ball would be fall. What are some words that rhyme with sky?" and my sister, with a confused look on her face, replies, "Soon? Sleep?". Mom says, "No, remember, we want the ends of the words to sound the same, not the beginning. So 'fly' or 'pie'. What are some words that rhyme with moon?" she asks. My sister says, "My?" and to hear my mom tell the story, my little voice pipes up, "Spoon! Loon! June!" which of course prompts my sister to give me an angry look and shout, "Shut up, Sam!"

I learned two things that day. First, it is immensely enjoyable to pick on one's older siblings. Second, I learned that words are interesting. They're fun. I like them. I like the way they sound when you say them and the power they give to express ideas.

In the time between that story and now, I've grown to enjoy poetry. This comes as a surprise to some who know me as a very analytically-minded person. And I get it–I mean, I'm a software engineer. Still, I've found a lot of joy in poetry. Beyond that, I think it's made me a better engineer.

Today we're going to talk about why I think you should read poetry. To do this, we're going to talk about how to read a poem, the ways that beauty is related to software, and a few things we can learn from poets.

So how do we read a poem? First, we need to be sure we're in the right mindset. In our culture, it's common for us to decide whether we like something immediately. For example, a friend of mine went to see the movie Wonder Woman, and I asked him whether it was a good movie. He replied, "Yeah, I liked it." But that's not exactly what I was asking; the quality of a piece of art or media is different from an individual's opinion about that piece. When we read poetry, it's important to understand what it's doing before we decide whether we like it or not. If you decide to read some poetry after this, I encourage you to read charitably. Try to understand what the author is saying and how she or he is saying it before you do anything else.

Once we're in the right mindset, we can dive into reading. For me, it's helpful to have a bit of a process to go through, and I imagine it can make this a little less weird if you are a so-called "left brain" person. I typically read through the poem three times:

On the first reading, I'll pick up a pencil and read through the poem in my head. When I get to any words that I don't know the exact dictionary definition of, I circle them. Then I look up these words in a dictionary. Poets spend a lot of time choosing the words they use, so it's important to understand what those very specific word choices mean.
Step two is to read the poem out loud. As I'm reading through, when I notice any words that stick out to me, I draw a little dot next to them. At this point, I'm not worrying about why they stick out, just keeping note that they do.
On my third reading, I look for allusion to other work, metaphor, imagery, and other literary devices.

Once I've done these three readings, I'm in a better spot to understand what the poet is saying. Let's do a practice run. This is a poem by Langston Hughes called "The Negro Speaks of Rivers".

I’ve known rivers:
I’ve known rivers ancient as the world and older than the
flow of human blood in human veins.

My soul has grown deep like the rivers.

I bathed in the Euphrates when dawns were young.
I built my hut near the Congo and it lulled me to sleep.
I looked upon the Nile and raised the pyramids above it.
I heard the singing of the Mississippi when Abe Lincoln
went down to New Orleans, and I’ve seen its muddy
bosom turn all golden in the sunset.

I’ve known rivers:
Ancient, dusky rivers.

My soul has grown deep like the rivers.

As I read through this, I circled the words "Euphrates" and "Congo"; I know they're rivers, but I'm not 100% on where they actually are or what they mean. I also circled "dusky", because I don't know the exact definition. Let me look these up.

OK, "dusky" means "somewhat dark in color; specifically : having dark skin". The Euphrates is a river in modern-day Iraq, and the Congo is in Africa.

Second reading. In the line "older than the flow of human blood in human veins", I put dots next to both occurrences of the word human. The word "bathed" in "bathed in the Euphrates" also sounded interesting, so I dotted that. I also dotted the word "golden" in "I've seen its muddy bosom turn all golden in the sunset".

Time for the third reading. On this pass, I underline "the Euphrates" as an allusion to the Biblical creation narrative. "The singing of the Mississippi" is an interesting image, as is "muddy bosom turn all golden in the sunset."

Now we're in a better place to understand what this all means. Langston Hughes lived in a deeply segregated America, and I find him making a powerful argument for a sense of common heritage and equality among all people. When he uses the word "I", he's referring to himself–a black man living in the Jim Crow society of the time. To say "I bathed in the Euphrates" is to paint a picture of the first man in the Bible as black. I imagine this was a shocking image to the society of the time, which had a "One Drop Rule" that meant if a person had even one drop of black blood in them, they were subject to the brutal segregation of Jim Crow. In this poem, Hughes shows that we are all human, thereby undermining the subjugating logic of Jim Crow.

I find this poem to be beautiful. Its construction is clearly well thought out, and the story it tells is a persuasive argument against segregation. But it may seem like we're on a bit of a rabbit trail here–how does any of this relate to software? Allow me to answer your question with a story.

Have you ever heard of "fizz buzz"? It's a well-known interview question in which you're supposed to print out the numbers 1 through 100, except when a number is divisible by 3 print "fizz", when it's divisible by 5 print "buzz", and if it's divisible by both print "fizz buzz". So the sequence goes 1, 2, fizz, 4, buzz, fizz, 7, 8, fizz, buzz, 11, fizz, 13, 14, fizz buzz, and so on.

The specific code to solve this problem could take a variety of shapes, but a straightforward Python implementation took around 9 lines. By contrast, there's a satirical GitHub repository called Fizz Buzz Enterprise Edition that consists of 1,387 lines of Java code spread across 89 files. Fizz Buzz Enterprise Edition is an exercise in using every design pattern you possibly can regardless of whether it actually improves the code.
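For a sense of the short end of that spectrum, a straightforward version could look something like the sketch below. Treat it as an illustration of the problem statement, not necessarily the exact script mentioned above.

```python
def fizz_buzz(n):
    """Return the fizz buzz word for n, or n itself as a string."""
    if n % 15 == 0:          # divisible by both 3 and 5
        return "fizz buzz"
    if n % 3 == 0:
        return "fizz"
    if n % 5 == 0:
        return "buzz"
    return str(n)


for n in range(1, 101):
    print(fizz_buzz(n))
```

The check for 15 has to come first; otherwise a number like 15 would be caught by the divisible-by-3 branch and print only "fizz".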

I gave a presentation on the subject of poetry to some coworkers at one point, and I showed them the 9-line Python script followed by a single one of those 89 files in the Enterprise Edition. I asked them, "Which of these codebases is better?". The response was unanimous, of course: the Python script was better. When I asked them why they thought so, it took only a few seconds for someone to say the word "elegant."

And therein lies the answer to your question. When we start talking about what makes good code, the discussion quickly gets to the idea of "elegance", which I would say is a specific way of talking about beauty. In the software industry, we like to think of ourselves as innovators creating novel inventions, but when we start talking about beauty, we're incredibly late to the party. While some in this industry lambaste the humanities as useless, human beings have been trying to understand beauty for thousands of years. We would be foolish to throw away all we've learned.

Being a great engineer involves writing great software. And ideally, our software is elegant and beautiful. To get a sense for what those terms mean, we can turn to poetry. I've found that as I read more poetry and understand how it's constructed, I'm able to apply that knowledge to structuring my team's software. Understanding the way that a poet specifically chooses her or his words to fit the intention is fascinating and informs the way I choose to name functions, variables, and classes.

This takes us into our third point: we can learn a lot from poets. One day, the poet Ezra Pound was sitting in a subway station in Paris. As he looked around at the people walking through the station, he was overcome by a unique feeling. Like any good poet, he strived to capture that emotion in a poem. His first version was 30 lines. After six months, he'd carefully crafted and whittled it down to 15 lines. A year later, he published the poem In a Station of the Metro, which reads:

The apparition of these faces in the crowd;
Petals on a wet, black bough

Whether you like this poem or not, it's impressive in its ability to convey an image and an emotion by connecting two seemingly unrelated images. In the process of slowly and carefully carving away the excess and cruft from the poem, Pound is able to compress a lot of information into fourteen words.

We should strive to be more like poets. Good poets are able to express their ideas in precise language that communicates clearly. A huge part of the way they do this is by choosing their words carefully. This process is directly applicable to our work creating software. We're trying to encode some real-world process or use case into code that communicates our intentions to future maintainers. If we're sloppy with the way that we name things or structure our programs into functions, classes, or packages, the things we create will do a much worse job communicating the idea we have in our head to future maintainers.

Poetry is a useful hobby for software engineers. By helping us to better understand beauty and communication, poetry helps us develop skills that are directly applicable to creating good software.

I want to leave you with one last poem. Before I read this, I'm going to ask you to do something. I'm not trying to be manipulative, but I'm hoping to help you understand this author better. Take a few moments and think of someone who means a lot to you; it could be a family member or a close friend, but think about all the things that they mean to you and ways they improve your life.

The Gate, by Marie Howe

I had no idea that the gate I would step through
to finally enter this world

would be the space my brother's body made. He was
a little taller than me: a young man

but grown, himself by then,
done at twenty-eight, having folded every sheet,

rinsed every glass he would ever rinse under the cold
and running water.

This is what you have been waiting for, he used to say to me.
And I'd say, What?

And he'd say, This—holding up my cheese and mustard sandwich.
And I'd say, What?

And he'd say, This, sort of looking around.

Thanks for reading. I would love for you to tweet your favorite poem to me; I'm @SamuelDataT. Now get out there and read some poetry!

Monte Carlo Simulation with Categorical Values

Samuel Taylor — Sun, 03 Dec 2017 15:55:29 GMT

We live in a world of imperfect information. When faced with a lack of data, we can make a guess. This guess could be far from the truth; it could be spot on. In Monte Carlo simulation, we repeatedly make guesses of some unknown value according to some distribution and are able to report on the results of that simulation to understand a little bit more about the unknown. While any one guess may be far from the truth, in aggregate those outliers don't have as much of an effect.

I ran into a situation where I was gathering some data with some level of imperfection. My stakeholder wanted to know what the impact of that imperfection on the important metrics would be. I could have made a guess, but instead I turned to the data. Initially, I thought to calculate the best case and worst case scenarios. This idea is useful in that it gives you a range on what you don't know, but it's also beneficial to know how likely each of those scenarios (and things in between) are. That's where Monte Carlo simulation comes in handy.

For the purposes of context, let's use a contrived example. Suppose I run a car dealership, and a major hail storm rolled through last weekend. Some of my cars suffered major damage and will incur a 10% loss in their value, some suffered minor damage incurring a 5% loss, and some suffered no damage at all. I don't have enough labor to survey every single one of my cars, but I do want to know how much money I can expect to lose. I randomly select 500 of my 1000 cars and see how bad the damage was.

Hypothetically, it's possible that the 500 cars I inspected were the only ones that happened to have been damaged by the hail (maybe the rest were safe in my 500-car warehouse). That would be the best case scenario, in which case I suffer only the loss on the cars I inspected. In the worst case scenario, every car I didn't inspect suffered major damage. For some random data I generate below, we know that the amount of damage done to the uninspected cars is somewhere between 0 and around $1.75 million. This is a huge range of possibilities!
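The upper bound is back-of-the-envelope arithmetic. The sketch below assumes each uninspected car is worth the average of $35,000, which is the mean used to generate the made-up inventory; the constant names are just for illustration.

```python
N_UNINSPECTED = 500
MEAN_CAR_VALUE = 35_000      # mean of the made-up value distribution
MAJOR_DAMAGE_PCT = 0.10      # loss in value for a car with major damage

# Best case: none of the uninspected cars were damaged.
best_case_loss = 0

# Worst case: every uninspected car suffered major damage.
worst_case_loss = N_UNINSPECTED * MEAN_CAR_VALUE * MAJOR_DAMAGE_PCT

print(f"best case:  ${best_case_loss:,.0f}")
print(f"worst case: ${worst_case_loss:,.0f}")
```

The exact worst case for any particular random inventory will differ a little, since the actual car values vary around that mean.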

We see that looking at the best case and worst case gets us some bounds on how bad the damage could be, but we have no idea how probable each of these options is (let alone the probability of the values in the middle). To find out how bad the damage is likely to be, we can turn to Monte Carlo simulation.

First, let's randomly generate our inventory, then inspect half of our cars. While we know the value of each car, we think of the damage_pct as an unknown value for the cars we do not inspect.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

N_CARS = 1000
cars = pd.DataFrame({
    'value': np.random.normal(35000, 5000, size=N_CARS),
    'damage_pct': np.random.choice([-0.1, -0.05, 0], size=N_CARS, p=[0.1, 0.5, 0.4])
})

inspected_cars = cars.sample(cars.shape[0] // 2)
uninspected_cars = cars[~cars.index.isin(inspected_cars.index)]

damage_dist = (inspected_cars.groupby('damage_pct').count()
               / inspected_cars.shape[0]).rename(columns={'value': 'prob'})

We can see the distribution of damage_pct among the sampled cars is¹:

damage_pct	prob
-0.1	0.096
-0.05	0.490
0.00	0.414

Because we inspected a random subset of the cars, a reasonable simplifying assumption is that the damage to the uninspected cars has the same distribution as that of the inspected cars. With that assumption, we can simulate the damage done to the uninspected cars like so:

def simulate_damage(damage_dist, uninspected_cars):
    damage_pct_guess = np.random.choice(damage_dist.index,
                                        size=uninspected_cars.shape[0],
                                        p=damage_dist.prob)
    return (damage_pct_guess * uninspected_cars.value).sum()

N_SIMULATIONS = 1000
simulated_damages = pd.Series([simulate_damage(damage_dist, uninspected_cars)
                               for _ in range(N_SIMULATIONS)])

To make sense of the output of these 1000 simulations, we can calculate some descriptive statistics. It's also helpful to look at the CDF of the simulated damage.

print(simulated_damages.describe())
simulated_damages.plot(kind='hist', cumulative=True, bins=100,
                       title='CDF of estimated damage')

count      1000.000000
mean    -595662.074433
std       25097.417355
min     -671499.963571
25%     -613690.059583
50%     -595382.298010
75%     -576089.086686
max     -524043.037134

We can say that in half of our simulations, the damage was somewhere between $671,499.96 and $595,382.30. This range is about 4.3% of the size of the range between the best and worst case scenarios.

How'd we do? Because we made up the dataset, we can calculate the true value and put that number on the graph above:

true_damage = (uninspected_cars.value * uninspected_cars.damage_pct).sum()
bottom, top = plt.ylim()
plt.annotate('', xy=(true_damage, top), xycoords='data',
             xytext=(true_damage, bottom), textcoords='data',
             arrowprops=dict(facecolor='red', width=3, headlength=1))

In "reality", the true amount of damage done was $607,830.94, which just so happens to be in that window of 50% of our simulations.

We can run this experiment a few more times to see how this method fares:

The next time you're trying to reason about some unknown value, consider using Monte Carlo simulation to inform your decision-making process.

1: While the values for damage_pct look like numerical data, remember that they are representative of the three categories of damage sustained (none, minor, or major).

Use Machine Learning to Find your Next Job

Samuel Taylor — Sun, 12 Nov 2017 01:42:33 GMT

Delivered at:

DevSpace on 14 Oct 2017. Slides available here.
DataSciCon on 30 Nov 2017. Slides available here. Video available here.

Transcript

A wise man once said, "I got ninety-nine problems," and I can relate to that in some sense. Because on a day to day basis I run into problems; I run into things that aren't as easy as they should be or things that I want to be better. And I suspect that because you are all in this room on a Saturday you also have problems, you also run into things and want to use software to solve them. Today I want to talk about ways that we can use software to solve our problems and specifically to give those software solutions some intelligence using data.

Now the motivating example and the one for which this talk is titled is a job search helper thing that I made. Basically what happened was a few months ago I was passively job searching, which is to say that I wasn't actively out there knocking on people's doors and handing out résumés, but I was curious to see if there were any particularly excellent jobs in my area. I went out and tried to sign up for different job alert things that would give me the coolest jobs and I couldn't find anything that did exactly what I wanted it to do. Like any good Engineer I decided to build it myself. I built this little email newsletter that I would send to myself every week that essentially had the coolest sounding jobs in my area. I could go through and just review those jobs. It was basically a way to filter out a lot of the noise. And so we're going to use this as sort of a case study today, to talk about a process that I've gotten to use a few times that has worked for me. I wanted to share it with you all to hopefully provide some value in your own lives.

So this is how we're going to be doing that. The astute among you have probably noticed we are currently in the introduction. After that we're going to talk about asking the right question; basically phrasing questions in ways computers can help us answer them. Once we do that we'll talk about ways to gather the data, and then we'll analyze the data. Finally, we'll deploy the insights that we gather, and here I don't mean deploy in the sense of we're going to put the code on a server somewhere. That part is interesting, but more relevant to this discussion is how do we get a number to be something interesting to a person and express it in a way that people can understand.

So as for me I'm originally from this part of California (Bakersfield, CA). It's the really boring part, but it was a great place to grow up, and then I left and went to Baylor University where I studied Computer Science. I really enjoyed my time there, and while I was there I sort of got bit by the data bug and got to do some research in an autonomous drone lab that was really exciting, do some research with collaborative filtering (which is a recommender systems kind of thing) that got me started down the path of the thing that I ended up building for this. Another relevant thing I got to do while I was there was I taught a computer sign language which was really fun. And then over the summers I would go do internships in various places and learn a lot about software and good engineering practices. I tried to unify this all together, and then after I graduated went and started doing some data engineering work, and now I actually work at Indeed (which is the world's largest job site). Interestingly enough literally everything that I'm going to talk about today is completely unrelated to my job there (other than the fact that I do data stuff there too) but all the code that I wrote for this I actually wrote while I was at my prior company. But the most important thing on this map is the fact that we're all here in this room today to talk about this stuff, and I'm really glad that you are all able to come and honored by your presence here today.

So that's all the boring stuff–let's get in to the cool part! The first step in this process is to have a problem. This is the easiest step because it's the one that just comes naturally. It's the one that you bump into on a day by day basis. For me, I bumped into the problem that job alerts are too noisy. There's too many jobs for me to reasonably look over in a short amount of time. I've also run into problems where I was trying to figure out how to get my energy bill lower or trying to figure out how to get home from work faster. Once you have a problem the next thing you're going to want to do is start to think about solutions. In order to do that, you need to understand the ways we can ask computers questions and get useful answers back from them, which leads us to the fun buzzword of the day: Machine Learning.

If you have heard the phrase "Machine Learning" please raise your hand. Alright, yeah it's a buzzword we've all heard. If you feel like you have used machine learning in a substantial or interesting way, raise your hand. Awesome, some more great hands. Alright, one last question: if the phrase approximation-generalization tradeoff means anything to you, please raise your hand. That's fine we'll talk about it later, I just wanted to know what to go into. Thanks for your participation there.

So what is machine learning? There are a few different things that comprise it, and I'm going to talk about a subset of it today. There are a few kinds of algorithms broken into a lot of different categories, but this is good enough for today. There's a type of algorithm called a supervised algorithm in which you are basically feeding training data into a computer. That training data has a number of features that are like input values and then a number of output variables. Here on this graph what you see is that the X axis is age and the Y axis is net worth. The example problem here is basically: I'm a bank and people are coming to me asking for a line of credit. I'm trying to decide whether to extend them a line of credit or not. One way you could theoretically do this is to just look at your past history and say, "Okay when people have come to us in the past and asked for credit, how old were they? What was their net worth? And did we extend credit?" That's what this graph is displaying: the age and then the net worth. The plus or minus you can think of as a one or zero of whether or not we extended them a loan. And then the machine learning part of this is basically draw a line through this data. It doesn't have to be a line but a line works for this so we're just drawing a line. And you'll see here that because in the past we had someone who is ninety and did not have very much money who came to us and asked for a loan and we rejected them, if someone who is similarly aged and similarly wealthy came to us, they will be below this line, so we would not extend them a line of credit. That is an example of a classification problem, because there are classes involved. There is a "positive" (we did extend them a loan) or a "negative" (we didn't extend them a loan) kind of question. Classification is great for when you're trying to find out what kind of thing something is.

Now, if I were a bank trying to decide how much credit I should extend to a person, I would have to use a regression algorithm. Regression is very similar to classification in that you still have a number of input features and an output of some sort. Here it's a little bit confusing because here only the X axis is an input. On the last slide both X and Y were inputs and then the plus and minus was the output, but here we're just saying that the X axis is our input of net worth. Someone comes up to us and they say, "I have five hundred thousand dollars, how much of a loan can I get?" The "X" symbols aren't significant (they're just to mark position), but you can see (for instance) someone who had a very high net worth down near that bottom left hand corner only got a loan of a thousand dollars because that was a more risky person. As a bank, I have all this historical data, and I can train some sort of algorithm that would again draw a line, and we could then say if someone comes up to us and has a net worth of seven hundred fifty million dollars we can look at where they would land on the X position of the line. The Y value then would be the size of loan we extend them. Here it looks like seventy five thousand dollars or something. So that is another kind of supervised machine learning algorithm.
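To make "draw a line and read off the Y value" concrete, here is a minimal least-squares sketch in plain Python. The historical numbers are invented for illustration (they are not the data from the slide), and both net worth and loan size are assumed to be in thousands of dollars.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one input: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical history: net worth -> size of loan extended (both in $ thousands)
net_worth = [100, 250, 400, 500, 650, 800]
loan_size = [12, 24, 41, 50, 66, 79]

slope, intercept = fit_line(net_worth, loan_size)

# A new applicant's position on the X axis gives the Y value of the line
predicted = slope * 750 + intercept
print(f"suggested loan: about {predicted:.0f} thousand dollars")
```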

There are also unsupervised algorithms. One such algorithm is called clustering and in clustering you have a bunch of data points, and I apologize that this is not the same example, but you can imagine that each of these dots is a customer. The X axis could again be their age and Y axis could again be the net worth. Maybe it's computationally prohibitive to do the calculation on the full data set, but you could theoretically cluster people and say there are (for instance) eleven kinds of people. Then, depending on the kind of person you are, we could make a decision based off of that. In clustering, you're not trying to get a specific output, you're just trying to understand the data better. It's often useful as a preprocessing step. You might again have someone come in and they have a certain net worth and certain age. You could say this person is really similar to this other kind of person that we usually extend credit to, so let's extend credit to this person.

There's a lot of other stuff in the field of machine learning that is time prohibitive to talk about today. So this third of this slide is here to prevent angry tweets because there's a lot of stuff that is really interesting that just doesn't quite fit in today.

So once we know the kinds of questions we can ask a computer, we can figure out a way to phrase our question. In my example, I'm thinking, "OK, job alerts are too noisy for me. What do I want? I want to know what are the coolest jobs. OK, well maybe I can ask a computer. Given my input of a job title, give me the output: does it sound cool or not (just as a one or a zero)." And that's a way we can phrase our question in a way a computer can actually help us with. So now that we have this formulation of our problem, we can jump into gathering our data.

There's a lot of data out there, and the best thing to do is just search for it. Go out and Google it. For instance, one time I was trying to determine my energy usage, and I thought it was probably going to be correlated with weather. I was looking for weather data, and there is this government agency called NOAA that has a big weather data set that you can just download and use. And so it's very likely that you'll get out there and search for something and there's already a government agency whose job it is to collect this data, which is really exciting because it means you then have to do less work. In the case where you don't find something that already exists through searching, you can also try various websites. data.world is one, and Kaggle datasets has a similar feel, where they have a bunch of existing datasets about usually more broad things. They don't tend to be as specifically relevant, if that makes sense, although they'll have things like crime data on their website. That may or may not be useful to you, but if you're trying to figure out where to buy an apartment and you want to look at crime statistics, that dataset might already exist.

So you may or you may not find the data you need, and if you don't you're going to have to create it at some point. I like using spreadsheets for this because I can have them on my computer and I can have them on my phone, and anywhere I am I can collect more data. Other than that there's a tool called If This Then That that can be useful, especially when you're collecting data on your own personal habits. For instance, when I was trying to find out when the best time to leave my office was to minimize my commute time, you can get a little button that IFTTT will make for you where when you click it, it'll log your location and the current time to a Google Sheet. So what I would do was when I left the office, I would press the button then when I got home I would press the button again. In that way I could calculate how long it took me to get from my office to my home and at what time I left. And then I could have all this data about, okay you can leave at this time (that's your input value) and it took you this long (that's the output value). Now I know that Google Maps can also do this for me, but I'm a nerd and we are at a developer conference so I think it's fair to over-engineer something.
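As a sketch of the button-press idea, assuming the sheet ends up as a simple list of timestamps that alternate between "left the office" and "got home" (the rows and the timestamp format here are made up, not what IFTTT actually writes):

```python
from datetime import datetime

# Hypothetical log: presses alternate between leaving the office and arriving home
presses = [
    "2017-09-04 17:02", "2017-09-04 17:41",
    "2017-09-05 17:30", "2017-09-05 18:25",
    "2017-09-06 16:45", "2017-09-06 17:15",
]

def commute_records(presses):
    """Pair up presses and return (departure time, minutes taken) tuples."""
    times = [datetime.strptime(p, "%Y-%m-%d %H:%M") for p in presses]
    records = []
    for left, arrived in zip(times[0::2], times[1::2]):
        minutes = (arrived - left).total_seconds() / 60
        records.append((left.strftime("%H:%M"), minutes))
    return records

records = commute_records(presses)
for departure, minutes in records:
    # departure is the input value, minutes is the output value
    print(departure, minutes)
```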

Beyond that, web scraping is another great tool. What this basically is, is downloading a website and picking out the important bits. There are some legal things here, and I am not a lawyer, so do your own lawyer stuff, but an important tip is that when you're trying to scrape a website, look at their robots.txt. Whatever website you're on, take the domain name and put /robots.txt after it, and it'll have a listing of basically the places you're not supposed to go if you are a computer. Please obey that and you're probably fine, but again I'm not a lawyer and this is not legal advice.
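Python's standard library can do the robots.txt check for you. The sketch below parses a made-up robots.txt from a string so it runs without network access; the rules, the example.com URLs, and the "my-job-scraper" user agent are all placeholders. Against a real site you would call set_url() and read() on the parser instead of parse().

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, inlined so the example needs no network
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask whether our (made-up) scraper may fetch each URL
allowed = parser.can_fetch("my-job-scraper", "https://example.com/jobs")
blocked = parser.can_fetch("my-job-scraper", "https://example.com/private/data")
print(allowed, blocked)
```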

And maybe the case is that you combined these two methods. That's exactly what I am doing in this project. I web scraped a bunch of job titles, and then when I had spare time on the bus or something I could click through the links on my phone, read the description and come back to say whether or not the jobs sounded cool. Columns A through D here are existing data and then column E is the augmented data that I'm creating myself.

You're going to need to clean this data. I heard someone speaking at a conference and they said, "Fifty percent of data science is cleaning data." And when he got done he had all these people coming up to him that said, "That's ridiculous! At my job it's eighty percent!" There's two tools that I highly recommend if you're in the Python ecosystem: Pandas (which does a great job of loading data into a tabular format in memory). I've heard it described as "in memory SQL". And then scikit-learn has some stuff built in to massage data into a format that computers can more easily understand that we'll get to in a moment.

Now you may remember this graph from before that had numeric data. Computers are good at numbers; computers aren't as good at words. You may think, "Well, if I had someone's age and net worth, I can easily see how those are just numbers. But for something like a job title, that is different. That doesn't feel like I can just type that into a computer and have it fit that into a graph because I don't even know how that mapping would work." And so we want to introduce something where we take as input the job title and as output whether or not it sounds cool, then turn it into some set of numbers. The great thing is that when you run into a problem like this there are a wealth of giants whose shoulders you can stand upon. You can just Google "text representation for machine learning" and out will pop this probably. This is the idea of word count vectors or "bag of words." Essentially what's happening here is you'd take all of your job titles and you keep track of either all of the words that were used in every single one or maybe the three hundred that are used most frequently. Then you stack them all up, go through each job title, and count how many times each word occurred in the given job title. So we can see for this first job title "Senior Web Applications Developer" that the word "Engineer" occurs zero times in this job title and the word "Web" occurs one time, etc. I'm not going to bore you by enumerating this matrix but you see how this process works. "Word count vectors" is a fancy way of saying strings of numbers that count up how many times a given word is in a given job title, and that gives us exactly what we're looking for. We can now go from the job title and the output variable (of whether or not it sounds cool) to this set of numbers where these first ten numbers are the number of times a given word occurs (so maybe that first number is "senior") and that last number there is a one because that job title sounded cool to me.
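The counting process described above is small enough to sketch with just the standard library. Only the first job title comes from the talk; the other two are made up, and lowercasing plus splitting on whitespace is a simplifying assumption about how words get separated.

```python
from collections import Counter

titles = [
    "Senior Web Applications Developer",  # from the talk
    "Data Engineer",                      # hypothetical
    "Senior Data Engineer",               # hypothetical
]

def tokenize(title):
    # Simplest possible tokenizer: lowercase, split on whitespace
    return title.lower().split()

# Every distinct word seen in any title, in a fixed (alphabetical) order
vocabulary = sorted({word for title in titles for word in tokenize(title)})

def to_vector(title):
    """Count how many times each vocabulary word occurs in this title."""
    counts = Counter(tokenize(title))
    return [counts[word] for word in vocabulary]

vectors = [to_vector(title) for title in titles]
print(vocabulary)
for title, vector in zip(titles, vectors):
    print(vector, title)
```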

And so now we can start actually analyzing this data, which is great. There's a few tools that I recommend for doing this kind of work. Jupyter is really great; it's an interactive programming thing. Essentially you run it on your computer, and you can load a browser up and do stuff, and it'll show you the output of it immediately (which is super helpful). It's nice being able to see what the data looks like and it's nice to be able to understand what your next step should be. You can also do neat things like drawing graphs, such as the one shown in this screenshot. The maintainers of this project have put in a ton of work to make it basically the de facto, interactive, iterative programming tool for data science and data analysis people who are using Python. I spend a lot of my day in Jupyter Notebooks at work.

I also definitely recommend Pandas and scikit-learn (which we talked about earlier). It's nice to not have to re-implement all these algorithms from scratch because other people have already done it for you.

So this is just a little code example to show you how easy this kind of stuff can be. Oftentimes we talk about machine learning and it sounds really scary and foreign. But when you actually look at the code you'll realize this is something anyone can do. This is not complicated, it's just a little bit of understanding how these algorithms work and then reading some documentation. I often call things X and Y because I'm just used to that nomenclature so I take our job titles out of the dataset that I have and I put them into this X matrix. I take whether or not they sound cool and put that in this Y vector. The next line is a CountVectorizer (it just does that word counting thing that we were talking about earlier) and then you can just say, "OK, take this matrix and turn it into the word counts." Then you create a model, you fit the data to it and then you can just call .predict on it and it will give you this beautiful array (that I have here bolded) that says, "Job zero did not sound interesting, and then job four does sound interesting." You can get this all out very easily; it's not a lot of code to get a lot of value.
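The slide itself isn't reproduced in this transcript, so what follows is a reconstruction of the steps just described rather than the original code. It assumes scikit-learn is installed, uses made-up job titles and labels, and picks LogisticRegression as the model purely as a guess; the talk does not say which model was on the slide.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical hand-labeled data: 1 = sounded cool, 0 = did not
titles = [
    "Senior Web Applications Developer",
    "Data Scientist",
    "Machine Learning Engineer",
    "Accounts Payable Specialist",
    "Data Engineer",
    "Payroll Administrator",
]
sounds_cool = [0, 1, 1, 0, 1, 0]

# Turn the job titles into word counts (the X matrix) and keep labels as Y
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(titles)
y = sounds_cool

# Create a model and fit the data to it
model = LogisticRegression()
model.fit(X, y)

# New titles must go through the same vectorizer before calling .predict
new_titles = ["Senior Data Scientist", "Payroll Clerk"]
predictions = model.predict(vectorizer.transform(new_titles))
print(predictions)
```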

The thing you want to do after you've been able to gather your data is just do the simplest thing that could possibly work. There's good reasons for doing this. Earlier, no one knew what the Approximation-Generalization Tradeoff was. My hope is that you are about to learn. The idea here is that the better your algorithm approximates the input dataset, the worse it is going to do at generalizing to data that is outside of your input data. That's a little hard to just say and have it be understood, so I made a little graph that I think will help. In the process of making this graph, I first generated a true dataset. You can see here the blue line and this basically says that when we enter zero the value that comes out is zero, and then at the far right end of the scale when we entered ten we expected negative twenty two out. It is a very simple function that we're trying to estimate with our machine learning stuff (it's purely for illustration). And you can see then I generated ten data points from this blue line by just adding a little bit of noise to each point because the real world is fairly noisy for a number of reasons. And then I fit two different models. The red line is a nearest neighbor model (which is a more complicated model than the linear regression model,) and you can see that it does an excellent job of representing the dataset that we gave to it. It is matching perfectly at every single data point there which is great, but you can also see that it is not very close to the truth. If you were to draw more data points from the same truth we would basically find that the red line isn't doing a good job of generalizing to data that it has not seen yet. This would be like if you were taking a practice test for a math class and all you did was memorize the right answers to each question. 
You would do great at the practice test, but once you got to the real test you would do horribly (because you don't actually know how to do any of it, you just know the right answers).

By contrast linear regression is doing a much worse job of approximating the input data set. If you see for X value equal zero it's roughly five units underneath that data point and similarly from the range like two, three, four it's not doing a great job either of approximating the input data. However one thing that you'll notice is that that line on average looks a lot closer to the truth than the red line does and the reason is that it is better able to generalize to out of sample data. What we can see here is that the red line is doing what is called overfitting, which is when you learn too much of the noise in your data. And that's a real problem that is easy to run into, especially when you're using complicated methods. There's a lot of really interesting stuff on Hacker News that will have you believe that you should use TensorFlow and PyTorch and all these really interesting and exciting deep learning frameworks. And they are all very interesting and very exciting and have great applications, however they are often more prone to overfitting than some simpler models. So it's great to just start with something simple and you can move on from that if you need to.
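Here is a small, dependency-free sketch of that comparison. It assumes the "truth" is a straight line through (0, 0) and (10, -22), uses a hand-picked list of offsets in place of random noise so the result is repeatable, and measures generalization against noise-free points the models have not seen. None of these specifics come from the slide.

```python
def truth(x):
    # Assumed true function: a line through (0, 0) and (10, -22)
    return -2.2 * x

# Ten training points: the truth plus hand-picked "noise" offsets
train_x = [0.5 + k for k in range(10)]
noise = [3, -2, 4, -3, 2, -4, 3, -2, 4, -3]
train_y = [truth(x) + n for x, n in zip(train_x, noise)]

def nearest_neighbor(x):
    """Predict the y of the closest training point (memorizes the data)."""
    i = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
    return train_y[i]

def fit_line(xs, ys):
    """Ordinary least squares for one input: returns (slope, intercept)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(train_x, train_y)

def linear(x):
    return slope * x + intercept

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Unseen points drawn from the truth
test_x = [0.3 + k for k in range(10)]
test_y = [truth(x) for x in test_x]

print("train error  nn:", mse(nearest_neighbor, train_x, train_y),
      " linear:", mse(linear, train_x, train_y))
print("test error   nn:", mse(nearest_neighbor, test_x, test_y),
      " linear:", mse(linear, test_x, test_y))
```

The nearest neighbor model has zero error on the points it memorized but a larger error on the unseen ones, while the line does the opposite.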

Another benefit is it's often just easier. It's a lot easier to get scikit-learn running on a Mac or a Windows computer than it is to get PyTorch or TensorFlow running. Aside from that, when you're iterating through this development process it's a lot easier to have something that trains really fast and you can try out a bunch of different ways of representing your data or different ways of sampling it (and have that be fast) rather than something that takes a long time to train. In practice what this means is just start with something simple. Linear regression and logistic regression are both great models that are good places to start, and you can use them for regression and classification, respectively.

So we've gotten to the point here with this that we're in the deployment process. By this I mean that we have these numbers right? We got those numbers out of our model (the zero and the one), but if I were to just look at zero, one, zero, one, zero, zero, one, that doesn't do anything for me. And I also don't want to have to run that code myself every time so one thing that I was thinking (because I am my own user, and I can kinda read my own mind) was I wanted to build this email which you already saw earlier (spoiler alert, sorry) but basically the thought is I don't want to run this thing on my own and I want to get just the relevant and interesting jobs. So what I did was just put it onto a server that I rent and have it run every week, and then send me just the jobs that are interesting. It formats them a little nicely and puts them into an email and sends them off to me. And this is a good thing to do: just build something simple and ship it. Get it working, because the next step here is to test out our product (and if you've gone to any of the talks that people have been doing about agile practices, this should sound familiar, because just because we're using data in the software does not make it any less software). We still need to test out our product, we still need to try it on actual users, we still need to figure out what doesn't work and what does work, we still need to iterate and we still need to iterate again and try something new. And we need to iterate even another time, and we just keep moving and keep trying new things until we get to something that works really well.
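A minimal sketch of the "turn zeros and ones into an email" step, using only the standard library. The job list, the example.com links, the addresses, and the build_newsletter helper are all hypothetical; actually sending the message (for example with smtplib on a weekly schedule) is left out.

```python
from email.message import EmailMessage

def build_newsletter(jobs, predictions, recipient):
    """Keep only the jobs the model flagged with a 1 and format them as an email."""
    cool = [job for job, flag in zip(jobs, predictions) if flag == 1]
    message = EmailMessage()
    message["Subject"] = f"{len(cool)} cool-sounding jobs this week"
    message["To"] = recipient
    message.set_content("\n".join(f"- {title}: {url}" for title, url in cool))
    return message

# Hypothetical scraped jobs and the model's output for each one
jobs = [
    ("Data Scientist", "https://example.com/jobs/1"),
    ("Payroll Clerk", "https://example.com/jobs/2"),
    ("Machine Learning Engineer", "https://example.com/jobs/3"),
]
predictions = [1, 0, 1]

message = build_newsletter(jobs, predictions, "me@example.com")
print(message["Subject"])
print(message.get_content())
```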

So to summarize because that was a good amount of things:

Step one is to have a problem. Find something that you want to solve that sucks in your life.
Once you have that thing, phrase it in a way that computers can help you answer it. There's a lot of things computers are good at. If you have a problem that you can rephrase as a "how much?" or "what kind of?" problem, those are really great candidates for a machine learning application.
Once you do that, you're ready to get some data.
Try the simplest thing you possibly can and see how well that works. You can test out and iterate from there. But it's really important to get this in front of users and to just try new stuff.

So that is the gist of what I have for you today. If you want to learn more, there's a text book called Learning from Data that is really excellent. It does a good job teaching machine learning in a good way. A lot of times when people talk about machine learning, it comes off as a bag of tricks, but this book does a good job of helping you understand some of the theoretical underpinnings that help make these algorithms work. And then from a less academic but more practical side there's a blog called Practical Business Python where the author talks a lot about data visualization and how to do useful stuff with Pandas and it's extremely useful when you're trying to learn this stuff.

Also sponsors are good. We like them. If you are looking for a job I'm sure some of these people are hiring, and we're very grateful to have them here. I would be remiss not to thank my employer (Indeed) because they paid for me to be here so thanks for that. Other than that I hope that you've gotten something out of today and would love to meet you all after. If you have any feedback: if it's negative please email me; I would love to hear what I can do to make this better. If you have any positive feedback please tweet it.

I hope you've gotten something out of today and are better able to go solve your problems.

To become a great Python developer, quit reading Python books

Samuel Taylor — Tue, 31 Oct 2017 11:47:23 GMT

Learning is a crucial part of being a software developer. Increases in knowledge and skill can make a significant impact in our work. But in an age where everyone seems to be in a time crunch, we want our learning to be as effective as possible. We want to get as much learning out of our books, courses, and videos as we can. Counterintuitively, I believe one of the best ways to maximize knowledge gain is by reading books that do not apply easily to technologies or concepts we already understand.

Over the last year, I've read two books that aren't written in Python (the primary language I use at work): Working Effectively with Legacy Code, by Michael Feathers, and Practical Object-Oriented Design in Ruby by Sandi Metz. Initially, I was skeptical as to whether I would gain much from them, as specific tactics seemed like they wouldn't be applicable outside the language the author chose to use for each book. To my surprise, I learned more from these books precisely because they were not written in the primary language I use.

I find three reasons for this. The first is that reading unfamiliar languages causes my brain to comprehend the material better. Because I don't know Ruby at all, I have to put in more work to understand the text. When I read a snippet of Python code in a book, I'm more likely to skim. But when I come across a bit of C++, my brain has to focus to really understand how the code works.

The second reason I think these books had an outsized contribution to my development skills is that they caused me to think through the way to apply the ideas in a new context. While reading about how to make a well-designed Ruby class that was amenable to testing, I was thinking about how to apply those lessons in Python. To be clear, I had to choose to do this. And if you want to maximize your learning, you'll have to do the same. By forcing my brain not merely to understand the information, but to also apply it in a new context, I learned the material more thoroughly.

The final reason I think these books were so helpful is that they are great books. Kind of obvious, right? A well-written book is more helpful than a poorly-written book. When we limit ourselves to just the materials relevant to our day-to-day work, we miss out on gems written in other languages. Plenty of excellent books that can help us become better software engineers exist; we shouldn't exclude some simply because of the language their author chose to use for them.

Putting ourselves in unfamiliar situations is hard. Because of that difficulty, it also has a huge potential to bring about learning and growth. Pick up a book in a language you don't know and thoroughly study it–you may be surprised by how much you learn.

Build a "function with a memory" in Python

Samuel Taylor — Sat, 28 Oct 2017 16:32:48 GMT

Are you familiar with the __call__ method in Python? By defining this method, an instance of your class can be called as though it were a function. Here's a contrived example solely to demonstrate how it works:

class Foo:
    def __init__(self, name):
        self.name = name
        self.call_ct = 0
        print(self.name, 'initialized')

    def __call__(self):
        self.call_ct += 1
        print(self.name, self.call_ct)

bar = Foo('bar')
bar()

baz = Foo('baz')
baz()
bar()

Output:

bar initialized
bar 1
baz initialized
baz 1
bar 2

A more interesting use case is given by Brandon Rhodes, that of swapping out an http_get(url) method for an object that caches pages. Say for instance that we are maintaining a project that includes the following web crawling code:

import urllib.request
from urllib.error import HTTPError, URLError

def http_get(url):
    with urllib.request.urlopen(url) as page:
        return page.getcode(), page.read().decode('utf-8')

def get_links(page_content):
    loc = page_content.find('href')
    while loc != -1:
        start = loc + len('href="')
        quote_char = page_content[start - 1]
        end = page_content.find(quote_char, start)
        yield page_content[start:end]

        loc = page_content.find('href', loc + 1)

def crawl(start_page, max_depth, on_page):
    stack = [(0, start_page)]
    while stack:
        depth, url = stack.pop()
        try:
            code, content = http_get(url)
        except (ValueError, HTTPError, URLError):
            continue

        on_page(url, code, content)
        if depth < max_depth and code == 200:
            stack.extend((depth + 1, u) for u in get_links(content))

if __name__ == '__main__':
    crawl('https://www.python.org', max_depth=1, on_page=lambda url, code, content: print(code, url))

Now say we become unsatisfied with the performance of this code and want to stop getting the same page multiple times. The standard library provides a caching mechanism that we could decorate our http_get function with.

from functools import lru_cache

# ...

@lru_cache()
def http_get(url):
    with urllib.request.urlopen(url) as page:
        return page.getcode(), page.read().decode('utf-8')
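To watch the cache work without hitting the network, here's a minimal sketch. The fake_http_get function and the calls list are stand-ins I made up for illustration; cache_info() is what the standard library provides for inspecting a function wrapped by lru_cache (which, by default, keeps the 128 most recently used results).

```python
from functools import lru_cache

calls = []  # records every time the function body actually runs

@lru_cache()
def fake_http_get(url):
    # Hypothetical stand-in for http_get: no network, just a canned response.
    calls.append(url)
    return 200, 'content of ' + url

fake_http_get('https://example.com/a')
fake_http_get('https://example.com/a')  # answered from the cache
fake_http_get('https://example.com/b')

info = fake_http_get.cache_info()
print(len(calls), info.hits, info.misses)
```

The function body runs only for the two distinct URLs; the repeated request is counted as a cache hit.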

But another option is an object that implements __call__(self). What might that look like?

# ...
class CachedHttpGet:
    def __init__(self):
        self.cache = {}

    def __call__(self, url):
        if url not in self.cache:
            with urllib.request.urlopen(url) as page:
                self.cache[url] = (page.getcode(), page.read().decode('utf-8'))
        return self.cache[url]

http_get = CachedHttpGet()
# ...
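One thing the object version makes easy is hanging extra state off of the cache. Here's a rough sketch along those lines. The fetch parameter, the hits counter, and fake_fetch are my own additions for illustration (they aren't part of the code above); passing fetch in just lets the example run without a network connection.

```python
class CountingCachedGet:
    def __init__(self, fetch):
        self.fetch = fetch  # function that does the real work on a cache miss
        self.cache = {}
        self.hits = 0       # how many calls were answered from the cache

    def __call__(self, url):
        if url in self.cache:
            self.hits += 1
        else:
            self.cache[url] = self.fetch(url)
        return self.cache[url]

def fake_fetch(url):
    # Hypothetical stand-in for a real HTTP request.
    return 200, 'content of ' + url

http_get = CountingCachedGet(fake_fetch)
http_get('https://example.com')
http_get('https://example.com')  # second call is a cache hit
print(http_get.hits, len(http_get.cache))
```

Callers still just write http_get(url), so code like crawl doesn't need to know it's talking to an object.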

While lru_cache is probably better in this contrived example, I hope this article gives you another tool for your toolbox. The official docs are here. Keep this in mind the next time you're refactoring something; it may be the right choice.

How to Attend a Tech Conference

Samuel Taylor — Mon, 23 Oct 2017 23:33:01 GMT

Conferences can be both fun and valuable, but they can also be a huge waste of time. Here are some things I've learned that have helped me get the most out of conferences as an attendee.

Consider skipping talks

This advice sounds crazy, I know; the whole point of conferences is to hear the speakers, right? I'm not convinced. The most valuable experiences I've had at conferences are centered around the people, not the talks.

If sessions are recorded, you should go to very few sessions at all. Instead, ask people about what sessions they've found particularly enlightening, then organize "lunch and learns" with your coworkers to watch those presentations. While watching them, feel free to pause them and discuss. This discussion helps everyone involved more thoroughly understand the material and how it may be applicable. Ironically, by skipping the talks, you often get more out of them.

Write in a pocket notebook

While I use my phone to keep track of lots of information, I find that at conferences nothing beats the simplicity of a physical, paper notebook. The benefits are numerous. A notebook isn't going to flash a distracting notification on your screen while you try to write down someone's email address or take notes on a talk. You'll look more put together and less rude when writing something down on paper vs. on your phone. And I find that writing by hand forces my brain to be more concise, a huge plus when you review these notes later.

Aside from notes on any sessions you do end up attending, a notebook is also useful for keeping track of memorable quotes, people's contact information, restaurant/activity recommendations, or (really) anything else.

If you want to be trendy, lots of people like Field Notes, but a trip to your dollar store may yield something similarly valuable at a lower cost.

Create a followup plan

While pen and paper are excellent tools for capturing information, I have a different system for keeping track of my todo list. Mine happens to be electronic and centered around Trello, but the ideas here can be applied to your personal system.

During the day, I gather a stack of business cards and contact information from interesting people. Then, when I have spare time during/after the event, I take pictures of this information and add them all to a single "followup" card in Trello. I make note of a relevant detail or two from each person as well. A few weeks later, I like to email these people to follow up.

If you don't have a plan, you're planning to fail.

Bring business cards

Handing someone a business card is often faster than writing down your email or having them type in your Twitter handle. I prefer to use personal business cards (rather than those from my employer) because I want people to connect with me, not my company.

I used Canva to design and print my cards. Whoever you use, buy the smallest amount you can. The per-unit price difference can make it tempting to buy 1000 cards, but you realistically aren't going to give out 1000 business cards before you become unsatisfied with some aspect of them. When you buy a smaller batch, you have the freedom to change them more regularly. The relatively small per-unit cost increase is worth the added flexibility.

Track expenses on your phone, as they're happening

Keeping track of paper receipts is tedious compared to snapping a quick photo of your receipt. If your company reimburses for travel/conference expenses, see if there's a mobile app for expense reports. By using it, you'll be less likely to forget to expense something and you'll have less work to do when you get back from the conference.

Consider using Twitter

I usually avoid social media, but I created a Twitter account solely to connect with the people I was meeting at conferences. It seems to be the preferred platform for many tech people, and it feels less stuffy than LinkedIn.

Use your calendar

Before the conference, look through the schedule and figure out which talks sound the most interesting. Put each talk's name, speaker, and location into your calendar. Even though you shouldn't be going to all of them, you can still enter the most interesting talk in each time slot. By doing this, you should be able to avoid wasting time with schedules when you could be talking to awesome people.

As an aside, this has become one of my favorite use cases for my smart watch (a Garmin FR230). Knowing where you're headed without having to pull out your phone is super convenient.

How I Hacked My University's Registration System with Python and Twilio

Samuel Taylor — Wed, 21 Jun 2017 05:00:00 GMT

Speed up your Python-based web scraping

Samuel Taylor — Sun, 07 May 2017 22:32:17 GMT

Sometimes when I'm working on a project that involves web scraping, the actual scraping starts to slow me down. If you've ever re-run a script and then sat for a few minutes while your computer re-scraped the data, you know what I'm talking about. I've found two simple and practical ways to make this process significantly faster.

For the sake of example, say we're crawling two links deep on the front page of the New York Times. A straightforward way of doing this is:

import requests
from bs4 import BeautifulSoup

def get_links(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    return {e.get('href') for e in soup.find_all('a')
            if e.get('href') and e.get('href').startswith('https')}

links = get_links('https://www.nytimes.com')

all_links = set()
for link in links:
    all_links |= get_links(link)

On my machine/internet, this took about 103 seconds. We can do better than that!

Use `multiprocessing`

Python's multiprocessing module can help speed up I/O-bound tasks like web scraping. Our case here is a good example because we don't need to scrape the links one at a time; we can scrape them in parallel. The first step here is to convert our code to use the built-in map function:

import itertools as it
# import requests
# ...
# links = get_links('https://www.nytimes.com')

links_on_pages = map(get_links, links)
all_links = set(it.chain.from_iterable(links_on_pages))

On my machine, this ran in a similar amount of time to the original example. From there, using multiprocessing is a quick change:

import multiprocessing
# import itertools as it
# ...
# links = get_links('https://www.nytimes.com')

with multiprocessing.Pool() as p:
    links_on_pages = p.map(get_links, links)
# all_links = ...

This example ran in about 25 seconds (~24% of the original time). The speed-up happens because Python spins up four worker processes[0] that go through links and run get_links on each element. You can tweak the number of processes that are spawned to get even faster wall-clock times. For example, by using 8 worker processes, the script took 16 seconds instead of 25. This won't scale infinitely, but it can be a simple and effective way to speed things up in cases where your code doesn't have to be entirely serial.
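Because scraping is mostly waiting on the network, threads are another option worth trying. To be clear, this isn't what I used above; it's a sketch using the standard library's concurrent.futures.ThreadPoolExecutor, with a made-up slow_fetch function standing in for get_links so that it runs without a network connection.

```python
import itertools as it
import time
from concurrent.futures import ThreadPoolExecutor

def slow_fetch(url):
    # Hypothetical stand-in for get_links: sleep to simulate network latency.
    time.sleep(0.1)
    return {url + '/a', url + '/b'}

links = ['https://example.com/{}'.format(i) for i in range(8)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as executor:
    # executor.map works like the built-in map, but spreads calls across threads
    links_on_pages = list(executor.map(slow_fetch, links))
elapsed = time.perf_counter() - start

all_links = set(it.chain.from_iterable(links_on_pages))
print(len(all_links), 'links in', round(elapsed, 2), 'seconds')
```

Whether threads or processes come out ahead depends on how much of the time goes to waiting versus parsing, so it's worth timing both on your own workload.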

Cache to disk

One common use case I have for scraped data is to analyze it in a Jupyter notebook. I have a habit of using the "Restart kernel and run all" option to re-run my whole notebook, but that means the scraping has to run again. I often don't want to wait a few minutes for my computer to do something it already did 10 minutes ago. In cases like this, I've found caching the results of my scraping to disk to be a useful way to avoid re-doing work.

As a first step, let's move our existing code into a function:

def get_links_2_deep(url):
    links = get_links(url)
    with multiprocessing.Pool(8) as p:
        links_on_pages = p.map(get_links, links)
    return set(it.chain.from_iterable(links_on_pages))

print(len(get_links_2_deep('https://www.nytimes.com')))

We can extend our code to cache the result of this function to disk by writing a decorator.

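import pickle

def cache_to_disk(func):
    def wrapper(*args):
        # Build a cache filename from the function name and its arguments
        cache = '.{}{}.pkl'.format(func.__name__, args).replace('/', '_')
        try:
            # Cache hit: load the previously pickled result
            with open(cache, 'rb') as f:
                return pickle.load(f)
        except IOError:
            # Cache miss: run the function and save its result for next time
            result = func(*args)
            with open(cache, 'wb') as f:
                pickle.dump(result, f)
            return result

    return wrapper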

Now let's use the decorator:

@cache_to_disk
def get_links_2_deep(url):
#    links = ...

After the first time we run this script, it's able to load the cached result, which takes around a quarter of a second. I find this useful while I'm writing and developing some analysis code, but I have to be mindful that to get the most up-to-date results, I need to delete the .pkl file that this is using as its cache. I happily take this tradeoff, and if this technique fits your use case, you should too!
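If remembering to delete the .pkl file gets old, one possible variation is to let the cache expire on its own. This is a sketch rather than something I use in the code above: the max_age_seconds parameter is my own invention, and expensive (with its runs list) is a made-up stand-in for get_links_2_deep.

```python
import os
import pickle
import time

def cache_to_disk(max_age_seconds):
    def decorator(func):
        def wrapper(*args):
            cache = '.{}{}.pkl'.format(func.__name__, args).replace('/', '_')
            # Use the cached file only if it exists and is recent enough.
            if (os.path.exists(cache)
                    and time.time() - os.path.getmtime(cache) < max_age_seconds):
                with open(cache, 'rb') as f:
                    return pickle.load(f)
            # Otherwise recompute and overwrite the cache file.
            result = func(*args)
            with open(cache, 'wb') as f:
                pickle.dump(result, f)
            return result
        return wrapper
    return decorator

runs = []  # records each time the function body actually runs

@cache_to_disk(max_age_seconds=60 * 60)  # recompute at most once an hour
def expensive(x):
    # Hypothetical stand-in for get_links_2_deep.
    runs.append(x)
    return x * 2
```

Calling expensive(21) twice within the hour runs the function body once; the second call is read from disk. Once the file is more than an hour old, the next call recomputes and overwrites it.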

0: I say four here because my computer has four cores. When no arguments are passed to the Pool() constructor, Python sets the number of processes in the pool to the result of os.cpu_count() (docs).

Income inequality in professional sports

Samuel Taylor — Sun, 30 Apr 2017 22:03:47 GMT

On his podcast Revisionist History, Malcolm Gladwell talks about the difference between "weak link" and "strong link" sports. A "weak link" sport is one in which the worst player's skill level has a large impact on the team's success. The example Gladwell gives is soccer, in which a long chain of events must go perfectly to score a point. To get the ball from one side of the field to a position in which a team can make a successful shot on goal requires a lot of dribbling and passing. Every time the ball is passed is an opportunity for the opposing team to break the chain, requiring the attacking team to start the chain from the beginning.

By contrast, basketball is a "strong link" sport [0]. In such a sport, the best player's skill level (rather than the worst) has a large impact on the team's success. A superstar in basketball can take a team to the playoffs almost entirely on his own.

If this "strong link"/"weak link" hypothesis is true and players are compensated proportionally to their contribution to the team's overall success[1], I would expect income inequality to be greater in basketball than in soccer. At this point, I went looking for data.

After some searching, I found Spotrac, which has salary data for the NFL, NBA, MLB, NHL, and MLS. After scraping the site, I had a decent dataset of salaries. First, I looked at a histogram of the salaries:

fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(16, 10))
axes[1, 2].set_visible(False)

for ndx, league in enumerate(df['League'].unique()):
    league_df = df[df.League == league]
    league_df.plot(kind='hist', ax=axes[ndx % 2, ndx % 3], title='{} salary distribution ({} players)'.format(league, league_df.shape[0]))

Standard deviation quantifies the variation in the distribution, but comparing the standard deviations across leagues doesn't make sense because the mean salary in each league is so different. By dividing the standard deviation by the mean, we get the coefficient of variation.
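To see why dividing by the mean helps, here's a toy example. The salaries are made up (two hypothetical leagues, not real data), and I'm using the standard library's statistics module to keep it self-contained.

```python
import statistics

# Made-up salaries for two hypothetical leagues (not real data).
league_a = [1_000_000, 2_000_000, 3_000_000]
league_b = [10_000_000, 20_000_000, 30_000_000]

def coefficient_of_variation(salaries):
    # Sample standard deviation divided by the mean
    return statistics.stdev(salaries) / statistics.mean(salaries)

for name, salaries in [('A', league_a), ('B', league_b)]:
    print(name, statistics.stdev(salaries), coefficient_of_variation(salaries))
```

League B's standard deviation is ten times league A's, but both have the same coefficient of variation (0.5), because B's salaries are just A's scaled up by ten. That's the sense in which the coefficient of variation lets us compare spread across leagues with very different pay scales.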

aggregates = df.groupby('League').agg([len, np.mean, np.std, np.median])['Base Salary']
cv = (aggregates['std'] / aggregates['mean'])
cv.sort_values().plot(kind='bar', title='std as percent of mean')

This tells us that MLS salaries vary most widely and NHL salaries vary the least. Digging deeper into the MLS, Bastian Schweinsteiger is making $5,400,000 in base salary, with the next highest salary being Tim Howard's $2,000,000. Removing just Schweinsteiger would leave the MLS with a CV of around 1.26, which is higher than the NBA's, but lower than the MLB's.

What have we learned? In terms of income inequality in the American professional sports leagues, soccer actually has the most income inequality, and the NBA has the second-to-least. I think the reason that my initial hypothesis is incorrect is twofold: (1) player contribution to team success is not the only factor in compensation and (2) teams don't universally believe that basketball and soccer are strong and weak link sports (respectively).

I would be curious to see how these results change if we're looking solely at starters (rather than entire rosters), but that'll have to be another question for another day.

0: Daniel Forsyth provides an interesting analysis of this claim

1: They aren't; or at least, they aren't compensated solely according to this factor. The team owner also gets value out of selling jerseys and other merchandise, which is easier to do for more famous players.

Similarities between cooking and coding

Samuel Taylor — Fri, 24 Feb 2017 06:00:00 GMT

Host all your projects on one machine with Docker

Samuel Taylor — Sat, 18 Feb 2017 14:07:29 GMT

Like many digital builders/hackers/makers, I have several projects that I want to put on the internet. I don't, however, want to have loads of servers to manage. Given that the highest-traffic project I maintain only gets around 100 unique visitors a day, it's feasible for me to host them all on the same node.

I already deploy these things in Docker containers, so I imagined I could use nginx to solve this problem. It would work, I imagined, by having a single Docker host running an nginx container that would reverse proxy requests to other containers (which would be running the aforementioned projects). When I started researching how to do this, I found a great project that made it super easy.

Jason Wilder has a great post that you should read if you want more detail, but the gist is that jwilder/nginx-proxy is an nginx container that proxies to other containers. It automatically configures nginx based on the EXPOSEd ports of the containers you're running, which is almost magical.

Start it up like this:

docker run -d -p 80:80 -v /var/run/docker.sock:/tmp/docker.sock -t jwilder/nginx-proxy

You can make this even cooler with a bit of fiddling with your DNS records. Add an A record that points to the Docker host, then add a wildcard CNAME that points to the URL you set up in the A record (see screenshot below for how I have it set up).

(where "1.2.3.4" is the IP of your Docker host)

Now, you can start up a container with a VIRTUAL_HOST environment variable that is in that subdomain:

docker run -e 'VIRTUAL_HOST=rss.project.samueltaylor.org' -tid ssaamm/rss_filter

When you navigate to this URL, your browser will ask your nginx container for the site at that URL, and that container will know to reverse proxy the request to the container you just started. Nifty!

Please reach out if you have any questions or want to get in touch! My email address is sgt at this domain.

The Last 5 Books I Read (as of March 2017)

Samuel Taylor — Tue, 10 Jan 2017 23:31:12 GMT

The Braindead Megaphone, George Saunders

This collection of nonfiction essays was enjoyable and thought-provoking. Saunders writes about the Iraq war, illegal immigration, and Dubai. He presents his experiences in a very relatable way.

The Crying of Lot 49, Thomas Pynchon

I didn't like this one. I put it down about halfway through; I just couldn't get into it. I didn't find anything about it intriguing in the slightest.

Wherever you are, be all there

Samuel Taylor — Sun, 13 Nov 2016 21:45:23 GMT

Being present (paying attention to the here and now) and purposeful (taking actions to achieve some end) helps us to live more fulfilling lives. Jim Elliot once said, "wherever you are, be all there." While it is often painted on flowery backdrops and put on the walls of suburban homes, this quote holds useful advice for productivity and professional growth.

If you're doing something, you should apply all your efforts to that thing, and avoid distractions or other tasks. Turn off your phone, close your chat program, and just work. When your brain encounters a difficult task like learning a new skill or solving a challenging problem, it tries to avoid spending the energy on that hard thing. Instead, you'll suddenly feel the urge to check your text messages or open up Twitter. Your brain might start reminding you about the fact that you need to schedule a doctor's appointment or get your car washed. You've got to fight these urges, or you'll get derailed from the task at hand and become less productive. You will be more productive if you focus on and complete your current task and then apply all your focus to (for instance) scheduling a doctor's appointment than you would if you try to do both at once. To help avoid your brain's weakling pleas for relief from the mental workout, many people find it useful to set a timer for a set period of working on a specific task (see for example the Pomodoro technique).

This quote is also great career development advice. Wherever you find yourself professionally, you should use the resources at your disposal to your maximum advantage. Find the truly great people at your company and try to learn from them. Take them to coffee and ask them how they are so effective. Ask them to critique your work. Over the course of your daily interactions, observe how they handle situations you would be uncomfortable in. Beyond the people at your company, try to get yourself assigned to projects that stretch your abilities.

Life is more rewarding when we live purposefully in the present. By focusing on one task at a time and making the most of our resources, we can become better people and find more fulfillment.

Experiments with Self-Tracking/Quantified Self

Samuel Taylor — Tue, 08 Nov 2016 17:35:48 GMT

How many times in a row do I typically sneeze?

Ever since I can remember, I've sneezed atypically. While most people sneeze once and are done, I often sneeze 3, 4, or even 8 times in a row. I was curious to see how often I sneeze a certain number of times in a row, so I kept track of every time I sneezed over the course of a week. Here are the results:

All this time, I thought it was fairly rare for me to sneeze once. In reality, I sneeze just once with decent frequency, and I sneeze twice very infrequently.

How long does it take to get to work?

For a little while, I tracked my commute time with a DO Button that logged the time and my location to a spreadsheet. After a little cleaning, the data looks like this:

Writing Better Code: Code as Communication

Samuel Taylor — Sat, 29 Oct 2016 15:17:05 GMT

After we compile our code, why do we keep it? The machine is able to interpret the compiled code and perform the function we wanted it to, so why keep the original code around? Often, the answer to this question is that we may need to modify the program in the future. Modifying a bunch of 1's and 0's in a compiled file would be very costly in terms of developer time. We keep the original code because we can maintain it more easily than the compiled code.

Code is read more often than it is written. As such, when we are writing code, we should keep future developers in mind. Our code is a tool for communicating with future developers who are attempting to maintain it. I find two worthwhile ways to make code more communicative.

Avoid unnecessary comments

Only write comments that explain something which isn't apparent after reading the code.

As an example, suppose we're writing an application that interacts with a web API to get the current weather. We might write something like this:

r = requests.get('http://weather.example.com/currentweather?zip=76706')

A bad comment for this code might look like this:

# make a GET request to the weather API for a given ZIP code
r = requests.get('http://weather.example.com/currentweather?zip=76706')

This comment is unnecessary because it's more or less restating what the code says. Any developer reading this code sans the comment will be able to surmise that it makes a GET request to some weather API for some ZIP code. Because developers can figure this out without the comment, the comment is unnecessary.

Unnecessary comments are unnecessary; who would've guessed? More interestingly, I would argue that they can be harmful. In the future, the API and our use of it may change in a number of ways. Suppose we want to look up weather with a latitude/longitude coordinate or that the developers of the API require us to make a POST request. We open our code back up, find the line where we look up the weather, and modify it:

# make a GET request to the weather API for a given ZIP code
r = requests.post('http://weather.example.com/currentweather?lat=31.5491667&lon=-97.1463889')

Oops! In our haste to update our code, we forgot to update the comment above it, which is now an incorrect description of the code below it. Our program continues to work just fine (the compiler or interpreter doesn't care about the comments), but we've left a confusing artifact for the next developer to find. Unnecessary comments are harmful because they can become out of sync with the code they are describing, creating confusion and slowing down developers.

A better comment might look like this:

# get current weather in Waco
r = requests.get('http://weather.example.com/currentweather?zip=76706')

This comment is better because it's speaking to the reasoning for the code below it. As a rule of thumb, comments should describe the why rather than the what. Because this comment describes the why rather than the what, it is still true when we modify our code as we did above--we're still getting the current weather in Waco.

Still, I don't love this comment. It's probably unnecessary, as a developer can probably figure out that we're getting the weather (though they may not know Waco's ZIP code). At the same time, we may not want future developers to have to take the time to read this code thoroughly; that's why we wrote a comment in the first place! If our intent is to create a program that is easily read and understood, I think there are often better tools than comments.

Write self-documenting code

Name and create the constructs of your programs in such a way as to be easily read and understood by future maintainers.

In our previous example, we were getting the current weather in a given place. Because we didn't want a future developer to have to read our code in order to understand it, we wrote a comment explaining what it did. Alternatively, we could have created a well-named function:

def get_current_weather(zip_code):
    return requests.get(f'http://weather.example.com/currentweather?zip={zip_code}')

Using this function might look like this:

get_current_weather(zip_code=76706)

Because the function's behavior is easy to infer from its name, maintainers don't have to read the definition of get_current_weather to understand what it does (though they can easily choose to). Further, changes to the function can be enforced by the interpreter. Suppose we modify this function to take a latitude/longitude coordinate:

def get_current_weather(lat, lon):
    return requests.get(f'http://weather.example.com/currentweather?lat={lat}&lon={lon}')

Now, if we try to run our program without updating our calls to that function, the interpreter will tell us:

TypeError: get_current_weather() got an unexpected keyword argument 'zip_code'

By creating a well-named function, we not only improved our program's readability, we also made it harder for maintainers to break our program unintentionally.

Conclusion

You should be thoughtful about the code you write because the marginal cost of being a bit more thoughtful while writing it is less than the cost of the additional time future developers will otherwise have to spend reading it.

k-Nearest Neighbors

Samuel Taylor — Thu, 27 Oct 2016 13:29:58 GMT

On 18 Oct 2016, I gave a talk at Austin ACM SIGKDD on the k-nearest neighbors algorithm. Topics included some machine learning theory (approximation vs. generalization, VC dimension), the algorithm itself, proving the algorithm's performance, and some practical concerns around choosing k.

Some other topics that I would probably include next time are similarity functions, high-dimensional spaces, and categorical data.

You can find the slides here. Note that my presentation probably won't make a ton of sense from these slides, as they were mostly aids to the words I was saying out loud. If you've got any questions, feel free to email me; I'd love to chat!

Thanks to everyone who came to watch; I appreciated hearing your feedback!

Python Puzzlers

Samuel Taylor — Thu, 29 Sep 2016 00:32:46 GMT

Default arguments

What is the output of this code?

def foo(arg=[]):
    return arg

my_list = foo()
my_list.extend('abc')

print(foo())

My first intuition was that this would output the empty list ([]). However, the output is:

['a', 'b', 'c']

Why does this happen? Well, Python evaluates default arguments when the function is defined, rather than when it's run. As a consequence, if a default argument is mutable and is mutated in one function call, future function calls will be working with the mutated argument.

Check your understanding

Now that you know a bit more about default arguments, what is the output of this code?

def bar(arg=[]):
    arg.append('a')
    return arg

bar()
print(bar())

If you answered:

['a', 'a']

then you're right!
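If sharing one list across calls isn't what you want, a common idiom is to default to None and create the list inside the function. Here's a minimal sketch of bar rewritten that way:

```python
def bar(arg=None):
    # A fresh list is created on every call that doesn't pass one in.
    if arg is None:
        arg = []
    arg.append('a')
    return arg

bar()
print(bar())
```

This version prints ['a'], because each call without an argument starts from its own empty list.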

The Last 5 Books I Read (October 2016)

Samuel Taylor — Tue, 06 Sep 2016 00:07:40 GMT

Station Eleven, Emily St. John Mandel

I picked up this book purely for entertainment, and it served that purpose well. The pacing is quick (at times excessively so), and the events are interesting. I found nothing about it revolutionary, but it was well-written. In particular, I liked the dialogue because it felt real (perhaps accentuated by the fact that I read the book aloud).

As events unfold in a few different timeframes, we see some characters and events in a new light. While many of these unveilings worked well, they sometimes felt like they were explaining too much, taking away some of the fun of piecing together the story. Following these threads of story through various timeframes is fun, but feels pointless sometimes. For instance, the author traces a paperweight's journey which I didn't find engaging.

The characterization left something to be desired. Only once did I feel like I understood a character at a non-surface level (the description of Miranda's thoughts on clothes being "armor" gives insight into her post-divorce life/feelings). Still, while I didn't end up caring very much about any of the characters in particular, I kept reading because the story was interesting.

This book seems like one that many will like, but few will choose as their favorite. That's completely fine; not every book can be a total masterpiece. Go into it looking to be entertained, and you probably will be.

So Good They Can't Ignore You, Cal Newport

I loved Deep Work, so I decided to read this one, too. The book is about creating a fulfilling career, which seems appropriate for a new college grad. His premise seems solid to me--"follow your passion" is terrible advice. I appreciate his cynicism toward lifestyle bloggers (it seems the only ones making a living are the ones selling lifestyle blogging to people who hate their jobs).

Newport tells the stories of several individuals well. As in all self-help books, these stories keep the book moving while demonstrating a point relevant to the larger topic.

My least favorite part of the book was that the author sometimes comes off as overly cocky, which is annoying at best.

Like lots of books in this genre, it got a little repetitive; he summarizes himself over and over again.

Aside from these two criticisms, though, I really liked the book. I seem to have lucked out in that I'm fascinated with computer science/software development, and the market seems to also like it. Still, the career advice seems well thought out and will be useful in the coming years.

You Can't Win, Jack Black

Originally published as a series of newspaper articles, this book is the autobiography of Jack Black, a rail-riding, jewel-thieving hobo of the late 19th and early 20th centuries. He recounts many tales including prison sentences, hobo rituals, and his most interesting crimes. The glimpse Black offers into a very specific subculture is fascinating. If you're interested in reading a collection of true, interesting tales about a life on the road, consider picking this book up.

Pastoralia, George Saunders

This collection of short stories enjoys giving its readers a look into the internal monologues of its characters. The titular story was gripping, forcing its reader to piece together the world from hints in the text.

While the book is a collection of short stories, they are all drawn from the same universe--an exaggerated version of America. I recently finished watching the series Black Mirror, and reading this book reminded me a bit of that series. Though there's less technology in this book, it still feels like the author is using a somewhat imagined world to critique our real one.

Slaughterhouse-Five, Kurt Vonnegut

This book is a unique blend of science fiction and World War 2 tale. Its central character, Billy Pilgrim, is cast about in many ways by the war. Perhaps uncoincidentally, he sort of trips through time, which makes for an interesting literary device.

This book leaves me feeling very bleak. Pilgrim adopts the belief that everything that will happen will happen, is happening, and has always been happening; we are like bugs trapped in the amber of this moment. This belief takes away hope and meaning. Without these two things, I'm not sure life is worth living.

Quotes/thoughts that I like

Samuel Taylor — Mon, 29 Aug 2016 23:52:25 GMT

There are four hard things in life. If you can do these four things, the rest comes easy:
1. Working hard
2. Doing your best
3. Telling the truth
4. Taking responsibility for your actions
My brothers and sisters, whenever you face trials of any kind, consider it nothing but joy, because you know that the testing of your faith produces endurance; and let endurance have its full effect, so that you may be mature and complete, lacking in nothing. (James 1:2-4, NRSV)
The poison from which the weaker nature perishes strengthens the strong man–and he does not call it poison (Friedrich Nietzsche, The Gay Science)
Perfection is achieved not when there is nothing more to add, but when there is nothing left to take away. (Antoine de Saint-Exupery)

Note: the attributions on these are only a best guess

Planning a trip to Europe like an engineer

Samuel Taylor — Sat, 09 Jul 2016 05:00:00 GMT

Books I read in June 2015

Samuel Taylor — Sun, 26 Jun 2016 16:04:55 GMT

This month, I read four books:

Till We Have Faces, C.S. Lewis
A Walk in the Woods, Bill Bryson
Of Mice and Men, John Steinbeck
Into the Wild, Jon Krakauer

Till We Have Faces, C.S. Lewis

I really enjoyed this one. It is a re-telling by Lewis of the myth of Psyche and Cupid. I was unfamiliar with the original myth, but appreciated this book very much.

I particularly enjoyed the clash in worldviews between a physically-oriented worldview and one which stressed the importance of supernatural beings. The differing advice offered to Orual by the Fox and the captain of the king's guard is intriguing. The relationship between reason and faith is interesting, so I enjoyed watching Orual process her thoughts on the world through the book.

The book also stands out as a well-written story. I enjoyed the pace of it, and I would highly recommend it to anyone who enjoys mythology.

A Walk in the Woods, Bill Bryson

This was a good read. Every once in a while, I would start to get a little bored by the writing about the history or biology of the Appalachian Trail, but then Bryson would say something to the effect of, "Enough science for now, back to the interesting stuff."

I enjoyed the interactions between Bryson, Katz, and others they ran into on the trail.

On an unrelated note, as of writing, Scott Jurek was about to break the FKT for the AT. Jurek is an impressive athlete, and it's awesome to see such a huge accomplishment.

Of Mice and Men, John Steinbeck

Quick read -- I think it took me an afternoon. I hadn't been spoiled for it, so I didn't know what was coming. This book was very sad, but in an enjoyable way. In a way, the ending sort of snuck up on me. I was observing the world Steinbeck set before me, and then suddenly the ending came. It felt abrupt, and it felt sad.

Into the Wild, Jon Krakauer

While I wasn't a fan of how much Krakauer seemed to revere McCandless, I enjoyed hearing about McCandless's journey.

The Last 5 Books I Read (June 2016)

Samuel Taylor — Sun, 26 Jun 2016 16:04:55 GMT

A View from the Bridge, Arthur Miller

This is a script I read for my American Literature class. It was alright. Even though it has a distant setting, the characters still feel real. They each have their own needs/desires which conflict in certain ways, which causes the drama/action. My favorite scene is one in which they're all in the apartment, and Catherine and Rodolfo start dancing. The subtext in that scene is very interesting.

People talk about the ending being shocking, and I don't know that I agree. There's enough foreshadowing and character development that it isn't a surprise.

Anyway, it's a fine script. I imagine it's better as an actual staged performance, and I would probably go see it if I knew it was being performed.

Questions for All your Answers, Roger E. Olson

This would have been a good book to give a 13 year old version of myself. At that time, I was struggling with how anti-intellectual the church seemed. Fortunately, for a few reasons I came to find and appreciate the intellectual tradition of Christianity. As such, I don't think I'm necessarily the target audience for the book. Reading it is still enjoyable, but it's not earth-shattering for me.

More generally, Olson seems to be caught in the awkward task of speaking intellectually and theologically to an audience hostile (or apathetic, at best) toward theological thinking. His writing feels at times pulled in different directions. In one corner, he's trying to explain complicated theological lines of thought. In another corner, he's trying to keep it simple enough that people without much theological background can understand it.

That said, there are still parts of the book that I liked. The third chapter was particularly good; it talks about cultural sensitivity and the Trinity in a very practical way that still respects the ideas.

Room, Emma Donoghue

I watched the movie adaptation on a plane and enjoyed it thoroughly. The book is written from the perspective of a child, which seems like an annoying premise. At times the perspective is a bit annoying, but overall I was surprised by how well it worked.

All in all, I enjoyed it a good deal. Because of the narrator's limited knowledge, I was constantly intrigued and kept reading to find out more. While I didn't, you could probably read it in one sitting as it's relatively short.

Ready Player One, Ernest Cline

I didn't like the main character at the beginning of the book, which was not enjoyable. I suppose you could argue that this is Cline setting up for some character development, but that isn't a satisfying answer. Ideally, there would be a way to establish that the character is haughty without making readers hate him. Luckily, Parzival/Wade became a more interesting, less prideful character as I got further in the book.

At times, something would happen in the book that felt irrelevant enough to the plot that I would be drawn out of the action and start wondering how he was going to use that later in the book (he always did end up using or referencing it). I understand that foreshadowing is neat, but it felt a little obvious.

Some of the hacking felt unrealistic. Even within the context of a SciFi universe, it doesn't seem like these sensitive systems should be so easily hackable.

Despite these criticisms, this was a very enjoyable read. Admittedly, I'm a nerd, so I don't know how the reference-laden prose would feel to someone who's not a fan of nerdy books and games. The book never felt like it was dragging, which made reading it a good time.

Oblivion: Stories, David Foster Wallace

This book is a collection of eight short stories. Of these, I found three to be wonderful, four or five to be good, and the other one or two weren't my cup of tea.

Incarnations of Burned Children is short, gripping, and emotional. Wallace does an excellent job of making the story seem real and important.

Another Pioneer is a story about an ancient society told through two or three retellings. The levels of redirection allow Wallace to explore some narrative branching patterns I found fascinating.

Good Old Neon is a story about a man who looks introspectively and finds within himself nothingness. His attempts at dealing with this discovery are interesting.

I noticed two common patterns among all the stories. First, I often started reading, got a few pages in, started to think to myself "This one's kinda boring...", and then suddenly Wallace would introduce something that hooked me.

Second, most (maybe all) of the stories end with a level of ambiguity. Because of this, the stories left me thinking about them on and off in the days after reading them.

Reading this book was well worth my time.

The Last 5 Books I Read (February 2016)

Samuel Taylor — Sat, 20 Feb 2016 20:13:01 GMT

Redirect, Timothy Wilson

A book from a college professor about his (and others') research in "story editing" techniques that target the narratives we tell ourselves. I particularly appreciated the thoughts on happiness, as those are the most widely applicable. Because I'm not a parent, I couldn't quite get into the stuff about parenting as much. The discussions of various societal issues and how they might be addressed were also interesting, but not particularly applicable to my life.

The Martian, Andy Weir

I liked this book a lot. The attention to detail made it feel very real and interesting. The point of view from which it is written adds a lot of tension and excitement. As an aside, I enjoyed the book much more than the movie because the book is more detailed and the writing style is enjoyable.

Deep Work, Cal Newport

The first 30% of the book is an argument for why deep work is necessary in today's economy. That 30% is good, but I was already convinced of the value of what he calls "deep work," so it wasn't the most interesting thing I've read.

He has many good suggestions. One I've liked so far is to keep a sort of deep work scoreboard, where you log how many hours of deep work you get in each day. Then, in your weekly review, you can use that data to keep yourself accountable. This suggestion comes out of the idea of focusing on lead measurements, which is itself a good idea. If you have a goal (e.g. get more personal projects done), the obvious way to measure your progress on that goal (e.g. how many projects you've finished) is often a lag measure. That is, by the time you see the improvement on the lag measure, you've already made the improvements in your personal processes which allowed for it. By contrast, lead measurements help you quantify the change in your personal processes that will ultimately enable you to achieve the goal you're working toward.

Coders at Work, Peter Seibel

This is a book of interviews with notable programmers. While some of the stories seem dated (most of the people in the book got their start decades ago), it's interesting to read others' reflections on programming.

A detail I like is that several interviewees mention how important reading other people's code is. I didn't really read much code until I did an internship at a software company, and then I almost felt like I was drowning in it. One company I worked for practiced "self-documenting code," so the way I learned what certain pieces of code did was simply to read them. Once I got used to it, I liked this system. That internship was the first time I really understood the value of readable code.

Douglas Crockford talks about code quality in an interesting way. I like the idea of taking every seventh sprint to focus on improving the codebase. One of my employers did not focus on code quality, and their company suffered for it. At a certain point it becomes difficult to retain developers when the codebase makes it harder to develop and easier to introduce bugs. I'm sure they didn't intend to get to that point, which demonstrates the importance of understanding quality as an ongoing process.

I enjoyed Joshua Bloch's thought that some coding is more similar to writing prose than it is to mathematics. The way he talks about creating good APIs and readable code inspires me to be a better programmer and designer.

The Two Towers, J.R.R. Tolkien

Another interesting book in the series. The sense of adventure and the grandness of the world make reading this book enjoyable.

I want to be Treebeard when I grow up. He's such an interesting character, and I think his orientation to time is very interesting. A detail I like in particular is that his name in the ent language is very long, as ents believe that names should tell something of the thing's story.

At times, I get lost in all the detail and have difficulty keeping my mental picture straight. Perhaps I would gain more out of the reading if I were to hold on to more details, but that feels like more work than I want to put into reading this book. I don't mean to be lazy; I would just rather spend my mental energy elsewhere. Though I lack a lot of the detail (the geography of the area, for instance, is completely lost on me), I feel like I understand the story well still, and I still enjoy reading it.

Teaching Sign Language with Leap Motion + Machine Learning

Samuel Taylor — Sun, 14 Feb 2016 06:00:00 GMT

Useful Python language features for interviews

Samuel Taylor — Sat, 17 Oct 2015 18:03:09 GMT

`collections.namedtuple`

namedtuple can make your code a lot more readable. In an interview, that's helpful for a few reasons. First, it can help you demonstrate a good understanding of some of Python's standard libraries. Second, it helps you show off that you place importance on writing readable code. Third, it makes writing your code easier. If you're passing around tuples, it can be easy to forget what the object at each index into the tuple is. Using a namedtuple can help you avoid that.

Consider the case where you need to represent colors. You could choose to do so with a 3-tuple of the form (i, j, k) (where i, j, and k are integers on the range 0-255). This representation seems intuitive and natural enough. i could be the value for red, j for green, and k for blue. A problem with this approach is that you may forget which of the three numbers represents which primary color of light. Using a namedtuple could help with this:

from collections import namedtuple
Color = namedtuple('Color', ['red', 'green', 'blue'])

What does this change? Well, building a Color is almost the same as building that tuple you were previously building. Instead of doing (i, j, k), you'll now write Color(i, j, k). This is perhaps a little more readable, and it adds some more semantic meaning to your code. We're no longer just building a tuple; we're building a Color.

The real win for namedtuple is in access to its elements. Before, to get the red value for a color c, we would use brackets: c[0]. By comparison, if we have a Color called c, we could use a more friendly dot syntax: c.red. In my experience, while not having to remember the index of the red element is nice, the real win is in how much more readable c.red is in contrast to c[0].
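
To make the comparison concrete, here is a small sketch of both access styles side by side. The color values and variable names are made up for illustration.

```python
from collections import namedtuple

Color = namedtuple('Color', ['red', 'green', 'blue'])

# Plain tuple: you have to remember that index 0 means red.
plain = (255, 128, 0)
red_from_tuple = plain[0]

# namedtuple: the field name documents itself.
c = Color(255, 128, 0)
red_from_color = c.red

# A namedtuple is still a tuple, so indexing and unpacking keep working.
r, g, b = c
```

Because a namedtuple is a tuple subclass, switching to it in the middle of an interview shouldn't break code you've already written against plain tuples.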

`collections.defaultdict` and `collections.Counter`

Suppose your interviewer asks you to find the most common string in a list of strings. We can solve this problem using a defaultdict (let's call it d). We could loop through the list, incrementing d[elem] for each element. Then, we just find the one we saw most. The implementation would look like this:

def most_common_dd(lst):
    d = defaultdict(int)
    for e in lst:
        d[e] += 1

    return max(d.items(), key=lambda t: t[1])

Apparently, users and maintainers of Python saw this pattern enough that they decided to create Counter. Counter lets us write a much more succinct version of this function, because Counter encapsulates the process of counting the number of occurrences of elements in an iterable. Implementing this functionality with a Counter object would look like this:

def most_common_ctr(lst):
    return Counter(lst).most_common(1)[0]

These both have the same result:

from collections import Counter, defaultdict

strings = ['bear', 'down', 'you', 'bears', 'of', 'old', 'baylor', 'u', "we're",
        'all', 'for', 'you', "we're", 'gonna', 'show', 'dear', 'old', 'baylor',
        'spirit', 'through', 'and', 'through', 'come', 'on', 'and', 'fight',
        'them', 'with', 'all', 'your', 'might', 'you', 'bruins', 'bold', 'and',
        'win', 'all', 'our', 'victories', 'for', 'the', 'green', 'and', 'gold']

'''
definitions for most_common_ctr and most_common_dd
'''

assert most_common_dd(strings) == most_common_ctr(strings)

But the version using Counter is more concise.
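
One detail worth knowing before you use either function: both return an (element, count) pair rather than just the element. Here is a self-contained sketch with a made-up word list; the function bodies follow the ones above, written for Python 3.

```python
from collections import Counter, defaultdict


def most_common_dd(lst):
    # Count by hand: missing keys start at int() == 0.
    d = defaultdict(int)
    for e in lst:
        d[e] += 1
    return max(d.items(), key=lambda t: t[1])


def most_common_ctr(lst):
    # Counter does the counting; most_common(1) is a one-element list.
    return Counter(lst).most_common(1)[0]


words = ['sic', 'em', 'bears', 'sic', 'em', 'sic']  # made-up sample
word, count = most_common_ctr(words)
print(word, count)  # sic 3
```

If two elements tie for the top spot, which one comes back is a detail I wouldn't rely on; mention that to your interviewer if it matters for the problem.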

Comprehensions

I love list comprehensions. They can make code much more concise and readable. Consider a problem where we have a start point and an end point on a grid:

|S|_|_|_|
|_|_|_|_|
|_|_|_|_|
|_|_|_|E|

Let's further say that from a given cell, you can travel up, down, left or right into another cell (but not diagonally). We may want to do a breadth-first search to find the minimum cost to get from the start to the end. At some point, we'll need to push the neighbors of the current cell onto the queue we're using for the BFS. This could look something like this:

for neigh in neighbors(cell):
    # validate neigh
    queue.append(neigh)

How should neighbors(cell) work? Well, we could use a double for loop to generate the neighbors:

def neighbors(cell):
    for i in range(-1, 2):
        for j in range(-1, 2):
            if i == 0 and j == 0 or abs(i) + abs(j) > 1:
                continue
            yield (cell[0] + i, cell[1] + j)

This works, but it's ugly. Instead, we could use a list comprehension:

DIRS = [(0, 1), (0, -1), (1, 0), (-1, 0)]
def neighbors(cell):
    return [(cell[0] + d[0], cell[1] + d[1]) for d in DIRS]

We're also probably going to want to keep track of which cells we've already visited (so we don't try to go back through them). We could create a matrix of bools the same size as our original grid (let's call it visited) and set visited[r][c] when we visit the cell located at row r and column c. But how should we initialize this matrix? We could do something like this:

visited = []
for i in range(n):
    visited.append([])
    for j in range(n):
        visited[i].append(False)

But list comprehensions can make this much more concise:

visited = [[False for _ in range(n)] for _ in range(n)]
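
Putting the pieces together, a minimal BFS over the grid might look like the sketch below. It assumes every move costs 1 and that the grid has no walls. The name `bfs_steps` is made up for this example, and I use `collections.deque` so that popping from the front of the queue is cheap.

```python
from collections import deque

DIRS = [(0, 1), (0, -1), (1, 0), (-1, 0)]


def neighbors(cell):
    return [(cell[0] + d[0], cell[1] + d[1]) for d in DIRS]


def bfs_steps(n, start, end):
    """Return the fewest moves from start to end on an open n x n grid."""
    visited = [[False for _ in range(n)] for _ in range(n)]
    visited[start[0]][start[1]] = True
    queue = deque([(start, 0)])  # (cell, moves taken to reach it)

    while queue:
        cell, steps = queue.popleft()
        if cell == end:
            return steps
        for r, c in neighbors(cell):
            # Validate: stay on the grid and skip cells we've already seen.
            if 0 <= r < n and 0 <= c < n and not visited[r][c]:
                visited[r][c] = True
                queue.append(((r, c), steps + 1))
    return None  # end is unreachable


print(bfs_steps(4, (0, 0), (3, 3)))  # 6
```

Marking a cell as visited when it is pushed (rather than when it is popped) keeps the same cell from being queued twice.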

The possibilities with list comprehensions are just about endless, so I'll leave it at that!

Hackathon report: TAMUHack 2015

Samuel Taylor — Sun, 11 Oct 2015 15:39:45 GMT

Idea

Two weeks ago, I went to HackTX and won a Leap Motion. While thinking about what things we could build with skeletal tracking and gesture recognition, I thought it would be cool to build a language learning tool (like Rosetta Stone) for American Sign Language. My friend Matt also thought it sounded cool, so we decided to build something like that at TAMUHack.

Environment

TAMUHack was fun. The venue was called "The Zone", which is a big room in A&M's stadium. All 300 of us were in this massive room, along eight big tables. Being in the same room as everyone else was really cool; you felt like you were all a part of something. I've been to other hackathons where I've not been able to find a seat in the main areas; being separated from the rest of the teams is not fun. The organizers of TAMUHack found a great solution to that problem--put everyone together!

Project evolution

We started to build something that would simply transcribe signs of the ASL alphabet as a user signed them above the Leap Motion. By around 3:00am, we had that more or less working. Playing around with it, we knew it definitely wasn't perfect, but it showed promise.

The Leap Motion is not particularly well-suited to sign language recognition. In our research during the hackathon, we found a research paper that said the Leap Motion in its current state isn't a good choice for recognizing Australian Sign Language.

As an initial run at solving this, we decided to implement some simple Markov chain analysis. The idea was that if certain letters commonly precede others, we should be able to figure out that a person signing "q" will probably sign "u" next. That idea didn't end up helping us out all that much; we tore it out later. After we input some more training data, the recognition was good enough that we could work with it.
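
A rough sketch of that Markov chain idea, written for illustration (it is not the hackathon code). The tiny corpus and the names `build_bigrams` and `likely_next` are made up.

```python
from collections import Counter, defaultdict


def build_bigrams(words):
    """Count how often each letter follows each other letter."""
    following = defaultdict(Counter)
    for word in words:
        for prev, nxt in zip(word, word[1:]):
            following[prev][nxt] += 1
    return following


def likely_next(following, prev):
    """Return the letter most often seen after prev, or None if unseen."""
    if prev not in following:
        return None
    return following[prev].most_common(1)[0][0]


corpus = ['quick', 'quiet', 'queen', 'squid']  # made-up training words
model = build_bigrams(corpus)
print(likely_next(model, 'q'))  # u
```

In a recognizer, counts like these could be used to nudge an uncertain guess toward letters that commonly follow the previous one.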

At that point, we had some time left and felt like we could keep going to make something cooler than what we had. We decided to make the language learning tool we'd originally planned on making. By 8:00am, we had a basic version working. The app would show you a picture of a sign and ask you to replicate it. Once you had, it would give you 100 points and pick another sign for you to make. After 30 seconds, you could enter your score on our leaderboard. We decided to make our project into a game because it seemed like a fun way to demo the tech we had built.

We ended up finishing about an hour before projects were due. We were so happy to have built something so cool and fun to make in such a short amount of time.

Presentations

We set up our area with our laptops and the two external monitors we brought. We each ran a copy of our app on an external monitor and had the Leap Motion visualizer on our laptop screens. This ended up being super useful; we could show people what the Leap Motion was seeing in real time.

Getting to show off our project to judges and other hackers was super fun. People thought it was super cool and were excited to play around with it.

We got into the top six and were asked to present at closing ceremonies. Awesome! It was a little rushed because things were running late, but I still enjoyed getting to talk about our hack in front of everyone.

Apparently the judges also thought it was cool, because Matt and I won second place overall!

Thanks

TAMUHack was super fun. Huge thanks to the organizers, volunteers, and judges.

Political implications of BitTorrent

Samuel Taylor — Thu, 08 Oct 2015 01:42:24 GMT

BitTorrent is an inherently political technology which embodies decentralized political order. Additionally, it broadens the definition of art. Despite the negative side effects of the technology, BitTorrent is worth pursuing.

What does it mean to say that a technology is political? Winner outlines two possibilities:

I…offer…two ways in which artifacts can contain political properties. First are instances in which the invention, design, or arrangement of a specific technical device or system becomes a way of settling an issue in a particular community. Second are cases of what can be called inherently political technologies, man-made systems that appear to require, or to be strongly compatible with, particular kinds of political relationships.…By “politics,” I mean arrangements of power and authority in human associations as well as the activities that take place within those arrangements.

What does it mean to say that a technology embodies decentralized political order? Applying Winner’s thought, such a technology would either appear to require or be strongly compatible with a kind of arrangement of power and authority with regard to human associations and activities within such associations. Inasmuch as the information age has made information power, to say that a technology is strongly compatible with decentralized political relationships means that the technology decentralizes control of information. We would expect such technologies to be dangerous to centralized power structures.

BitTorrent decentralizes control of information, and thereby embodies decentralized political order. Each person in the swarm has the power to download the information, and each person in the swarm also has the responsibility to upload the information to their peers. In fact, with the usage of distributed hash table technology, even a centralized tracker is unnecessary; peers can coordinate file transfer themselves without the need for a tracker (BitTorrent.org).

Decentralization of information control and access is the natural end of BitTorrent. Imagine that BitTorrent exists in some centralized power structure where only one torrent tracker exists. In order for some authoritarian, centralized power to keep said power, it would have to be able to keep control over the way that BitTorrent is used. And for a time, that central power could control BitTorrent. But starting a new tracker is so easy that controlling such action over a long period of time would be almost impossible. To prevent people from starting new trackers and attracting users away from the Official Tracker™ would require coercion on a scale that is hard to imagine, let alone implement.

Furthermore, BitTorrent is dangerous to centralized power structures. Look at the example of music. Record labels are a powerful, centralized entity in the realm of music. If BitTorrent is a threat to centralized power, then we should expect to see record labels seeking to control BitTorrent. The Recording Industry Association of America (an organization made of record labels/distributors) seeks to exercise control over the ways in which people use BitTorrent. The RIAA targets both users and trackers, attempting to get such high punishments as to scare people away from using BitTorrent for sharing music.

Consider another example of a centralized power structure: the People’s Republic of China. China censors much of the internet and has started to censor BitTorrent websites (Van der Sar). These two examples of centralized powers fighting to control BitTorrent provide a compelling argument that there is something about BitTorrent which encourages decentralization of power.

Another way that BitTorrent decentralizes power is in the way that people discover content. Again, consider the example of music. In the past, certain people have had much more power over the sharing of music than others; radio DJ’s, journalists, and the like were able to exert power over the music people listen to. Before the advent of peer-to-peer technology, a non-DJ’s ability to share music with others was limited to those physically/geographically nearby. Peer-to-peer technology allows for music sharing to occur through the internet; a user can now share her favorite music with anyone (Franchini). BitTorrent enables users to share not only the knowledge of some song or artist, but the very music itself.

Perhaps the most beneficial societal contribution offered by BitTorrent is the decentralization of content distribution. It allows creation and distribution of art to happen without the support of powerful backers. Before BitTorrent, distributing a television show required some power over the broadcasters. Even in the internet age, the bandwidth costs of distributing a “television” show can be very expensive. The unique opportunity afforded by BitTorrent is to share the load of distributing the content among a large “swarm” of peers. Because it drives distribution costs down, BitTorrent liberates content creators from distributors; they can distribute their own content.

It also changes the ways in which users support their favorite artists. In a world overwhelmed with file sharing, supporting an artist has become less about buying the artist’s physical records/CD’s and more about buying band merchandise and tickets to concerts (Franchini). This change in artists’ revenue models from being primarily based on selling albums to being based on selling concert tickets and merchandise is also recognized by the artists themselves. Winston Marshall, the guitarist for Mumford & Sons, says that “Music is changing.…We look at our albums as…adverts for our live shows” (Stern).

Combining these effects, BitTorrent decreases the distance between content creators and content consumers, thereby encouraging more people to become content creators. Consumers no longer must go through a middle man to access their favorite creators’ work. They also take an active role in the re-creation of said work. As a result, consumers develop more direct relationships with the creators of content they like. Finally, because distribution costs are lower, consumers are more likely to become creators, and they will not have to seek the help of powerful distribution/broadcasting middle men. BitTorrent removes the necessity for a powerful middle man.

A counter-argument to the claim that BitTorrent embodies decentralized power is that certain players in the BitTorrent ecosystem possess more power than others. The Pirate Bay, for instance, is a huge tracker which has lots of power. However, the existence of powerful players within a system does not imply a lack of decentralization of power. Users can still choose whether to use the mega-websites or the smaller ones. An abundance of torrent websites still exist and have power. This means that even though some are more powerful than others, power is largely decentralized in the BitTorrent ecosystem.

This decentralization of power is a good thing. Distributed power is inherently good in a society which values not being dominated by another person. If power is centralized, then the entity with the power is able to dominate whomever they so desire. American society values not being dominated, so this decentralization of power brought about by BitTorrent is good for society.

Another effect of BitTorrent (distinct from the decentralization of power) is that it changes what the word “art” means. Rodriguez-Ferrandiz discusses the effect that digital copies have on art as a whole in an abstract sense. In essence, the “original” work becomes less important. Possessing the original version of a song does not matter all that much when every copy of a song is perfect. Because BitTorrent makes the reproduction of art extremely inexpensive and completely accurate, having the original is not significantly better than having a copy for most individuals.

Rodriguez-Ferrandiz specifically writes of photography, noting that it has caused “the focus of interest” to switch “from the work as a singularity that physically retains the creator’s touch to a vision of the work as a multipliable and liberated piece which removes distinctions between original and copy”. An earlier author, Benjamin, who is cited by Rodriguez-Ferrandiz, refers to the distinction between original and copy as the “aura.” What of digital art, then, for which there is no difference between originals and copies? Rodriguez-Ferrandiz argues that a “paradoxical aura” exists for such art. Because the art is not defined by the way that it is represented in binary on a hard disk, it “transcends physical form” and becomes “immortal.” Though BitTorrent does not qualitatively change this trend or contribute to it in a novel way, it does offer a quantitatively larger realization of this immortality by making the reproduction of digital art far easier.

This change in the meaning of art is a good thing. It broadens art to include digital arts, giving artists a new medium for creativity. In American society, creativity is valued, so this change is good for society.

Of course, the technology is not without its drawbacks. Nothing inherent to the protocol stops its use from including mass distribution of child pornography or other unquestionably bad things. The entire idea of decentralized control is antithetical to the censorship of BitTorrent as a medium (whether or not the censorship is of things society generally agrees are bad). The government will not be able to stop the spread of child pornography through BitTorrent.

This inability to censor terrible things is not a reason to stop usage of BitTorrent technologies. First, this problem is not unique to BitTorrent. Many technologies make spreading morally repulsive content much easier (e.g. the internet, books, compact disks, pencils). Second, BitTorrent requires a large swarm of users for effectiveness. To say that certain things are generally agreed upon to be unacceptable in a society implies that the number of people who will participate in such behavior is low. Thus, BitTorrent is a bad fit for child pornographers and sharers of other repulsive content.

BitTorrent embodies decentralized political order. It broadens the definition of art. Because these are both good things, BitTorrent is a technology that is worth pursuing despite its drawbacks.

Works cited

BitTorrent.org. “BEP 5: DHT Protocol.”

Van der Sar, Ernesto. “China Hijacks Popular BitTorrent Sites.” TorrentFreak. 8 Nov 2008.

Rodriguez-Ferrandiz, Raul. “Benjamin, BitTorrent, bootlegs: auratic piracy cultures?.” International journal of communication.

Stern, Marlow. “Mumford & Sons Diss Jay Z’s Tidal.” The Daily Beast. 12 April 2015.

Winner, Langdon. “Do Artifacts Have Politics?” Daedalus, Vol. 109, No. 1, Modern Technology: Problem or Opportunity? (Winter, 1980), pp. 121-136.

Hackathon report: HackTX 2015

Samuel Taylor — Tue, 29 Sep 2015 01:36:29 GMT

Idea

My team and I started thinking of ideas on the drive down to Austin. We ended up deciding to do something that would automate the scheduling of meetings.

Project

We started writing a Python/Flask app hosted on Azure. Getting the project deployed initially was easy, as was setting up continuous integration. Last year, we wrote a PHP web app, and none of us were able to use our computers to test. In essence, the production server was also our development server. Using Flask was awesome because it comes with a development server.

Wes started to work on the UI, I worked on figuring out a way to read people's emails with Context.IO, and Evan worked on user account management. He started to use Flask-User, but then we couldn't get it to work on the Azure configuration we had set up. It was late at night, I wasn't sure what other library to use, and Wes was starting to hate Python, so we made a hard decision and switched everything to PHP.

At this point, we had to set up DeployBot to do continuous integration, and we went back to the issue of not having servers to do development on. As a result, the git log got pretty terrible.

I got a script to check a user's email inbox for emails that looked like someone trying to schedule a meeting and set it up to run on a cron job while Evan worked on our SendGrid integration.

The actual scheduling logic came into play much later in the day than we would have hoped. Luckily, it didn't turn out to be too challenging, so we were able to get it implemented and finally create our app's core functionality.

In the end, our product was definitely more hacky than any of us would have liked, but it worked well enough to demonstrate.

Presentation

Once again, presentations were "science fair" style this year, which was great. Several judges came around and asked about our project. Our pitch was something like this:

I've been doing job hunting lately, which involves a good amount of emailing back and forth to coordinate interviews. This process is a tedious chore; sounds like a job for computers! We built Schedule Ninja, an awesome computerized ninja that slices and dices your meetings so you don't have to.

Users log in with their Google account, which we use to pull in their availability through Google Calendar. We read their email in order to find messages that look like someone trying to set up a meeting. From those emails, we generate a request on the user's dashboard that they can either accept or deny. If they accept the request, Schedule Ninja emails the requester back with the user's availability and asks them to click a link to confirm their meeting time.

Schedule Ninja can also be used to request a meeting with someone else. The user types in an email address, and we detect whether that person is on our service. If they are, we are able to avoid email altogether, compare the two people's schedules, and set up a meeting for them automatically.

We got some great feedback from several judges who wanted to sign up for the service immediately. That felt very validating; we had built something users actually wanted! Unfortunately, we didn't place overall, but we did end up winning sponsor prizes from Microsoft and Indeed.

Conclusion

We had a lot of fun and built a useful product in 24 hours. Big thanks to the HackTX organizers, Context.IO, Square, Microsoft, and Indeed for all their feedback and help. If you've not gone to a hackathon, you should definitely sign up! They're super fun!

If you want to get in touch for any reason, I can be reached at sgt@samueltaylor.org. Thanks!

The 10 Best Ingredients for Cheap Cooking

Samuel Taylor — Sat, 12 Sep 2015 15:51:54 GMT

The ten most frequently-used ingredients on Budget Bytes are:

Salt
Garlic
Olive oil
Eggs
Brown sugar
Oregano
Water
Cumin
Yellow onion
Pepper

This information was gathered using a scraper I wrote with Python 3, BeautifulSoup, and Requests.
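The counting step behind a list like this can be sketched with the standard library alone. This is not the scraper itself: the `recipes` data and the `most_common_ingredients` name are made up for illustration, and the real scraper would first have pulled each recipe's ingredient list out of the site's HTML with Requests and BeautifulSoup.

```python
from collections import Counter

def most_common_ingredients(recipes, n=10):
    """Return the n ingredients that appear in the most recipes."""
    counts = Counter()
    for ingredients in recipes:
        # Normalize and de-duplicate so each recipe counts an ingredient once.
        counts.update({name.strip().lower() for name in ingredients})
    return [name for name, _ in counts.most_common(n)]

# Hypothetical data standing in for scraped ingredient lists.
recipes = [
    ["Salt", "Garlic", "Olive oil"],
    ["salt", "Eggs", "Garlic"],
    ["Salt", "Cumin"],
]
print(most_common_ingredients(recipes, n=2))  # ['salt', 'garlic']
```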

Notes: Boston Python User Group, Lightning Talks, 22 June 2015

Samuel Taylor — Sat, 27 Jun 2015 20:44:53 GMT

On 22 June 2015, the Boston Python User Group had a night of seven lightning talks. These are notes I took for personal use; they're not a perfect re-telling of what each talk was about (or even what each talk was called).

#1 Python for making connections in groups

Speaker: John Hess

John and a friend distant from him in his social graph each ended up being stood up by friends at the same bar. They decided to sit down and solve the world's problems. They ended up enjoying their time, so John wanted to find a way to automate this sort of process.

The idea is something like this:

A group of people sign up on Maven
The service selects a group of about 4-6 people and texts them to see if they're available
As people accept or decline invitations to the event, Maven will text more people from the group to get them in on the event
Maven then puts everyone into a group message so they can organize the event

John found that:

Building stuff is easy in Python
Python is a Swiss army knife, but it can't do everything (for example, mobile development)
While building stuff is easy, building stuff that is user friendly is really hard
John kept iterating, putting the product in front of friends, getting feedback, and trying new things

#2 Django cloud management

Speaker: Robert Paul Chase

I was semi-lost on this one. The project was related to genetics somehow, and I know nothing about computational genetics.

He built a cloud management platform that lets biologists and researchers (read: not developers) easily spin up nodes, install necessary software, run their code, and kill their cluster when they're done with it.

#3 `.format()`ing without tears

Speaker: Richard Landau

The standard str.format() method in Python will throw a KeyError if a name isn't found in the dictionary. Rick made his own function to avoid that problem. Here's how it works:

Uses a regex to get the names both with and without braces (e.g. "{foo}" and "foo")
Zips those two arrays together (to get ('foo', '{foo}'), ('bar', '{bar}'), ('baz', '{baz}'))
Constructs a dictionary from that array (the format of the tuples will work with dict())
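Those steps can be sketched roughly as follows. This is a reconstruction from my notes, not Rick's actual code: the name `safe_format`, the regex, and the final step of overriding the defaults with real values are assumptions, and it only handles simple named fields like {foo}.

```python
import re

def safe_format(template, values):
    """Format template, leaving unknown {names} untouched."""
    # Each match is a (braced, bare) pair, e.g. ('{foo}', 'foo').
    matches = re.findall(r'(\{(\w+)\})', template)
    braced = [m[0] for m in matches]
    bare = [m[1] for m in matches]
    # Default every name to its own braced form: {'foo': '{foo}', ...}
    params = dict(zip(bare, braced))
    # Real values override the defaults.
    params.update(values)
    return template.format(**params)

print(safe_format('{foo} {bar} {baz}',
                  {'foo': 'this is foo', 'bar': 'this is bar'}))
# this is foo this is bar {baz}
```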

At first, it seemed to me like another way to implement this behavior would be to provide .format() with a dictionary that, instead of throwing a KeyError when encountering an unknown key, would return a modified version of the key which was asked for. I tried to do that, and it turns out that doesn't work:

class FancyDict(object):
    def __init__(self, dictionary):
        self.__dictionary = dictionary

    def __getitem__(self, key):
        try:
            return self.__dictionary[key]
        except KeyError:
            return '{' + key + '}'

    def keys(self):
        return self.__dictionary.keys()

if __name__ == '__main__':
    params = { 'foo': 'this is foo', 'bar': 'this is bar', 'baz': 'this is baz' }
    print('{foo} {bar} {baz}'.format(**params))
    # this is foo this is bar this is baz

    params = { 'foo': 'this is foo', 'bar': 'this is bar' }
    print('{foo} {bar} {baz}'.format(**params))
    # KeyError: 'baz'

    params = FancyDict({ 'foo': 'this is foo', 'bar': 'this is bar' })
    print('{foo} {bar} {baz}'.format(**params))
    # KeyError: 'baz'

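The ** unpacking asks the object for its keys() and copies those keys and their values into a plain dict before str.format() ever sees it. FancyDict's __getitem__ fallback is therefore never consulted for a missing name, and str.format() throws a KeyError if any of the strings in curly braces are absent.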
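Python 3.2 and later offer str.format_map(), which uses the mapping as given rather than unpacking it with **, so a fallback for missing keys does get a chance to run. A minimal sketch, where the class name `KeepMissing` is my own:

```python
class KeepMissing(dict):
    """A dict that returns '{key}' for any key it does not contain."""
    def __missing__(self, key):
        return '{' + key + '}'

params = KeepMissing(foo='this is foo', bar='this is bar')
print('{foo} {bar} {baz}'.format_map(params))
# this is foo this is bar {baz}
```

Subclassing dict and defining __missing__ keeps normal lookups untouched and only changes what happens for absent keys.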

#4 Test all the data

Speaker: Eric J Ma

Testing data is important because you have some assumptions about it that may not always be correct
He talked some more about how to do that using pytest
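As a rough illustration of what testing assumptions about data can look like: pytest collects functions named test_* and runs their plain assert statements. The records and checks below are made up for illustration, not taken from the talk.

```python
# Hypothetical records; in practice these would be loaded from a file.
RECORDS = [
    {"sample_id": "a1", "temperature_c": 21.5},
    {"sample_id": "a2", "temperature_c": 19.0},
]

def test_sample_ids_are_unique():
    ids = [record["sample_id"] for record in RECORDS]
    assert len(ids) == len(set(ids))

def test_temperatures_are_plausible():
    for record in RECORDS:
        assert -50 <= record["temperature_c"] <= 60

if __name__ == "__main__":
    # pytest would discover these on its own; calling them directly also works.
    test_sample_ids_are_unique()
    test_temperatures_are_plausible()
    print("all data checks passed")
```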

#5 Visualizing Yeast ChIP-Seq data

Speaker: Luis Soares

I was completely out of my league on the domain of this one, which was something related to biology.

It looked like a neat web-based visualization project.

#6 Payment reform

Speaker: James Santucci

I wasn't super familiar with the domain (statistics).

The big takeaway was that how we measure value affects how much value we observe. I'm not sure what that means.

#7 Hypothesis: property-based testing

Speaker: Matt Bachmann

Hypothesis is a Python library inspired by Haskell's QuickCheck
You put a decorator on your test to say what kind of data it takes
It works with most testing frameworks
You write a small amount of code, but get a big amount of functionality tested
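To show the shape of the idea without depending on Hypothesis itself, here is a standard-library sketch. It is not Hypothesis's API: `given_int_lists` is a hypothetical stand-in for its decorator, and it does none of the clever things the real library does, such as shrinking failing inputs down to a minimal example.

```python
import random

def given_int_lists(runs=200, seed=0):
    """Hypothetical stand-in for a property-testing decorator: calls the
    test with many randomly generated lists of integers."""
    def decorator(test):
        def wrapper():
            rng = random.Random(seed)
            for _ in range(runs):
                size = rng.randint(0, 20)
                test([rng.randint(-1000, 1000) for _ in range(size)])
        return wrapper
    return decorator

@given_int_lists()
def test_sorted_is_ordered_and_same_length(xs):
    result = sorted(xs)
    assert len(result) == len(xs)
    # Every element is no larger than the one after it.
    assert all(a <= b for a, b in zip(result, result[1:]))

test_sorted_is_ordered_and_same_length()
print("property held for every generated list")
```

One short test body ends up exercising a couple hundred different inputs, which is the "small amount of code, big amount of functionality" trade.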

Notes: Boston Django Meetup, Intro to Flask, 25 June 2015

Samuel Taylor — Sat, 27 Jun 2015 20:44:53 GMT

On 25 June 2015, Ned Jackson Lovely spoke about Flask at the Boston Django Users Meetup Group. I took some notes and am putting them here so I don't lose them; they may or may not be useful to others. These notes are not exhaustive.

Functions that are decorated with @flaskapp.route() should return:

a string
a tuple of (response, status, headers)
flask.Response / current_app.response_class
a WSGI callable
In theory, your function could return another Flask app, or spit out a WSGI callable that generated a massive CSV on the fly and streamed it to the user
You can do testing:

def test_splash():
    client = app.test_client()
    response = client.get('/')
    assert response.status_code == 200
    assert b'form' in response.data
The Werkzeug debugger is awesome
Defining filters for templates is possible (and ostensibly simple)

Sessions in Flask are interesting

Session data is serialized to JSON, cryptographically signed, and set in the user's browser as a cookie
Because it's client side, it doesn't matter which datacenter they end up in
If you write to a "second level" in session, you need to set session.modified = True for the changes to get written out:

session['first']['second'] = 'new thing'
session.modified = True
Flashing is done with sessions, and is useful for displaying those one-time, web app-y messages like "Your post was submitted"
On a general Python note, contexts (with/as) are really cool -- you only have to implement two dunder methods (__enter__ and __exit__) to get the benefits of them
Some useful libraries: SQLAlchemy, WTForms
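The signed-cookie idea from the session notes above can be illustrated with the standard library. This is a sketch of the concept only, not Flask's actual implementation or cookie format; the function names and `SECRET_KEY` value are made up. The point is that the data is signed rather than encrypted: the user can read it, but can't change it without invalidating the signature.

```python
import base64
import hashlib
import hmac
import json

SECRET_KEY = b"change-me"  # hypothetical key; a real app keeps this private

def sign_session(data):
    """Serialize data to JSON and append an HMAC signature."""
    raw = json.dumps(data, sort_keys=True).encode("utf-8")
    payload = base64.urlsafe_b64encode(raw)
    signature = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()
    return payload.decode("ascii") + "." + signature

def load_session(cookie):
    """Return the session dict, or None if the signature does not match."""
    payload, _, signature = cookie.rpartition(".")
    expected = hmac.new(SECRET_KEY, payload.encode("ascii"),
                        hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        return None
    return json.loads(base64.urlsafe_b64decode(payload))

cookie = sign_session({"user_id": 42})
print(load_session(cookie))        # {'user_id': 42}
print(load_session("x" + cookie))  # None, because the payload was tampered with
```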

Blueprints:

Helpful when your app gets bigger
They're very similar to Flask objects with an additional "namespace"
app.register_blueprint(bp, url_prefix='/counter')
To get GET/POST data, use request.values

Deploying:

gevent, gunicorn, nginx, ansible
supervisor
pip freeze

Hackathon report: Hack@Teal 2015

Samuel Taylor — Sat, 27 Jun 2015 20:44:53 GMT

My friend Matt Tinsley and I were both RAs in Teal Residential College from Fall 2014 to Spring 2015. We are also both big fans of hackathons. Matt had the idea for and led the organization of a hackathon for Teal residents. I helped him out with logistics, and we worked on a project in the time we weren't doing organizer-y things.

The first two sections of this article relate to organizing the event, and the third is about the project I worked on.

Successes

Lots of things went well.

Learning -- Many people during demos said something to the effect of, "I've never done anything like this project before today. I learned so much." Our faculty master was impressed with how much people were able to learn, which was good seeing as he funded much of the event. Education was also a huge reason we wanted to have the event, so it was great to see our efforts pay off.
Team/idea formation -- At the beginning of the event, we had some time for individuals who came to the event to find a team. Matt and I stood in a circle with the individuals, and we all went around and said an idea we had. This format worked well; the four people we had formed two teams. I would recommend that organizers bring some simple ideas for people to work on; both of my ideas ended up getting used.
Venue -- Teal allowed us to use the media room, which had ample space. We were also able to bring over plenty of tables from a neighboring classroom.
Google Forms -- This was a good choice for our voting needs. I could imagine that larger hackathons would have too many votes for Google Forms, but it was just what we needed for the number of participants we had.
Food -- We had ample food, and people seemed to like it well enough.
Hosting -- We had one team who wanted to do some stuff with web technologies, but didn't have much of an idea what they were doing. I spun up a VPS through Digital Ocean, and Wes Cossick explained a few basic things to them. Like that, they were off!

Hardships

A few things were less than perfect.

Staying through the night -- I wish more people had chosen to stay through the night. Around 4 or 5 am, the room started to feel dead.
Judging -- Our faculty master did a great job of judging, but I think it would have been cool to have gotten a panel of judges rather than just one person.

My project

Matt and I worked on Emoji Predictor. If you've ever used a smartphone keyboard that suggests words as you type, you'll understand what it is. Our project suggests relevant emojis for you to use in your text messages.

We started with a database of all Matt's text messages. This would be our "corpus", or the body of text we would use to make inferences about which emojis should be used with which kinds of messages.

While making an iOS keyboard would have been really cool, we wanted to make a proof of concept and focus on the part of the project we found interesting: getting from a string to the emojis most relevant to it. We decided to make a web UI. I whipped up a simple application using Python, Flask, and JavaScript (our code can be found on GitHub).

While I was working on the UI side of things, Matt started working on the recommendation engine using Python Natural Language Toolkit. While he worked on that, I decided to work on a different implementation of a recommendation engine. I loaded all of Matt's sent messages which contained emojis into Elasticsearch and ran a query on that index using user input. This basic implementation ended up working decently enough.
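The gist of the search-based approach can be sketched in plain Python. This is a toy stand-in, not what we ran: Elasticsearch does real relevance scoring, while this version just counts shared words. The corpus, function name, and parameters are invented for illustration.

```python
from collections import Counter

# Hypothetical corpus of (message text, emojis used in it) pairs.
CORPUS = [
    ("happy birthday to you", ["🎂", "🎉"]),
    ("so happy for you", ["😊"]),
    ("pizza tonight", ["🍕"]),
    ("birthday dinner tonight", ["🎂"]),
]

def suggest_emojis(text, corpus=CORPUS, top_messages=3, top_emojis=2):
    """Suggest emojis from the stored messages sharing the most words with text."""
    words = set(text.lower().split())
    scored = []
    for message, emojis in corpus:
        overlap = len(words & set(message.lower().split()))
        if overlap:
            scored.append((overlap, emojis))
    # Highest-overlap messages first; the sort is stable, so ties keep corpus order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = Counter()
    for overlap, emojis in scored[:top_messages]:
        for emoji in emojis:
            votes[emoji] += overlap
    return [emoji for emoji, _ in votes.most_common(top_emojis)]

print(suggest_emojis("birthday party tonight"))
```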

Matt ended up having tons of trouble with Python and Unicode, so for demo purposes we went with my implementation. I thought our product was pretty neat.

Because it relied on Matt's personal information, a live demo unfortunately isn't up anywhere.

Conclusion

Despite a few minor problems, Hack@Teal went very well. I was glad to help, and I'd love to take part in organizing more hackathons. Because it was a small event, Matt and I were able to hack on our own project, which was fun and educational.

If you want more information (especially about other people's projects), please see the official website for Hack@Teal 2015, hackteal.me.

How to: remove Etsy search ads

Samuel Taylor — Sat, 27 Jun 2015 20:44:53 GMT

Install Greasemonkey or Tampermonkey

Firefox users: Greasemonkey
Chrome users: Tampermonkey

Install Etsy Ad Remover

Click this link to install

This script simply removes ads from search results on Etsy. If you're curious, check out the source code on GitHub.

Implementation

Etsy search ads have children with a CSS class of .ad-indicator. It's literally one line to remove those. This was a fun way to figure out how to make the web less annoying through using browser dev tools and Greasemonkey.

If you have any feedback, please contact me at sgt at this domain. Thanks!

Thoughts on Effective Java

Samuel Taylor — Sat, 27 Jun 2015 20:44:53 GMT

I'm reading the second edition of Effective Java in a group at work and writing some thoughts/notes about it here.

Chapter 2

Item 1 is about static factory methods.

The leader of my group offered a point that static factory methods can be hard to mock. I don't have a ton of experience at this time with mocking objects, so I haven't seen that first hand, but I'll trust him and keep it in mind for the future.

Item 1, advantage 4 says that static factory methods are good because they reduce verbosity in creating objects. The example they give is:

Map<String, List<String>> m = new HashMap<String, List<String>>();

Java 7 introduces the diamond operator, so this can become:

Map<String, List<String>> m = new HashMap<>();

thereby negating the verbosity-decreasing benefit of static factory methods.

I'm not trying to say the book is wrong. It says "Revised and Updated for Java SE 6" on the cover, and for Java 6, that seems like a valid argument. I just think it's interesting how new language features can change what constitutes a best practice. The book even says, "Someday the language may perform this sort of type inference on constructor invocations as well as method invocations."

Item 2 is about builders.

In the process of talking about builders, the author talks about the JavaBeans pattern. This pattern seems like a terrible idea to me; forgetting to set a required parameter is relatively easy and could have disastrous results. The Builder pattern seems like a better choice because it's a way to give the compiler more information. I would rather have my IDE yell at me at compile time that my object isn't instantiated correctly than wrestle with bugs at run time.

Builders do introduce more code to write/maintain/test, but (as my group leader pointed out) the IDE can generate the class for you.

The book has required parameters going in the builder's constructor and optional parameters being set by additional methods. My question is: what if the number of required parameters gets large? Then you haven't solved your problem at all. One option would be to move the required parameters into methods, but then you're not providing the compiler with the information to know that some of the parameters are required. Yes, the build() method can check for them and perhaps throw an exception, but that only happens at run time.

I think that if you have so many required parameters, there might be something wrong with your design. Perhaps some arguments are logically related and should be combined into an instance of a new class which binds them together. I'm not sure if this is always the case, but it seems like it often would be.

Item 3 talks about singletons.

Our group leader told us to beware of the hidden state that often comes along with singletons. I have definitely run into that issue. When functions use state that is not from a parameter, things can get tricky. Knowing what state is used and how that will affect the execution of the function can be difficult for the developer.

While the book says that, "a single-element enum type is the best way to implement a singleton," our leader disagreed. He argues that using an enum is less readable. I had a similar gut feeling when I first read this part of the Item; I would not have thought to use an enum to implement a singleton. In my mind, an enum represents an enumeration of a few different kinds of things; a singleton is something that there will only ever be one of. These two ideas seem at odds.

Item 4 talks about noninstantiable classes. These classes are often used for utility methods.

Again, apparently static things are difficult to mock. And Java's garbage collector has apparently gotten good enough that doing something like (new FileUtility()).getPermissions(file) will result in the created FileUtility being garbage collected very quickly. This all happens fast enough that there is very little performance impact.

Item 7 says to avoid finalizers. I learned C++ in school, so the lack of guarantees with finalizers throws me off. In any case, I've never heard someone seriously advocate for finalizer usage.

Chapter 3

Item 8 is about the general contract for equals. The terms used in the contract are familiar from discrete math.

For value objects, you want to override equals(). equals() gets tricky when inheritance comes into play, though. One solution to that problem is to not use inheritance with value objects -- make your value objects final. Inheritance can be useful for business logic, so feel free to use inheritance in that case. For classes that implement business logic, though, it doesn't make much sense to implement equals().

An interesting thing I hadn't thought about is that instanceof checks for null:

public class Main
{
    public static void main(String[] args)
    {
        String nullString = null;
        System.out.println(nullString instanceof String);
    }
}

Output:

false

One suggestion was to use an EqualsBuilder, like the one supplied in Apache Commons. Apparently, equals builders will help you avoid NPEs. To me, this seems like a cop-out. I don't think it's too terribly difficult to write an equals() method which won't throw an NPE; perhaps I haven't written enough Java.

Item 9 is about hashCode(). Equal objects must have equal hash codes.

The book contains an overview of how to write a hashCode() method that's good enough. Ground-breaking, cutting-edge, crazy high-performance hashing functions are going to be class-specific. Sometimes, this is fairly obvious; if your class has a unique ID, you can just return that as your hash code.

Item 10 is about toString(). This method should only be used for debugging or logging purposes.

The StringBuilder class gets this wrong--its toString() method is used for programmatic access to the string that is being built. It should probably have another method called build() to provide programmatic access, and leave toString() for logging purposes.

Item 11 is about clone(). To be frank, I find the Cloneable interface confusing, and I haven't run into a good use for it.

Item 12 is about compareTo(). The hard thing with compareTo() is that it's not particularly explicit about what the "natural" ordering means. By contrast, a Comparator<> can have a name which gives developers more information about how the comparison is done. This explicit information is probably good.

As of Java 7, if you break the general contract for comparisons, sorting may throw an IllegalArgumentException.

How to: get free WiFi at coffee shops

Samuel Taylor — Sat, 27 Jun 2015 20:44:53 GMT

Some coffee shops place time limitations on their WiFi. For example, I recently went to a Panera that had a 30 minute time limit on their WiFi during lunch hours.

Getting around such limits isn't difficult. I'm not sure how ethical it is to do so, so consider this all merely educational information.

It seems these kinds of systems track you based on your MAC address. If your MAC address changes, the system thinks of you as a new user. Changing your MAC address is easy enough (on Linux):

Figure out what interface you're using.

Run ifconfig and look for the one that looks right. Mine was wlan0.

Change your MAC address

ifconfig wlan0 down
ifconfig wlan0 hw ether a2:b2:c3:d4:e5:f6
ifconfig wlan0 up

For ease of changing, you can make a script that looks something like:

#!/bin/bash

ifconfig wlan0 down
ifconfig wlan0 hw ether "$1"
ifconfig wlan0 up

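And then run it as root, like ./change_mac.sh a2:b2:c3:d4:e5:f6. The first octet must be even, because an odd first octet marks a multicast address, which an interface will not accept.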
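If you'd rather not make up addresses by hand, a few lines of Python can generate one. This is a sketch, and the function name is my own. It sets the "locally administered" bit of the first octet and clears the multicast bit, which keeps the address unicast and out of the ranges assigned to hardware vendors.

```python
import random

def random_mac(rng=random):
    """Return a random locally administered, unicast MAC address."""
    first = rng.randrange(256)
    first |= 0b00000010   # set the "locally administered" bit
    first &= 0b11111110   # clear the multicast bit so the address is unicast
    rest = [rng.randrange(256) for _ in range(5)]
    return ":".join("{:02x}".format(octet) for octet in [first] + rest)

print(random_mac())
```

The output can be passed straight to the script above as its argument.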

If you want to get in touch, email me at sgt@samueltaylor.org. Thanks!

Hackathon report: HackTX 2014

Samuel Taylor — Sat, 27 Jun 2015 20:44:53 GMT

Idea

Most college students are aware that registering for classes can be a real pain. At Baylor, we don't have a waitlist on many of our classes. If you fail to get into a class, you are doomed to logging in several times a day and hoping a seat is open. This process is a waste of time; we wanted to automate it. CourseWatch makes registering for classes suck less.

We thought about creating a mobile app that would run in the background on users' phones. Such an app would require no infrastructure on our end, which would be great. However, it would require users to download the app and enter their login credentials. We decided to go with an SMS-based solution. Users would text us a course number, and we would text them once their class opened up. Because signing up was as simple as sending a text, this option would have low friction, which is good for user acquisition.

Here's the project page on HackerLeague, which has some basic information about the project.

Project

I set up a private GitHub repository on my account and added Wes and Evan as contributors.

After we found a work space and set up our laptops/devices, Wes started working on creating the screen scraper, Evan started researching other colleges' registration systems, and I started setting up our server. We decided to use Microsoft Azure because the Azure team was offering free hosting. I set up an Ubuntu server with Apache, MySQL, and PHP.

Wes helped me set up dploy.io to deploy our code automatically from the master branch of our repository. I'd not used a continuous deployment service before and was pleased by the ease with which it let us deploy our code. A pitfall with using this service was that because our changes to master were automatically deployed, it was easy to get sloppy and push untested code to master in order to test it on our server. This problem is our own fault and only requires more discipline to fix. Within a hackathon, though, it was not a major issue.

Once the server was set up, I got to work handling inbound SMS messages. I had experience with using Twilio for SMS from HackTX 2013 and chose to use it again because it's easy to use and inexpensive.

After a few minutes of work, Evan finished with his research and wanted to continue helping. We didn't have a clear design at that point, so I didn't know what he could work on. As Wes continued to work on the screen scraper, I spread out a napkin and started drawing the system out. Through this process, I identified three main areas: notification subscription (which I was working on), screen scraping (which Wes was working on), and notifying users. Being that nobody was working on notifying users, Evan started working on that.
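The napkin design boils down to something like the sketch below. It is written in Python purely for illustration (the real thing was PHP with Twilio), and the function names, course numbers, and phone numbers are made up. The scraper and the SMS sender are passed in as plain functions so the three areas stay separate.

```python
from collections import defaultdict

# course number -> set of phone numbers waiting on it
subscriptions = defaultdict(set)

def subscribe(phone, course):
    """Notification subscription: remember who wants which course."""
    subscriptions[course].add(phone)

def check_and_notify(open_seats, send_sms):
    """Notify and unsubscribe everyone whose course has an open seat.

    open_seats(course) -> int stands in for the screen scraper;
    send_sms(phone, text) stands in for the SMS provider.
    """
    for course in list(subscriptions):
        if open_seats(course) > 0:
            for phone in sorted(subscriptions.pop(course)):
                send_sms(phone, "A seat opened up in " + course)

# Example run with fake scraper and SMS functions.
subscribe("+15550001", "CSI 3344")
subscribe("+15550002", "MTH 1321")
sent = []
check_and_notify(lambda course: 1 if course == "CSI 3344" else 0,
                 lambda phone, text: sent.append((phone, text)))
print(sent)  # [('+15550001', 'A seat opened up in CSI 3344')]
```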

Evan had never used PHP before, so he needed some direction. He sat at the keyboard, and I sat next to him. Within an hour or two, he had figured out enough PHP to be dangerous, so I moved to work on subscription through SMS.

Wes sort of reverse engineered BearWeb, which was an interesting and tedious process. He opened up Charles and clicked around Baylor's internal course registration website. He would then perform the same requests in PHP using curl. After several moments where everything seemed hopeless, he eventually got everything figured out.

A little after midnight, we had version one done. You could text in a course number and would get notified a few minutes later that it was open (registration wasn't open yet, so all seats were open). Wes then set to creating a website for the product while I worked on adding SendGrid integration so that users could sign up for notification through email. The SendGrid API was also a pleasure to work with.

By around 3:00, the website was looking good enough and we had gotten SendGrid integration working. At this point, we decided to get some rest. We napped in the hallway for a few hours then got up to figure out our presentation.

Presentation

HackTX 2014 did "science-fair style" presentations, which I liked. Each team had a spot on a table, and a number of judges came around to check out each project. During this time, participants were encouraged to walk around and check out each other's projects. I was able to check out a few of the projects near us, but did not spend much time looking around; I wanted to get feedback on what we had made.

We received a lot of positive feedback from judges and other participants. People liked the high level of polish in our product and presentation. They also believed we were solving a real problem in a good way. Unfortunately, we weren't chosen to present during the closing ceremonies. On the plus side, we won three sponsor awards:

TripAdvisor prize -- 3 day trip to Boston
Best use of Microsoft Azure -- Dell XPS U12 laptop
Best use of SendGrid -- Jawbone MINIJAMBOX

Summary

HackTX 2014 was great. We had fun, built a great product, and won some awesome prizes.