<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"><channel><title>Samuel Taylor – Blog</title><link>https://www.samueltaylor.org/utm_source=rss&amp;utm_campaign=st-blog-feed</link><description>Building good software is challenging but rewarding. On my blog, I share the things I'm learning as a practitioner in this exciting industry.</description><lastBuildDate>Thu, 21 Oct 2021 01:04:10 GMT</lastBuildDate><generator>PyRSS2Gen-1.1.0</generator><docs>http://blogs.law.harvard.edu/tech/rss</docs><image><url>/static/img/me.jpg</url><title>Samuel Taylor</title><link>https://www.samueltaylor.org/utm_source=rss&amp;utm_campaign=st-blog-feed</link></image><item><title>How to Join a New Team and Learn a New Codebase</title><link>https://www.samueltaylor.org/articles/how-to-join-a-new-team.html?utm_source=rss&amp;utm_campaign=st-blog-feed</link><description>&lt;p&gt;Alternate title: &lt;em&gt;new codebase who dis?&lt;/em&gt;&lt;/p&gt;
&lt;div class="embed-responsive"&gt;&lt;iframe width="560" height="315" src="https://www.youtube.com/embed/O2eEfPtaWA4" frameborder="0" allowfullscreen&gt;&lt;/iframe&gt;&lt;/div&gt;

&lt;p&gt;Delivered at:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.southerndevfest.com/speakers#h.dv39kanwh16k"&gt;Southern DevFest 2020&lt;/a&gt;. Slides available &lt;a href="/static/pdf/ncwd_sdf20.pdf"&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Find me on Twitter &lt;a href="https://twitter.com/SamuelDataT"&gt;@SamuelDataT&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Transcript&lt;/h2&gt;
&lt;p&gt;0:00&lt;br /&gt;
Alright, yeah, thank you all so much for being here. Great to be here. Let's get started. When I started studying computer science in college, I realized that the software industry is kind of like this pizza. Stay with me, I know, that's a weird statement. When I was in college, I'm basically looking at this pizza, and I can see this little triangle. And I see that there's some pepperonis on it. And these pepperonis are kind of like the things that I knew about at the time. So I'm thinking, when I go into the software industry, I'm going to be all about figuring out how to balance AVL trees and do quadratic versus linear programming and hash tables, and all this stuff that I was learning in my classes. And all of that stuff was super interesting and, I think, really valuable. But what I realized, once I got into the industry, was that it's actually a little bit more like this: the software industry is this whole pizza, and we really have a lot of other things in it that are not just really interesting algorithm problems. There's a lot of other stuff in there, too. And so this part of the pizza that has just cheese on it, it's still really great, still really important. But it's things like knowing how to set up Jenkins builds, knowing how to configure IAM properly, figuring out your build tools, like understanding how Maven works. All those kinds of things are super important. And without them, you would not have the pizza, basically. So what I want to talk about today is one of these particular areas that is not a pepperoni: we're going to be talking about how to join a new team, and specifically how to learn a new codebase. I think that it is really difficult to teach this. And I really wish that when I had been in school, someone had at least tried. So that's what I'm going to try to do for you in the next little bit of time. We'll see how it goes. 
I think there'll be plenty of tips in here for you if you are a new developer, and if you are relatively experienced, I hope to have a few tips in here that'll be helpful too. I know there are a lot of other really interesting talks. So before I jump in, I'll let y'all have a chance to go; there are other cool tracks happening. If you don't want to learn about learning a new codebase, now's the time for you to leave, because that's what we're gonna be talking about today. This is the way we're going to do that. Right now, you can probably tell that you're in the introductory period. After that, we're going to talk about what I think you should do on day one when you join the team. And then we're going to talk about the mindset that you should have when you are reading and writing code, before moving on to the process that I recommend, and always try to use, when reading and writing code. And then finally, we'll talk about some tools that can make this all a little bit easier. So on day one, there are three things that I personally believe you should do. The first of those is to set up your development environment. And when you're doing this, you should make sure you are really paying close attention. There are a lot of hints in this process that can help you understand what sort of world you're getting into on this new team. So you should try to understand what services you're running as you get your development environment going. For instance, if you're suddenly running Redis, you might think to yourself, ah, we have some sort of caching layer, I wonder what that's about. This can also give you some hints as to what dependencies exist between services. So for instance, if you have one project that needs to be running before you get another project running, that gives you a hint that that second project probably needs the first project to run and there's some interrelation there. 
You should also, when you're doing this, take specific notes on what you're doing. You can do this on a piece of paper; I find it really helpful to just open up some sort of note app on my computer and literally copy and paste in the commands that I'm running. I tend to come from the Python world, and anyone who has ever tried to set up a Python development environment has found it really helpful to know later exactly how they screwed up their system Python installation. So it's good to make sure that you know exactly what you're doing. The last thing I would say regarding setting your development environment up is that ideally the project that you're working on will have some sort of test suite. And if you can get that test suite running and make sure that all the tests are green, that probably means that you've got things set up correctly. So that's a win on its own. But once you've got all those green tests going, the next thing that you should try to do is break those tests: literally open up a random file in the project, delete some lines, and see what tests break. This will start to give you an understanding of what pieces of code relate to what other ones.&lt;/p&gt;
&lt;p&gt;4:36&lt;br /&gt;
The next thing you should do on day one is ask some senior members of the team to give you some sort of overview of the architecture that you're gonna be working on. The first thing that you can do here is try to find some sort of document or diagram. It doesn't always exist, but ideally there's some sort of wiki page that you can go to and see a drawing of the way that things are laid out. And when you do that, make sure you check when the page was last edited, because in some places, in some situations, the architecture of a system can change pretty rapidly. And if you're looking at something that is old, it might not relate to the current world. And it would be really sad for you to learn the wrong things. You should ask your team a lot of questions. In a healthy team that is functioning well, people will be excited to answer your questions, because they know that answering them is going to help you become a more effective teammate. Even just for selfish reasons, when I have new people joining teams that I'm on, I want to give them as much information as I possibly can, so that they can start doing some of this work and I don't have to do it. That's the whole goal. So don't ever feel bad about asking questions. A few questions you should think about asking: What repositories do we own? What things are we working in frequently? Are there any repositories we share with other teams? How is this all structured? Another great question to ask is, how does a feature get from running on my laptop to actually being visible in production? This will give you some really good insight into the deployment pipeline, which can be really helpful to know about. Depending on what size of company you're in, you may need to know a lot about that, or you may not need to know very much about it at all. 
And you will also want to ask what vendors and what APIs we are relying on, so that you can get a sense for what those things are. A really helpful early win you can provide to the rest of your team is, at the end of this process, to create a new architecture diagram, or at least update the existing one if it does already exist. This is a nice way to build some early goodwill with your team. You can roll up to everybody and say, hey, I made this architecture document, I believe it's accurate based on what I've heard, here's what I have. And this can be something that's really helpful not just for the other people on your team, but also for new members who will probably join your team in the future. The third thing that you should do on day one is figure out what the business does. Because if you don't, you are completely doomed. You will never be successful if you don't understand the business, at least to some degree. It's good to know what the mission of the company is, what sort of products it offers, and the goals that it has. Different companies will do this in different ways; they will use different goal-setting frameworks. But it's really nice to know what we are trying to do as a company. Once you know that, you need to know how your team is going to contribute to that. So it's, you know, we work on this part of the product, which helps us achieve this specific goal, and the way that we can achieve that specific goal is to, you know, improve conversion rates over here, something like that. Great questions to ask include: How can we impact the goals of the company? Who will get mad if our code breaks just horrifically? This is a really useful one, because it not only tells you who relies a lot on your software, but it can also give you a sense for how careful you should be. Obviously, it's never a great idea to break production. 
But it's definitely good to know: if I break production for one minute, am I going to lose the company one cent or $1 million? There's a significant difference between those two things. Understanding those trade-offs and the context in which you're working is very important.&lt;/p&gt;
&lt;p&gt;8:18&lt;br /&gt;
Okay,&lt;/p&gt;
&lt;p&gt;8:19&lt;br /&gt;
that's day one. The next thing that I want to talk about is the mindset that you should use when you are working in a new codebase, particularly on a new team. This can all be summarized as: learn by doing. There's a book that I read recently that I really enjoyed called Ultralearning by Scott Young. This book is about how to teach yourself things. Scott Young is this sort of autodidactic figure who has spent a lot of time thinking and writing about productivity and learning. And he's done a bunch of crazy things, including going through all four years of an MIT computer science curriculum in one year. And so then he wrote this book about people who do those kinds of crazy learning projects. And it's really interesting to hear about some of the principles for learning effectively. One of the ones that resonated super strongly with me is the idea of directness. Scott Young says, the easiest way to learn directly is to simply spend a lot of time doing the thing you want to become good at. In this case, we want to spend a lot of time working on this team. We are joining the team to produce valuable software. And so what you should do is just start doing it: start working on the software, start building the stuff. That's the best way to get better. When you are making an impact, that's the best way that you're going to be able to gain any kind of deep understanding. What you want to make sure you're doing as you develop and read is to understand the code that you are trying to work on well enough that you can make the change that you need to make, to have the effect that you want to have, to make an impact on your team and on your product. What you don't want to do is read every single line. I think this is a trap that can sometimes be easy to fall into: thinking, I need to understand every part of this program, I need to understand every part of the architecture, before I start messing with things. 
And that was something that I sort of felt at the beginning of my career, and I now understand that you will gain that understanding over time. And actually, sort of counterintuitively, the fastest way to gain that understanding is going to be by doing the work itself, rather than reading stuff without any sort of directed goal. Trying to say, you know, I need to make our scoreboard service be able to produce 20 high scores instead of 10: that gives you a goal to work toward and makes your brain think about things harder, and makes sure that you're understanding things better.&lt;/p&gt;
&lt;p&gt;10:43&lt;br /&gt;
When we talk about reading code, I worry that sometimes we've overloaded the word read too much. Reading a book is incredibly different from reading code, and I don't think the differences between those are elucidated strongly enough. For an example, imagine if the Lord of the Rings series was written as a software artifact. You would get things like, okay, well, there's not three books that are, you know, this long; you would have maybe 1000 books that are all four pages long. And you'd have a case where some of these little tiny books are bound together. And there's this one book that has all the magic in it, except that any magic that's related to grass is actually over in this other one for some weird historical reasons. And actually, all the sword fighting also happens in that one, except that sword fighting with magic is back in the magic one. And you can see how this would get very complicated very fast. In writing a book, you're trying to create a coherent narrative for people to follow. And ideally, when you are writing code, you are also trying to create a coherent narrative. But the structures we have for doing that are dramatically different. When we're writing code, we don't just have one very long file. If you do have just one very long file, that's not a great time. Ideally, you're able to break these things up into smaller units that can then be understood more easily. And that's something that I think is an interesting concept: your knowledge will grow in sort of a recursive way. So you start at the top. On day one, you know, you're getting this architectural overview, you're understanding what the various services do. And then you're going to get a ticket. 
Ideally, your team has teed something up for you that would be a good fit for a new person, to help them learn some pieces of the system, and you get this new ticket. And so now you need to figure out, okay, I'm gonna have to modify this service in order to get this thing done. And then you're going to dive into that service and say, okay, here are the modules that are in that service, or, you know, however the service ends up being organized, here's what these different things are and more or less what they do. And then you can try to understand, okay, I think I'm probably going to need to modify this particular module. What are the classes that are in here? What are they all doing? And you can progress down this path until, at a certain point, you get to an individual line of code. I think, if you're writing code, reading a single line of code is probably very easy for you. Unless there's some weird syntax that you're not familiar with, reading the line is relatively self-explanatory. You can see a line and say, this line is incrementing by one, or this line is creating a new database access object. That's not particularly complicated. It's all the steps above it that are hard. And what I would like to say is that, in this sort of recursive pattern, when you get down to understanding the level of individual lines of code, the next thing you should be doing is creating chunks. What I mean by chunks is this concept from neuroscience, where chunking refers to the brain's ability to bind detailed information to a concept that is easy to remember. 
So for instance, I don't have to remember that a particular function call gets a handle to the database, then builds a query that it wants to run, includes certain parts of the query based on the parameters passed to the function, runs the query, handles errors, comes back with my result, and modifies it in some way. There are all these very detailed steps that are happening inside this function call. But ideally, at the end of the day, I can see the function and say, okay, I know what all the detailed steps of this are; I know that this function is getting high scores. And that's a much easier concept to remember and think about, and reason about, and stick inside of a human brain than trying to remember 12 detailed steps, 12 individual lines of code, all at once. In other words, all we're trying to do here is make sure we can see the forest for the trees. We don't need to focus on every individual tree, every individual branch on that tree, every individual pine needle. What we need to see is, that's a tree; what we need to know is, this function gets high scores. The last tip I have in terms of mindset is that there are a couple of different ways to think about this that I, at least, find useful. The first one is to think about it in terms of code paths. So you might think, okay, the first step is that somebody makes an HTTP request to /puppy. And then that calls my get puppy method that I have my routing set up to go to, and then get puppy uses my puppy manager object to get a random puppy, and then inside get random puppy, we're actually running this query through our database accessor. This is one way that I find really helpful to think about these things. And again, drawing these little diagrams as you're going along can be really useful.&lt;/p&gt;
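&lt;p&gt;The code path described above can be sketched in a few lines of Python. This is only an illustration: the talk names a get puppy method, a puppy manager, a get random puppy call and a database accessor, but the class names, the routing table and the in-memory data below are all assumptions made up for the sketch, not the speaker's actual code.&lt;/p&gt;

```python
import random


class DatabaseAccessor:
    """Hypothetical stand-in for a real database layer (an in-memory list)."""

    def __init__(self, puppies):
        self._puppies = list(puppies)

    def query_all_puppies(self):
        # A real accessor would build and run a query here.
        return list(self._puppies)


class PuppyManager:
    """Chunk to remember: this object hands back puppies."""

    def __init__(self, accessor, rng=None):
        self._accessor = accessor
        self._rng = rng if rng is not None else random.Random()

    def get_random_puppy(self):
        # Step 3 of the code path: the manager asks the accessor for data.
        puppies = self._accessor.query_all_puppies()
        return self._rng.choice(puppies)


def get_puppy(manager):
    """Step 2 of the code path: the handler the router calls for /puppy."""
    return {"puppy": manager.get_random_puppy()}


# Hypothetical routing table mapping a request path to its handler.
ROUTES = {"/puppy": get_puppy}


def handle_request(path, manager):
    """Step 1 of the code path: a request arrives and is routed."""
    handler = ROUTES[path]
    return handler(manager)
```

&lt;p&gt;Reading from the bottom up follows the request: handle_request looks up the route, get_puppy calls the manager, and the manager calls the accessor. Once that path is traced, each function can be remembered as a single chunk.&lt;/p&gt;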
&lt;p&gt;15:34&lt;br /&gt;
The other way to think about this is the way that data flows through the system. Different systems will make more or less sense to think about in one of these ways or the other. And so when you're in a situation where you're not sure, try to do both, and see which one helps you the most, is what I would recommend. When talking about data flows, I mean, maybe we think about the scores living in some database, and then that flows into a score data access object that happens to know information about scores. What's interesting to think about is what objects know what things at what point in the program, and how that data flows through the system. So maybe scores are known about by the score data access object. And then the score controller might use score data access objects. And then finally, that information might end up getting passed through to some sort of front-end client, a scoreboard.js, for example. This is a bit of a contrived example, but hopefully the idea is clear. So that's more or less the mindset. Actually, before I talk about process, there's one more thing on mindset that I forgot to make a slide for. Sorry, you'll just have to listen to my dulcet tones. One last mindset thing is: you might not be wrong. I know that I have struggled sometimes when I join a team with assuming that the team I'm joining has gotten everything perfectly right, that they've thought through everything super well, and that any suggestions I might have are foolish in some way. And I think that is a bad instinct to have for a number of reasons. Firstly, because no one is perfect. So the team might actually have just made a mistake. And you being able to bring up, hey, is there a reason we're doing this in this way? I've been used to seeing it this other way, can be really helpful. It can help spur really good discussion, and can also help lead to new insights. 
For instance, if you have some particular way you really think is the best way to implement a singleton, and you bring it up with your team, and they're like, oh, actually, we've been doing singletons this other way, then at least one of you gets to learn something by the end of it. Either you learn that your way is not optimal, or they learn that your way is optimal. But either way, there's learning happening, and that's a good benefit. So don't think that you're always wrong about things. Definitely bring up your concerns and your thoughts and try to learn stuff. It's very helpful.&lt;/p&gt;
&lt;p&gt;17:57&lt;br /&gt;
Okay, let's talk about process. The short version of this is that I think the scientific method is one of the most impressive achievements of the human species. Being able to understand how to gain knowledge about the world in a scientific way, I think it is incredible that we have figured out how to do that. And so I think we would be remiss not to use that same process when we are reading and writing code. So when we get a ticket, this is roughly the process that I think we should follow. The first step is going to be to find out what code is relevant. So if I'm trying to work on something about a scoreboard service, I probably don't need to worry about user authentication; that just isn't particularly relevant. So what we need to do is identify what that relevant code is. The next thing we need to do is form a hypothesis about what we need to change. In order to do this, we're going to have to understand the code that we're looking at and working on well enough to actually form this hypothesis. The next step is going to be to test your hypothesis. So make your change and see if it was correct. At this point, if there are any people in the YouTube comments who are really into test-driven design, I'm sure you're blasting off about how test-driven design is the coolest thing ever, and it's really useful. I completely get it. There's a school of thought that says, even before you make your change, you should write a little test that makes sure what you think is going to happen is going to happen. And I completely agree that that is a really useful tool to be able to quickly understand whether your hypothesis was correct. Even before you make your changes, it's nice to have that specifically written out. Still, at times it can be difficult to do this. And so it's also completely acceptable, in my opinion, to just make your change and then manually verify whether it worked. 
Now, just because you verified that it works doesn't mean that you're done. The last step is to improve the quality of whatever it is that you've just written. So if you didn't write a test at first, writing tests now is super important. If you avoid writing tests for your code, the code that you write immediately becomes legacy code, and very difficult to maintain. What's going to happen if you don't write the tests is that later something's going to change, and you're not going to know things are broken when you should have already known that from automated testing, or you're gonna need to refactor and it's going to be a huge pain. Write tests, please. And then other things, like improving the legibility and maintainability of your code, I think are really important to do at this stage. Because code is read far more often than it is written, make sure that you're optimizing for the reading case. I know that it's sometimes convenient to write code and leave variable names as i, or have really short abbreviations for function names. I know that's very nice for the writing case. But code is generally written once, and then it's going to be read multiple times, potentially tens, hundreds, thousands of times, by many individuals potentially. So you want to make sure you're optimizing for the reading case, and making sure that it's very easy to read and understand the code that you're writing. And that's what you should be doing in this last step. Okay, let's talk about tooling. There are a few different sets of tools that I think are useful. But as far as why I think these tools are useful, let me say that developing software is generally the most difficult part of the job as a software engineer. At least in my experience, it is one of the most mentally taxing parts. 
And what that means is that any win we can get in terms of decreasing the mental load we have to be under when we are doing this is a huge win. Not only does it mean that the job is just easier, which is a nice thing, but think about our level of skill and our ability to execute on something. Of course, we can grow our max skill by practice over time. But at a certain point, if we are maxed out, we are only capable of doing so much. And then one way that you can achieve more is by using tools that help you offload some of that cognition elsewhere. And so that's the way that I try to think about tools. It's kind of like the old Steve Jobs quote: it's a bicycle for the mind. It's going to help your mind go faster and be more efficient. And these tools are really great. They're not a replacement for thinking, but they can make the process far easier. You certainly could get by with just using grep and a text editor, and you could probably figure out everything. But it's going to be a lot harder and a lot less efficient than using some of these tools.&lt;/p&gt;
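&lt;p&gt;To make the hypothesis step of the process concrete, here is a minimal sketch of writing the test before (or right after) the change. It reuses the scoreboard example from earlier in the talk. The function name get_high_scores, its limit parameter and the sample data are assumptions for illustration; a real service would read scores from its database.&lt;/p&gt;

```python
def get_high_scores(scores, limit=20):
    """Return the top `limit` scores, highest first.

    Hypothetical function: the change under test is raising the
    default limit from 10 to 20.
    """
    return sorted(scores, reverse=True)[:limit]


def test_returns_twenty_high_scores():
    # Hypothesis: after the change, the scoreboard returns 20 scores, not 10.
    scores = list(range(100))
    top = get_high_scores(scores)
    assert len(top) == 20
    assert top[0] == 99   # the highest score comes first
    assert top[-1] == 80  # the 20th highest score comes last
```

&lt;p&gt;If the test fails, the hypothesis about what needed to change was wrong, and that is useful information in itself. A test runner such as pytest would pick up the test function automatically; calling it directly works as well.&lt;/p&gt;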
&lt;p&gt;22:51&lt;br /&gt;
Um,&lt;/p&gt;
&lt;p&gt;22:52&lt;br /&gt;
I kind of think about tools in a number of broad categories. The first category is for the step when we're trying to find what code is relevant. There are a few things that we should do there. The first is to just run the code. So if you have some sort of web application, just get it running and understand what exists. Go click on the page that you're supposed to add a button to, and try to understand: where am I going to be adding this? Do I need to create a whole new page? Do I just need a button somewhere? Understanding the context that you're working in is going to really help you when you're writing this code. And another tip, which didn't occur to me until far later than I care to admit, is that using the debugger is not just helpful for when you've found a bug and you're trying to understand what's going on. It's also super helpful for completely working code, because you can run your program with the debugger enabled and just start clicking through and seeing what lines of code get executed to do various things in the system, which is invaluable. You can hardly gain that understanding in any other way, or at least not in any more efficient way, than just using the debugger and seeing what lines of code get run. Other than just running the code, I think searching in the project can also be a really helpful way to find relevant code. What I mean when I say searching the project is going to vary from company to company significantly. Some common tools that people use for this kind of stuff are JIRA, Asana, and Pivotal Tracker; GitHub and GitLab both have issues features. There are a lot of other tools that people use to track all this stuff. And it would be foolish not to at least try to find information in these things. A lot of times we are standing on the shoulders of giants in this field, and we are able to leverage some amount of past work in our current work. 
And what's useful is to understand what has already been done before you got into the situation, so that you know what you can reuse. This will help you, again, get further along than you would be able to if you were just starting from scratch. I cannot tell you how often I have heard other engineers and developers say things like, oh, the only reason I was able to get this done so fast is because I saw someone else in the company had done a project that was similar to this, and I was able to copy the config that they had for their Spark job. And that enabled me to get my Spark job running a lot more easily and not have to run into all these weird errors and things like that. So searching the project is super helpful. One way that looks is to just literally open up JIRA. I went to the Spark JIRA, because it's an open source project and it's all available, so I'm allowed to show it to you. If I'm working on a ticket in Spark about date times, let me just search date time in here and see what comes up. And then you can change the way that things are ordered, understand what's happening in any given ticket, and give a glance over some of these things. Understanding what work has come before you can be super helpful. It's also helpful to find out, sometimes you'll get a ticket, and it will be something somebody has already done. And you can sometimes find that by doing this. This is what the GitHub issues UI looks like. This can be really helpful for understanding what problems people are having and what areas people are working in. In large companies that can be particularly valuable to know. Sometimes you don't even know who to ask about something. And if you can figure out who has been working in this area, they can often help you along and guide you along the path. 
And the other broad category of tools that I would describe when we're talking about finding relevant code are code search tools. I use one called The Silver Searcher maybe every day; it's super useful. There's another tool that's very similar to it called ripgrep that I haven't used, but I've heard is really good. And just to give you an idea of what that looks like, if I'm working on a ticket that's about phonemes, for instance, I might go into my project and just type ag phoneme in the console, and see everywhere in the code that the word phoneme is used. And now I might know, oh, hey, look, we have something in src/app.py. And I can go in here and understand, this is where we're talking about phonemes. It's a really good way to just figure out where this thing is even being used or talked about in the codebase, and it can help you find those relevant services, modules, classes, etc.&lt;/p&gt;
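&lt;p&gt;For a sense of what a search like ag phoneme is doing, here is a deliberately tiny sketch of a project search in Python. It is not how the dedicated search tools are implemented (they are far faster and add conveniences such as skipping ignored files); the function name search_project and its parameters are made up for this illustration.&lt;/p&gt;

```python
from pathlib import Path


def search_project(root, term, suffixes=(".py",)):
    """Yield (path, line_number, line) for every line containing `term`.

    Only files whose extension is in `suffixes` are searched.
    """
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for number, line in enumerate(text.splitlines(), start=1):
            if term in line:
                yield path, number, line.strip()


if __name__ == "__main__":
    # Search the current directory for the word from the talk's example.
    for path, number, line in search_project(".", "phoneme"):
        print(f"{path}:{number}: {line}")
```

&lt;p&gt;The output format, a file path followed by a line number and the matching line, is the part worth noticing: it points straight at the files and lines where the term appears, which is usually enough to decide where to start reading.&lt;/p&gt;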
&lt;p&gt;27:19&lt;br /&gt;
So these are really great tools if you're working on code that you have checked out locally, because they work on the code that's present on your hard drive. But they're not as great when you have a large number of projects that might be spread throughout the company, and you don't necessarily have all of it checked out onto your laptop. And you might need to know, hey, I'm going to modify this thing in our service, but I want to make sure it doesn't break everybody else's. This is where these other tools can come in handy. Things like OpenGrok or Sourcegraph are really helpful, because they let you search the entire codebase of a certain organization, namely your organization. GitHub and GitLab both also have a feature where you can search code within your specific organization. You can also search issues within your specific organization, that kind of thing. This is what OpenGrok looks like; I found this screenshot on the internet. What you can see happening here is that this person has searched for util, with anything before and anything after it, as a definition. And so this will look specifically for places where a class that has util in the name of it is defined. And so, you know, this person can now say, hey, this finds file util, form util, etc. This is, again, a really useful way to understand what code is relevant. When we talk about understanding code, which is the next step in the process, I would highly recommend using an IDE. Again, you can totally get away with just using a text editor and grep. And if you want to superpower yourself with those search tools, that's also really helpful, because you can do things like, hey, this code is calling file util dot something, what is that meant to do? 
And you can, you know, pull up The Silver Searcher and search for utils dot whatever, and find it. And that's really helpful. I think that's awesome. No qualms with that. But in my experience, using an IDE has been super helpful. I'm a big fan of JetBrains. I'm completely unaffiliated with them. But for instance, one thing that they have that's really nice is the ability to Command-click on things. This is a little recording I took just so I could show you what that looks like. You can sort of see here when you Command-click on something. And let me scroll back here just a second. So if I'm reading this get puns method, for instance, and I'm wondering, hey, this word to phonemes function, what does that do? I can hold down Command on my keyboard (I think, you know, if you have a Windows computer, you'd hold down Ctrl) and click on it, and it will take you right to the definition of that file, or sorry, of that function, which is very helpful. And then once you get to that function, you can actually do the same thing again: hold down Command, click on the name, and it will show you usages of that. And so you can see, okay, this is used here, but it's also used in this other place. This is beyond useful if you're using some sort of common methods. If you're trying to find out, like, how do people instantiate this object, you can just go to the object and see where people are using it. You could read the documentation, or you could read the code, and the code is gonna be a lot faster. As you're doing this, reminder: create chunks. As we're going through this, we need to make sure that we are able to see the forest for the trees. We do not need to care about individual trees; we do not need to care about individual lines of code. What we need to be doing is understanding word to phonemes well enough that I can, in my head, say, okay, this method takes a word and turns it into a list of phonemes. 
And that's enough for me to then use that function effectively. If you're trying to keep around all of the lines of code that you've read in your head, you are not going to have any success; it's gonna be very difficult to do this. As you're doing this, one way to help guide your thinking and sort of offload some of the work is to take notes, draw little diagrams. Again, this can be as simple as just getting a piece of paper out and drawing little stuff with a pen; that's completely legit. You don't have to, you know, break out Lucidchart and start doing UML diagrams or anything like that. You can just use a pencil and paper; it's completely fine. And I am consistently shocked at the amount of productive work that you can do when you get enough people around a whiteboard in a shared space. And that is one of the great tragedies of the pandemic, the lack of ability to talk through problems on a whiteboard. Whiteboards can be super helpful for this kind of thing as well. I personally really enjoy using a digital note-taking app for this. Even if I've drawn a little diagram, sometimes I'll take a picture of it and save it into my notes app, so that way I can have it later if I'm working on the same thing. Another thing that would be really useful as you're doing this: just in that same notes app, write out what you're working on. This is a pro tip for those people who feel like they get interrupted by meetings a lot. If you are,&lt;/p&gt;
&lt;p&gt;32:20&lt;br /&gt;
if you are, like, getting interrupted, having those notes is really helpful, because you can go back and look and say, this is what I was working on at that time. Finally, again, make sure you're asking for help if you run into trouble. And obviously try it for yourself first; spend 15 minutes trying to work through whatever problem you have. But if you don't know what it is after that, you should ask, because you're wasting your time at that point. And a tool that's really helpful for this is something called git blame. This is a screenshot of a git blame for a particular file of a particular project that I have used and liked. And what you can see here is, over on the right, we have, like, you know, source code, basically. And then on the left side, we have the information about who's been working on that particular line, which is helpful, because you can then know, hey, if I wonder about get var type, I need to go talk to romaine x. I don't know how to pronounce that name, sorry; romaine x, if you're in the audience, let me know. And finally, when you're working with libraries, these are very common, obviously. And they often have documentation that's pretty good. And I definitely recommend reading those if they're good, but sometimes they're not, and you have to use Stack Overflow. And that's completely legitimate. One of the more recent things that I have discovered is super helpful is being able to use GitHub search effectively. Because so many people use GitHub, and they allow you to search through all public code, you can learn quite a bit about how to use libraries just through that. So what that could look like: for instance, if I'm trying to set up a BigQuery job, I'm going to be using this query job configuration object. And let's say I'm reading through the starter page on, like, the GCP docs. And it gives me a good starting point; I, you know, understand where I'm going with this. 
But I need to know, are there certain, you know, parameters that are commonly used on a query job configuration? That kind of thing is really helpful to know sometimes, and GitHub search is amazing for this. I'm a huge fan of it, I highly recommend it: just go to GitHub, type in the thing, and see what comes up. Really useful. Two pro tips on this. The first one: change the sort order. This is completely anecdotal, but I have found that sometimes adjusting it to most recently indexed is, for some reason, getting better results. I don't know why; completely anecdotal. And then also choosing the language is valuable, because you'll find language-specific examples that help you sort of get a better hit rate in terms of what you're actually looking for. So this is that same search, but if I'm looking for recently indexed Java files that are using query job configuration, I can search this and see, okay, these people have some sort of free tier billing service that uses query job configuration. And now what I can do is click specifically on this line that says 250, and see, here's the, you know, query job configuration builder that they're setting up, and I can see what options they're using. And that gives me a better understanding of how to set up that builder for myself, which is beyond useful. That was a lot. I'm gonna take a deep breath, I'm gonna drink some water, everyone calm yourself, and then we'll talk a little bit more.&lt;/p&gt;
&lt;p&gt;35:41&lt;br /&gt;
Okay, there's a few takeaways. If you take away anything from this, here's the short and sweet. Make sure that when you're on a new team, working on software, you are focusing on delivering valuable software. Reading without a goal is not going to help you understand software any faster, and the best way to learn is by doing. A really helpful thing to do is to focus early on providing value to your team. That could be creating an updated architecture diagram, and that could be just, you know, getting something working soon. That's a really nice thing to be able to do for your team. This is nice not only because, like, that's what your job is, and that's what you're paid to do, but also, you're able to build goodwill when you are showing your team, hey, I want to be a valuable member of this team, I want to support everybody, and, like, I'm here for you kind of thing. And people will be much more likely to reciprocate that feeling if you are leading on that. That's a good way to do it. Finally, make sure you are decreasing cognitive load as much as you possibly can. There's a lot of really good tools to do this; anything from, you know, paper and pencil all the way through to some, you know, somewhat high-tech searching software can be really helpful. Because this job is hard; no matter what people say, it's a hard job. And it's a lot of thinking. And so decreasing your cognitive load is a huge win in that case. Thank you so much for talking. I really appreciate it. These slides are available at is.gd/sdf20. That's SDF for Southern DevFest. My name is Samuel Taylor; I've loved getting to chat with you. Feel free to talk to me on Twitter if you have questions. I hope this is relevant broadly. But if you have any specific questions about, like, data stuff, that's what I do. I'm a machine learning engineer. 
And so if you want to talk about machine learning, or AI, or whatever, like, hit me up on Twitter, send me an email; I'd love to hear from you. One request, if I can make any requests from you: this is the first time I've ever given this talk, and I want to make sure that it's as helpful as it possibly can be. And if you'd be willing to send me an email that just says one thing you liked and one thing you didn't like about this talk, I would be immensely grateful for that. Thank you.&lt;/p&gt;
&lt;p&gt;38:11&lt;br /&gt;
Okay, thank you for the amazing talk. Loved all of the information. We're gonna take a couple of minutes to answer some questions, if that's okay with you.&lt;/p&gt;
&lt;p&gt;38:20&lt;br /&gt;
I'd love that. Yeah, that sounds great. Sorry, I'm gonna be looking over here, because my camera's over here but my notes are over here, so I'll be able to see the questions over here.&lt;/p&gt;
&lt;p&gt;38:28&lt;br /&gt;
All good. All good. So the first question we have comes from Vanessa Fountain: how do you approach a situation where you bring up a new way to do something and the other team members do not want to move forward with a newer solution, due to not having familiarity with new tech?&lt;/p&gt;
&lt;p&gt;38:42&lt;br /&gt;
Oh, Vanessa, if only I knew the answer to this question.&lt;/p&gt;
&lt;p&gt;38:45&lt;br /&gt;
Um,&lt;/p&gt;
&lt;p&gt;38:46&lt;br /&gt;
so there are a few things that I would recommend here. I think it's a really difficult spot that you're put in when you're in this situation. Because I think there is a lot of validity to using sort of stable, well-established technology and not sort of chasing after the shiny new thing. But I also think there can be a lot of value in using some new tool that enables you to do something that you didn't realize you could. One thing that I have seen people do in this kind of a situation is try to build some sort of small prototype that demonstrates why this new technology might be valuable. And then that gives you a more concrete thing to talk about. Sometimes these conversations are way too abstract, and people struggle with that, or at least I struggle with it. And when we have a concrete example of, like, okay, here's what, you know, this new framework allows us to do, we can see, oh, that's actually super nice. Or we can say, actually, this isn't as big of a win as I thought it was gonna be.&lt;/p&gt;
&lt;p&gt;39:47&lt;br /&gt;
that'll work or let me scroll up and down here on the comments section to make sure that we don't have anything else. I don't believe we do. We do have plenty of positive feedback though that was great Samuel photoreceptor says that Samuel Taylor first time didn't seem like it great. Alex, door lag Great job, Samuel. So Samuel, I want to thank you so much for your time here to get this information how to us all of these tips have been amazing. Would you do me a favor and posts that very last up curious. Let me post that up there. I'll throw this up there.&lt;/p&gt;</description><author>Samuel Taylor</author><guid isPermaLink="true">https://www.samueltaylor.org/articles/how-to-join-a-new-team.html?utm_source=rss&amp;utm_campaign=st-blog-feed</guid><pubDate>Sat, 07 Nov 2020 15:27:50 GMT</pubDate></item><item><title>Univariate k-Nearest Comparison (Trustworthy Models)</title><link>https://www.samueltaylor.org/articles/uncertainty-for-data-scientists.html?utm_source=rss&amp;utm_campaign=st-blog-feed</link><description>&lt;p&gt;We often fly blind in the world of machine learning. Our model outputs an estimate for revenue from clicking on a certain ad, or the amount of time until a new edition of a book comes out, or how long it will take to drive a certain route. Typically these estimates are in the form of a single number with implicitly high confidence. Trusting these models can be foolish! Perhaps our models are high-tech con men -- Frank Abagnale reborn in the form of a deep neural net.&lt;/p&gt;
&lt;p&gt;If we want to call ourselves "Data Scientists", perhaps it is time to behave
like scientists do in other fields.&lt;/p&gt;
&lt;p&gt;I didn't study a natural science (like astronomy or biology), but I did take a
few physics classes with labs. One such lab required us to find the
acceleration due to gravity by dropping a metal ball from a variety of heights.
During the experiment we were careful to record uncertainties in our
measurements, and we propagated uncertainty through to our final estimate of the
gravitational acceleration. If I had turned in a lab report claiming g = 9.12 m/s^2
(without any uncertainty estimate), I would have lost points.&lt;/p&gt;
&lt;p&gt;Good scientific measurements come with uncertainty. A ruler or measuring tape is
only so precise. When it comes to machine learning, though, this focus on
uncertainty disappears.&lt;/p&gt;
&lt;p&gt;For example, see this common formulation of the learning problem from the book
&lt;em&gt;Learning from Data&lt;/em&gt; &lt;a href="#fn0"&gt;[0]&lt;/a&gt; (no shade -- I love this book):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;There is a target to be learned. It is unknown to us. We have examples
generated by the target. The learning algorithm uses these examples to look
for a hypothesis that approximates the target.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This unknown target function they call &lt;em&gt;f&lt;/em&gt;: &lt;em&gt;X&lt;/em&gt; -&amp;gt; &lt;em&gt;Y&lt;/em&gt; (where &lt;em&gt;X&lt;/em&gt; is the input/feature
space and &lt;em&gt;Y&lt;/em&gt; is the output/target space). The hypothesis approximating the
target they denote as &lt;em&gt;g&lt;/em&gt;: &lt;em&gt;X&lt;/em&gt; -&amp;gt; &lt;em&gt;Y&lt;/em&gt; (and, if our learning algorithm is successful,
we can say &lt;em&gt;g ≈ f&lt;/em&gt;).&lt;/p&gt;
&lt;p&gt;In the case of regression, the output of our learning algorithm is a function
which produces a continuous-valued output. But this output is a point estimate.
It has no sense of uncertainty &lt;a href="#fn1"&gt;[1]&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;Others have of course noted the importance of uncertainty estimation before me.
One such person is José Hernández-Orallo (a professor at Polytechnic University
of Valencia), whose paper &lt;a href="https://dl.acm.org/doi/abs/10.1145/2641758"&gt;&lt;em&gt;Probabilistic reframing for cost-sensitive
regression&lt;/em&gt;&lt;/a&gt; I found while
researching &lt;a href="/articles/how-to-handle-class-imbalance.html"&gt;class
imbalance&lt;/a&gt;. I would
be misrepresenting his work to claim this paper is solely about
uncertainty/reliability estimation, but he describes some neat ideas
worth exploring.&lt;/p&gt;
&lt;p&gt;Rather than finding a function &lt;em&gt;g&lt;/em&gt;: &lt;em&gt;X&lt;/em&gt; -&amp;gt; &lt;em&gt;Y&lt;/em&gt; approximating the true underlying
function &lt;em&gt;f&lt;/em&gt;, we could instead seek to find a probability density function &lt;em&gt;h(y |
x)&lt;/em&gt;. In other words, a function to which we can still pass some features (&lt;em&gt;x&lt;/em&gt;) but
one describing a distribution instead of a point estimate. Because we are
estimating a probability density function conditioned on the input features,
this idea is called conditional density estimation.&lt;/p&gt;
&lt;p&gt;To hear Hernández-Orallo tell it, many methods for conditional density
estimation are suboptimal. The mean of the distributions they output is
typically worse than a point estimate would have been. They are often slow. And
in many cases, the distributions don't end up being multi-modal anyway. Thus,
the paper asserts we can get by with a method to provide a
normal (Gaussian) density function for most cases.&lt;/p&gt;
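&lt;p&gt;To make the difference between a point estimate and a conditional density
concrete, here is a minimal sketch using only the Python standard library. The
names &lt;code&gt;point_model&lt;/code&gt; and &lt;code&gt;density_model&lt;/code&gt;, the linear rule, and the
fixed standard deviation are all made up for illustration; they do not come from
the paper.&lt;/p&gt;

```python
from statistics import NormalDist


def point_model(length):
    """Hypothetical point-estimate regressor g(x): one number per input."""
    return 20.0 * length


def density_model(length, sigma=50.0):
    """Hypothetical conditional density h(y | x), assumed to be Gaussian.

    The point estimate is reused as the mean; sigma is an invented constant.
    """
    return NormalDist(mu=point_model(length), sigma=sigma)


h = density_model(30.0)
print(h.mean)                       # the point estimate is still available
print(h.pdf(600.0))                 # density at a candidate target value
print(h.cdf(650.0) - h.cdf(550.0))  # probability mass within one sigma of the mean
```

&lt;p&gt;The caller still gets the same central guess, but can now also ask how much
probability sits in any range of outcomes.&lt;/p&gt;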
&lt;p&gt;Normal distributions are parametrized by a mean and a standard deviation. Taking
a point estimate (from any regression model) as the mean, we still need to
determine the standard deviation. The paper describes a few different approaches
for doing this (and is worth reading if you have the time). For the sake of this
post, I'll focus on a technique called "univariate k-nearest comparison". A
simple Python implementation follows:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="kn"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;univariate_knearest_comparison&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_point&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Pair each training prediction with its true target (the Weight column is specific to the fish dataset).&lt;/span&gt;
    &lt;span class="n"&gt;all_preds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;Q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;all_preds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Weight&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;prediction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_point&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Keep the k training examples whose predictions are closest to the new prediction.&lt;/span&gt;
    &lt;span class="n"&gt;neighbors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_true&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;Q&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;prediction&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)[:&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="c1"&gt;# Variance estimate: mean squared gap between the new prediction and the true values of those neighbors.&lt;/span&gt;
    &lt;span class="n"&gt;variance_estimate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prediction&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_true&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;neighbors&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;prediction&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;variance_estimate&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Hernández-Orallo describes this procedure as looking "for the closest
estimations in the training set to the estimation for example &lt;em&gt;x&lt;/em&gt;", then comparing
"their true values with the estimation for &lt;em&gt;x&lt;/em&gt;".&lt;/p&gt;
&lt;p&gt;This technique is cool because it can be applied to any regression model. By
using the training set (or a validation set) in this clever way, we can enrich
any model with the ability to estimate uncertainty (thus gaining the second
aspect of what we've &lt;a href="/articles/trustworthy-models.html"&gt;been calling trustworthy
models&lt;/a&gt;).&lt;/p&gt;
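&lt;p&gt;Because the procedure only needs predictions and true values, it can be
traced by hand on a tiny example. The sketch below is a dependency-free
variant written for illustration: the function name &lt;code&gt;uknc_variance&lt;/code&gt; and
all of the numbers are invented, and it divides by the number of neighbors
actually found rather than by &lt;code&gt;k&lt;/code&gt;.&lt;/p&gt;

```python
def uknc_variance(train_preds, train_targets, new_pred, k=3):
    """Univariate k-nearest comparison on plain lists (illustrative sketch)."""
    # Rank training examples by how close their prediction is to the new one.
    pairs = sorted(
        zip(train_preds, train_targets),
        key=lambda pair: abs(pair[0] - new_pred),
    )
    neighbors = pairs[:k]
    # Compare the true values of the neighbors with the new prediction.
    return sum((new_pred - y_true) ** 2 for _, y_true in neighbors) / len(neighbors)


# Made-up predictions and true values, purely for illustration.
train_preds = [100.0, 210.0, 305.0, 390.0, 520.0]
train_targets = [110.0, 200.0, 290.0, 410.0, 500.0]

variance = uknc_variance(train_preds, train_targets, new_pred=300.0, k=3)
print(variance)  # 7400.0
```

&lt;p&gt;The three nearest predictions to 300 are 305, 210, and 390, so the estimate
is the mean of the squared gaps to their true values: (100 + 10000 + 12100) / 3.&lt;/p&gt;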
&lt;p&gt;If you've not read previous posts in this series, we've been working with a
dataset about fish. From their dimensions, we're trying to predict their weight.
It's totally a toy/unrealistic problem, but it's pedagogically useful.&lt;/p&gt;
&lt;p&gt;We'll start by training a linear model.&lt;/p&gt;
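&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;sklearn.linear_model&lt;/span&gt; &lt;span class="kn"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;lm&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.compose&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ColumnTransformer&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.pipeline&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;make_pipeline&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.preprocessing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OneHotEncoder&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StandardScaler&lt;/span&gt;

&lt;span class="n"&gt;ct&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ColumnTransformer&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;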
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;scale&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StandardScaler&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Length1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Length2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Length3&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Height&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Width&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;ohe&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;OneHotEncoder&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Species&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;pipe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;make_pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ct&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LinearRegression&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fish&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fish&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Weight&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Then, we'll run the univariate k-nearest comparison function from above.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;new_fish&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="s2"&gt;&amp;quot;Species&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;Bream&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s2"&gt;&amp;quot;Weight&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s2"&gt;&amp;quot;Length1&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;31.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s2"&gt;&amp;quot;Length2&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;34&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s2"&gt;&amp;quot;Length3&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;39.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s2"&gt;&amp;quot;Height&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;15.1285&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s2"&gt;&amp;quot;Width&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;5.5695&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;pred&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;var&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;univariate_knearest_comparison&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fish&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;new_fish&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# (646.1153309725989, 4896.726887004621)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;With our prediction and variance estimate in hand, we can draw a normal
distribution.&lt;/p&gt;
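&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="kn"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;plt&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="kn"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;scipy.stats&lt;/span&gt; &lt;span class="kn"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;st&lt;/span&gt;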

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;plot_normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pred&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;var&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;coral&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;do_lims&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;width&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;var&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;3.8&lt;/span&gt;
    &lt;span class="n"&gt;x_min&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x_max&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pred&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pred&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;

    &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linspace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_min&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x_max&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pdf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pred&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;var&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;do_lims&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_min&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x_max&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;ylo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;yhi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylim&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;yhi&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Conditional density of fish weight given features&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;plot_normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pred&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;var&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img src="/static/img/uknc0.png"&gt;&lt;/p&gt;
&lt;p&gt;Here's a little graph with conditional density estimates for several different
fish on it.&lt;/p&gt;
&lt;p&gt;&lt;img src="/static/img/uknc1.png"&gt;&lt;/p&gt;
&lt;p&gt;This is a strong step in the direction of being more scientific in our modeling
efforts. We've examined a few methods for uncertainty estimation in this series,
and we'll evaluate the quality of these techniques at a later date.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;If you found this interesting, consider &lt;a href="https://twitter.com/SamuelDataT"&gt;following me on Twitter&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;hr /&gt;
&lt;h3&gt;Footnotes&lt;/h3&gt;
&lt;ol start="0"&gt;
&lt;li id="fn0"&gt;Abu-Mostafa, Y. S., Magdon-Ismail, M., &amp;amp; Lin, H. (2012).
&lt;em&gt;Learning from data: A short course&lt;/em&gt;. United States: AMLBook.com.&lt;/li&gt;
&lt;li id="fn1"&gt;I recognize that I'm equivocating on the word "uncertainty" to some extent.
Still, I think this is a useful idea even if only as an analogy.&lt;/li&gt;
&lt;/ol&gt;</description><author>Samuel Taylor</author><guid isPermaLink="true">https://www.samueltaylor.org/articles/uncertainty-for-data-scientists.html?utm_source=rss&amp;utm_campaign=st-blog-feed</guid><pubDate>Thu, 01 Oct 2020 11:25:55 GMT</pubDate></item><item><title>new codebase, who dis? (How to Join a Team and Learn a Codebase)</title><link>https://www.samueltaylor.org/articles/how-to-learn-a-codebase.html?utm_source=rss&amp;utm_campaign=st-blog-feed</link><description>&lt;p&gt;I have switched teams more often than I have had to implement an AVL tree, and you can guess which one of those two was taught in school. I wish someone had taught me how to join a new team! While learning a new codebase can be daunting, I've found a few things that work for me.&lt;/p&gt;
&lt;p&gt;You should do at least three things when joining a new team. The order of these three can be whatever you like, but all three should be done as soon as reasonably possible.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;First&lt;/strong&gt;, you’ll likely set up the development environment. As you do this, pay attention to just what it is that you're setting up. For instance, if you need to get Redis running locally, then that's a good hint that there's some caching happening somewhere. Noting the order in which you run internal projects helps you understand dependencies. If the feature store needs to be running before you bring up the model serving service, that's a hint that the model serving service may depend upon the feature store. Such dependencies start to hint at the overall architecture.&lt;/p&gt;
&lt;p&gt;Take notes on the exact commands you’re running and packages you’re installing. You’re bound to run into something that’s changed since the setup docs were written, and being able to correct them is a quick win you can provide to the new team. Plus, it's good to know exactly how you ruined your system installation of Python.&lt;/p&gt;
&lt;p&gt;Ideally, the code you're working on should have some sort of automated test suite in place. A good way to start experimenting with and understanding the code is to get that test suite successfully running, then make changes to the codebase completely at random and see what breaks.&lt;/p&gt;
&lt;p&gt;The &lt;strong&gt;second&lt;/strong&gt; thing you should do is get some overview of the architecture. Some teams will have a document describing this, and if that document is an accurate depiction of reality then you should certainly work to understand it. In any case, asking a more senior person on the team to give you an overview is a good idea. They should know how up-to-date that document is (if it does exist) and also be able to describe and/or draw the architecture for you. Here are some sample questions you can consider asking:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;What repositories (or portions of repositories) do we own and/or work on most frequently? What does each of them do?&lt;/li&gt;
&lt;li&gt;Where does our code run? (e.g. EC2 instances, Google Kubernetes Engine, on-prem)&lt;/li&gt;
&lt;li&gt;What does the deployment pipeline look like? How does a feature get from my laptop to live in production?&lt;/li&gt;
&lt;li&gt;Do we have certain services, packages, classes, or files that are a real headache? Areas that are particularly unreliable or error-prone?&lt;/li&gt;
&lt;li&gt;Are there external APIs, vendors, or products that we use or rely on? (e.g. SendGrid, DataDog, MySQL)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Similar to environment setup, an easy thing you can do to help the team is to document that architectural overview (or update the existing document). Write down what you learned, take a picture of the diagrams that were drawn, and post that information somewhere visible to the team. Be sure to put an "as of" date on your changes. Even stable projects exhibit some change over a long enough time period, so this date will help future readers know if they can trust this document.&lt;/p&gt;
&lt;p&gt;A &lt;strong&gt;third&lt;/strong&gt; thing you should do when starting on a new team is start understanding the business. If you're new to the company, figure out its mission, product offering(s), and goal(s). Then work to understand how your team fits into those things. Some sample questions include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;How can our team make an impact on the company's goals?&lt;/li&gt;
&lt;li&gt;If our code were to break horrifically, who would get angry? How fast would that happen?&lt;/li&gt;
&lt;li&gt;What other teams do we have the most interaction with? What services/codebases do they own? Do we share parts of our codebase with other teams?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Without an understanding of the team's place in the company, you're doomed. You won't have sufficient context to execute your work well.&lt;/p&gt;
&lt;h2&gt;Mindset&lt;/h2&gt;
&lt;p&gt;I strongly believe that learning a new codebase happens best through implementing real features (even if they are small to start with). The whole point of being on this team as an individual contributor is to build stuff, and there is no better way to learn how to do something than by spending quality time doing that exact thing. As you build skill and understanding, you can work on larger and larger projects over time.&lt;/p&gt;
&lt;p&gt;Implementing something will require you to read the code. But "read" may be a misleading word here, because reading code is dramatically different from reading a novel. Code is typically organized with more related code being closer together (in the same directory, package, class, or file). Can you imagine a novel written in this way? If Tolkien had placed all scenes of two characters fighting each other in adjacent pages, while all scenes with magic in them occurred in a separate book? How absurd!&lt;/p&gt;
&lt;p&gt;Though learning to code taught me the basics of reading code, nobody ever taught me how to read a large codebase. To do so, we must adopt a certain mindset. Balance understanding each intricate detail against making impact quickly. Quick impact helps establish your reputation on the team and gets you to that accurate/intricate understanding faster than trying to read everything up front.&lt;/p&gt;
&lt;p&gt;The rule of thumb I use is to understand something just enough to express what it does without necessarily knowing exactly how it does that. This process is called "chunking," and it relies on the fact that once you have a basic understanding of a unit of code, "you don't need to remember all the little underlying details" (Oakley). If you're worried about not understanding everything in minute detail, don't be afraid to take a note to come back and understand that chunk more fully.&lt;/p&gt;
&lt;p&gt;This understanding will grow recursively: first, you understand what the various services do. Then, you identify the particular service you need to modify and start to understand the various modules within that service. In the modules you modify, you'll start to understand the classes contained. The base case of this recursive process is the individual line.&lt;/p&gt;
&lt;p&gt;Keep in mind that different teams may implement the same concept or pattern in different ways. Understanding why your current team chose the way they did is another way new teammates can help the team. It's totally possible that your new team hasn't heard of the cool way to implement singletons that you like. It's equally possible that your way is worse in some way you didn't know. Either way, someone gets to learn something!&lt;/p&gt;
&lt;p&gt;The last mindset recommendation I'll give before we dive into the process is to try to understand the code both in terms of code paths and data flows. Think about which objects know what information and how that information flows between parts of the system.&lt;/p&gt;
&lt;h2&gt;Process&lt;/h2&gt;
&lt;p&gt;I recommend this process for working in any codebase:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Locate the portion of code most relevant to the immediate task&lt;/strong&gt; at hand.&lt;/li&gt;
&lt;li&gt;Understand that code enough to &lt;strong&gt;form a hypothesis&lt;/strong&gt; about the change you need to make.&lt;/li&gt;
&lt;li&gt;Make that change and &lt;strong&gt;test your hypothesis&lt;/strong&gt;. Sometimes the best way will be to click around in the UI or run a particular script. Sometimes the easiest path is to write a test that describes the behavior you're after.&lt;/li&gt;
&lt;li&gt;If your hypothesis was incorrect, return to step 2. Understand why that change didn't do what you thought it would, and develop a new hypothesis.&lt;/li&gt;
&lt;li&gt;Once you have working code, &lt;strong&gt;improve its quality&lt;/strong&gt;. Write a test (or a few) that documents the changes in behavior you made. Refactor your code for clarity and style.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This scientific approach guides us gradually toward correct, high-quality code without having to understand each and every bit of code around our change.&lt;/p&gt;
&lt;h2&gt;Tools&lt;/h2&gt;
&lt;p&gt;While you could certainly get by with just a text editor and some patience, a wide variety of tools exist that help us read code more effectively throughout the process identified above.&lt;/p&gt;
&lt;h3&gt;Identifying relevant code&lt;/h3&gt;
&lt;p&gt;While step one gets easier over time as we build familiarity with some portion of code, we often begin step one completely lost. A few approaches are helpful here: running the code, project search, and code search.&lt;/p&gt;
&lt;p&gt;Running the code helps you understand it. Before you start changing things, understand what already exists. This could mean reproducing a bug locally, finding the place in the UI where the new feature will go, or any number of other things. Once you can trigger the relevant behavior, stepping through the execution in a debugger will give you a strong start on understanding what is going on.&lt;/p&gt;
&lt;p&gt;By "project search," I mean searching artifacts created as part of the software development lifecycle. Particularly useful are issue trackers like JIRA/Asana/Pivotal Tracker, pull requests and issues in tools like GitHub and GitLab, and the git history itself. Because few tasks are truly novel, we can often gain understanding by looking for similar past work. Try several different keywords. Sometimes you'll find a pull request that implements something very similar to what you want to do, and you can use that as a guide. Trying to divine something from scratch, while sometimes necessary, requires significantly more effort than adapting from an example.&lt;/p&gt;
&lt;p&gt;Code search is just what it sounds like. For code that you have checked out locally, I highly recommend using a tool specifically built for recursive search like ack, Silver Searcher (ag), or ripgrep. But you won't always have every bit of code at the company checked out locally, and sometimes it's useful to be able to search exhaustively. For this use case, tools like OpenGrok or Sourcegraph are super helpful. GitHub and GitLab also offer ways to search all code within a specific organization.&lt;/p&gt;
&lt;p&gt;No matter which tool you're using, try several keywords you think might be relevant. Consider changing case sensitivity. You may have better results filtering down to specific file types.&lt;/p&gt;
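&lt;p&gt;If none of those tools are installed, the same ideas (several keywords, optional case-insensitivity, filtering by file type) can be sketched in a few lines of Python. This is a rough stand-in rather than a replacement for something like ripgrep; the function name &lt;code&gt;search_code&lt;/code&gt; and its parameters are made up for illustration.&lt;/p&gt;

```python
import re
from pathlib import Path


def search_code(root, keywords, extensions=(".py",), ignore_case=True):
    """Return (path, line_number, line) for every line matching any keyword."""
    flags = re.IGNORECASE if ignore_case else 0
    # Treat each keyword as literal text and match any of them.
    pattern = re.compile("|".join(re.escape(k) for k in keywords), flags)
    hits = []
    for path in sorted(Path(root).rglob("*")):
        # Filter down to the file types we care about.
        if not path.is_file() or path.suffix not in extensions:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for number, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                hits.append((str(path), number, line.strip()))
    return hits
```

&lt;p&gt;Calling something like &lt;code&gt;search_code(".", ["feature_store", "FeatureStore"])&lt;/code&gt; and then narrowing or widening &lt;code&gt;extensions&lt;/code&gt; mirrors the advice above: try a few spellings, then filter.&lt;/p&gt;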
&lt;h3&gt;Understanding code&lt;/h3&gt;
&lt;p&gt;Using these various search tools, we arrive at a set of relevant locations. That brings us to step two of our process: understanding the code just well enough to form a hypothesis about the necessary change. The search tools we've already discussed are helpful to this end (if you come across usage of an unfamiliar class, search for it and read what you find).&lt;/p&gt;
&lt;p&gt;One other tool that is incredibly useful is a good IDE. I like JetBrains' products (I have no affiliation with them), though I'm sure similar functionality exists in competing products. JetBrains IDEs can help you navigate code much more efficiently by linking you straight through to the definition of a function or class. By default on Macs, hold down Cmd and hover over the function or class name, then click. Being able to immediately jump to the definition is a complete game changer.&lt;/p&gt;
&lt;p&gt;Another super-useful JetBrains keyboard shortcut is (by default) tapping shift twice. This brings up a search bar that can find just about anything (classes, functions, file names).&lt;/p&gt;
&lt;p&gt;As you read code, always try to decrease your cognitive load. Remember to create "chunks", mental boxes inside of which you don't need to remember all the details. Consider taking notes, writing down file names and line numbers, drawing little diagrams. Reading and writing code is the most cognitively demanding part of the job, so take any chance you can get to make it easier for yourself.&lt;/p&gt;
&lt;p&gt;You may get stuck or lost during this process. It is OK to ask for help. Use &lt;code&gt;git blame&lt;/code&gt; to see who has been working on some bit of code you find confusing, and ask them about it. You can also use &lt;code&gt;git blame&lt;/code&gt; to find relevant pull requests or JIRA tickets that might help you gain context.&lt;/p&gt;
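&lt;p&gt;For a long file, it can help to tally who wrote the most lines before deciding whom to ask. The sketch below parses the output of &lt;code&gt;git blame --line-porcelain&lt;/code&gt;, where each source line is preceded by metadata including an &lt;code&gt;author&lt;/code&gt; line. The helper names &lt;code&gt;count_authors&lt;/code&gt; and &lt;code&gt;blame_authors&lt;/code&gt; are hypothetical, and the second one assumes it is run from inside a git repository.&lt;/p&gt;

```python
import subprocess
from collections import Counter


def count_authors(porcelain_text):
    """Tally lines per author from `git blame --line-porcelain` output."""
    authors = Counter()
    for line in porcelain_text.splitlines():
        # Metadata lines look like "author Jane Doe"; source lines start with a tab.
        if line.startswith("author "):
            authors[line[len("author "):]] += 1
    return authors


def blame_authors(path):
    """Run git blame on a file in the current repository and tally its authors."""
    result = subprocess.run(
        ["git", "blame", "--line-porcelain", path],
        capture_output=True, text=True, check=True,
    )
    return count_authors(result.stdout)
```

&lt;p&gt;Something like &lt;code&gt;blame_authors("helpers.py").most_common(3)&lt;/code&gt; then gives a short list of people who probably have context, though the most recent author is not always the person who understands the code best.&lt;/p&gt;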
&lt;h3&gt;Working with libraries&lt;/h3&gt;
&lt;p&gt;Sometimes as part of step three, we will need to work with an external library. In an ideal world, all libraries have excellent documentation that helps you understand the key abstractions and be productive quickly. Alas, we do not live in an ideal world! Many projects do have good documentation. But others may be more easily learned through the broader community. Consider searching the web with a tool like DuckDuckGo or Google. See if there are examples on Stack Overflow.&lt;/p&gt;
&lt;p&gt;A recent lightbulb moment for me was realizing that GitHub allows users to search all public code. Consequently, we can find realistic examples of people using libraries and APIs that we care about. Try searching for the particular method name you're trying to use. Or search for the name of the package, then search within individual repositories that come up. Consider filtering to just the language you care about.&lt;/p&gt;
&lt;p&gt;Anecdotally I have found that sorting GitHub search by "recently indexed" gives me more diverse, more helpful results than the default search (which largely gives me the same copy-pasted examples over and over again). If you're unhappy with your results, do try different sort orders.&lt;/p&gt;
&lt;h2&gt;Parting words&lt;/h2&gt;
&lt;p&gt;Not only do we learn faster when we orient that learning around real tickets, but we simultaneously make an impact and start building reputation on the team. By taking advantage of prior work (and using good tools to find that work) we can accelerate our learning and our impact. Know that while joining a new team is non-trivial, it doesn't have to be hard! Use the scientific method. Follow these practices. Take a look at these tools. You'll gain confidence in your abilities and make a good first impression while you're at it.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;If you found this interesting, consider &lt;a href="https://twitter.com/SamuelDataT"&gt;following me on Twitter&lt;/a&gt;. Thanks to my friend Benjamin Cody for providing feedback on this post.&lt;/em&gt;&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;&lt;em&gt;Update 2021-01-16:&lt;/em&gt; @Coding_Career on Twitter made an awesome "cheat
sheet" from this post &lt;a href="https://twitter.com/Coding_Career/status/1350445944395821056"&gt;available here&lt;/a&gt;.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;Citations:&lt;/p&gt;
&lt;p&gt;Oakley, Barbara A. &lt;em&gt;A Mind for Numbers: How to Excel at Math and Science (Even If You Flunked Algebra)&lt;/em&gt;. Jeremy P. Tarcher/Penguin, 2014.&lt;/p&gt;</description><author>Samuel Taylor</author><guid isPermaLink="true">https://www.samueltaylor.org/articles/how-to-learn-a-codebase.html?utm_source=rss&amp;utm_campaign=st-blog-feed</guid><pubDate>Sun, 06 Sep 2020 21:06:42 GMT</pubDate></item><item><title>Lightweight testing for maintainable data science</title><link>https://www.samueltaylor.org/articles/lightweight-testing-for-data-scientists.html?utm_source=rss&amp;utm_campaign=st-blog-feed</link><description>&lt;p&gt;When I began working in analytics, one of the most miserable types of tasks I ended up doing was re-running an old Jupyter notebook. Often it failed part way through with some inscrutable error. Figuring out what was going on was challenging; how am I supposed to remember this particular notebook from five months ago? What's more, the underlying data sometimes stops getting updated, or a column name changes, or the date format in a particular field switches. You may have had similarly frustrating experiences. The good news is that simple techniques from the field of software engineering can dramatically improve this experience.&lt;/p&gt;
&lt;p&gt;As you may have guessed from the title of this article, I'm a big fan of testing. It's easier than you realize, and it'll save you a ton of headaches. For our purposes today, let's consider a machine learning project that consists of three phases: first, exploratory data analysis and prototyping. Second, model training. And third, running in production. All three of these phases can benefit from testing.&lt;/p&gt;
&lt;h2&gt;One: EDA and prototyping&lt;/h2&gt;
&lt;p&gt;When exploring the data, we learn a significant amount of information. Here are some examples of questions we might answer:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;How many columns are there in this dataset?&lt;/li&gt;
&lt;li&gt;What are the names of each column?&lt;/li&gt;
&lt;li&gt;What data type does each column contain?&lt;/li&gt;
&lt;li&gt;For string columns, how many unique values exist?&lt;/li&gt;
&lt;li&gt;For numeric columns, what range does the data fall into?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Too often, we keep the answers to these questions in our head alone. This fact is part of what makes it difficult to go back to an old notebook; these answers have fallen out of our short- and long-term memory by the time we return to the notebook. Fortunately for us, computers have excellent memories! We could, of course, write down each of the answers to these questions directly in our Jupyter notebook, which will help us when we return to it. Still better, though, is expressing the answers to these questions as executable code -- as tests.&lt;/p&gt;
&lt;p&gt;When doing initial analysis, I find it cumbersome to even think about running a testing framework inside my notebook. Fortunately, we can get by without one: Python includes the &lt;code&gt;assert&lt;/code&gt; keyword, which will do just fine. For example, we might encode the knowledge that our DataFrame should have 8 columns thusly:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;assert df.shape[1] == 8&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;This is an improvement over a comment or markdown cell that simply states "DataFrame should have 8 columns" because the computer will actually check this for us each time the notebook is run. And if that condition is not met, we will see an error:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
&amp;lt;ipython-input-13-ed79b70114d8&amp;gt; in &amp;lt;module&amp;gt;
----&amp;gt; 1 assert df.shape[1] == 8

AssertionError:
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this case, this may be an acceptable error. We can read the condition that was asserted and back into the conclusion that our DataFrame should have eight columns. But if we're feeling quite charitable toward our future self, we can add a message:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;assert df.shape[1] == 8, "Expected 8 columns"&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;which, assuming the condition is not true, will result in this error:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
&amp;lt;ipython-input-14-18deb3201a98&amp;gt; in &amp;lt;module&amp;gt;
----&amp;gt; 1 assert df.shape[1] == 8, "Expected 8 columns"

AssertionError: Expected 8 columns
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Writing an &lt;code&gt;assert&lt;/code&gt; statement is a cheap insurance policy against unexpected changes. I highly recommend making assertions about the shape of your dataset, the sparsity of certain columns (&lt;code&gt;assert df['a'].notnull().mean() &amp;gt; 0.9&lt;/code&gt;), the existence of particularly important columns (&lt;code&gt;assert 'age' in df&lt;/code&gt;), and the range of numeric columns (&lt;code&gt;assert (df['age'] &amp;lt; 0).sum() == 0&lt;/code&gt;). As a general rule, if you're making an assumption in your code, consider whether you can express that assumption as an assert statement.&lt;/p&gt;
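&lt;p&gt;One way to keep these assumptions together is to collect them in a single function that runs near the top of the notebook. The sketch below is an illustration only: it works on plain lists of dictionaries (the shape &lt;code&gt;csv.DictReader&lt;/code&gt; produces) rather than a DataFrame so that it runs without pandas, and the function name, column names, and thresholds are all hypothetical.&lt;/p&gt;

```python
def check_assumptions(rows, expected_columns, required="age", min_filled=0.9):
    """Raise AssertionError if the rows violate what we learned during EDA."""
    assert rows, "Expected at least one row"
    # Shape: every row has exactly the expected columns.
    for row in rows:
        assert set(row) == set(expected_columns), f"Unexpected columns: {sorted(row)}"
    # Sparsity: the required column is mostly filled in.
    filled = [row[required] for row in rows if row[required] not in ("", None)]
    assert len(filled) / len(rows) >= min_filled, f"Too many missing values in {required}"
    # Range: no negative values in the required numeric column.
    assert all(float(value) >= 0 for value in filled), f"Negative value in {required}"
```

&lt;p&gt;Each assertion carries a message, so a failure months later says which assumption stopped holding.&lt;/p&gt;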
&lt;h2&gt;Two: training script&lt;/h2&gt;
&lt;p&gt;A common pattern I've seen in machine learning work is to take a Jupyter notebook that contains code to train a model and turn it into a Python script (which is more easily run/monitored in certain environments). To do this, I recommend taking chunks of the notebook which do a discrete unit of work and turning them into standalone functions that the notebook then uses. Specifically, create a &lt;code&gt;.py&lt;/code&gt; script in the same directory as the notebook (say, &lt;code&gt;helpers.py&lt;/code&gt;), define a new function, and copy the code from the notebook into that function. Then, back in the notebook, import the function (for example, &lt;code&gt;from helpers import age_range_to_midpoint&lt;/code&gt;), delete the original code you copied from, and call the function instead.&lt;/p&gt;
&lt;p&gt;As an example, suppose our data encodes age as a range ("0-25", "25-40", "40-100"), and we have decided that we want to represent this to our model with the midpoint of the range. Our &lt;code&gt;helpers.py&lt;/code&gt; script might contain the following:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;age_range_to_midpoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;age_range&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;endpoints&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;age_range&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;-&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;endpoints&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;endpoints&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;At this point, I believe it's worth it to use a testing framework. Python has one built in (&lt;code&gt;unittest&lt;/code&gt;), but I love using &lt;a href="https://docs.pytest.org/en/stable/"&gt;pytest&lt;/a&gt;. As we create functions, we can add tests by defining a function (or functions) whose name(s) begin with "test_":&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_age_range&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;age_range_to_midpoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;20-30&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;25&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;age_range_to_midpoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;0-31&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mf"&gt;15.5&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Just like the asserts we created during EDA encode information about our data, these tests encode information about how our code works. By the end of this process, we have a nice file of functions and a notebook which largely runs those functions in a certain order. Turning this notebook into a Python script is now simple, as the complex logic is already present in our helper file.&lt;/p&gt;
&lt;p&gt;We can run our tests with a simple command:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% pytest helpers.py
================================== test session starts ===================================
platform darwin -- Python 3.8.5, pytest-6.0.1, py-1.9.0, pluggy-0.13.1
rootdir: /my/cool/project
collected 1 item

helpers.py .                                                                       [100%]

=================================== 1 passed in 0.00s ====================================
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These tests enable us to make changes to our code more confidently. We can run these tests ourselves after changes we've made to make sure we haven't broken anything. Ideally, we can set up some sort of automated process that runs these tests as commits are made (both GitLab and GitHub offer tools that do this).&lt;/p&gt;
&lt;p&gt;Further, these tests serve as executable documentation. While it is easy for comments to go stale, tests remain an accurate description of what a function does. If I introduce a change to the way a function works, I must also edit the tests (or else they will fail, and I will be sad). In this way, tests are a far more reliable and accurate kind of documentation than comments.&lt;/p&gt;
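&lt;p&gt;As a small illustration of tests as documentation, the sketch below repeats &lt;code&gt;age_range_to_midpoint&lt;/code&gt; from above and adds a test that records two behaviors a reader might otherwise have to work out from the code: whitespace around the endpoints is ignored, and a value with no dash comes back as itself. The test name and the inputs are made up for illustration.&lt;/p&gt;

```python
def age_range_to_midpoint(age_range):
    endpoints = [e.strip() for e in age_range.split('-')]
    return sum(int(e) for e in endpoints) / len(endpoints)


def test_age_range_edge_cases():
    # Whitespace around each endpoint is stripped before conversion.
    assert age_range_to_midpoint(' 20 - 30 ') == 25
    # A value with no dash has one endpoint, so the midpoint is the value itself.
    assert age_range_to_midpoint('40') == 40
```

&lt;p&gt;If someone later decides that a bare number should be rejected instead, this test will fail and force that decision to be written down.&lt;/p&gt;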
&lt;h2&gt;Three: production&lt;/h2&gt;
&lt;p&gt;While a thorough treatment of putting a model in production is outside the scope of this article, testing is certainly a part of it. In his book &lt;em&gt;Building Machine Learning Powered Applications&lt;/em&gt;, Emmanuel Ameisen coins the term "check" to describe a test that runs in the production prediction pipeline (rather than in a CI/CD pipeline). The same kinds of common sense &lt;code&gt;assert&lt;/code&gt; statements you wrote in your Jupyter notebook are also helpful sanity checks in a prediction pipeline.&lt;/p&gt;
&lt;p&gt;You should write checks for both inputs and outputs of your model. Is someone passing in a negative value for the age of a human being? Is our model predicting that a car will have a fuel efficiency of over 9,000 miles per gallon? Both of these cases seem unexpected! Depending on the business requirements, we may take a variety of actions. For instance, if our model is predicting a huge value for miles per gallon, we might refuse to make a prediction:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;y_pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="n"&gt;PredictionError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Problem predicting mpg for this car&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;In other cases, we may be able to use a heuristic:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Sometimes, we may be able to swap in a simpler model if it's available and more robust. Or we can replace nonsensical feature values with nulls, or impute a value. There are a lot of options here, and you should be careful about choosing the right one for your use case. A well-written check prevents a certain class of bug from becoming an issue, thereby improving the robustness of the system overall.&lt;/p&gt;
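&lt;p&gt;Putting the pieces together, here is a hedged sketch of how input and output checks might wrap a model. Everything named here is hypothetical: &lt;code&gt;PredictionError&lt;/code&gt; is defined locally, &lt;code&gt;predict_mpg&lt;/code&gt; stands in for any callable that returns a single number, and the 0 to 100 bounds and the default feature values are placeholders for whatever the business requirements dictate.&lt;/p&gt;

```python
class PredictionError(Exception):
    """Raised when a prediction fails a sanity check."""


def clean_features(features, defaults):
    """Input check: replace negative values with None, then impute a default."""
    cleaned = {}
    for name, value in features.items():
        if value is not None and 0 > value:
            value = None  # treat nonsensical values as missing
        cleaned[name] = defaults[name] if value is None else value
    return cleaned


def checked_predict(predict_mpg, features, defaults, fallback=None):
    """Output check: reject (or replace) predictions outside 0 to 100 mpg."""
    y_pred = predict_mpg(clean_features(features, defaults))
    if 0 > y_pred or y_pred > 100:
        if fallback is None:
            raise PredictionError('Problem predicting mpg for this car')
        return fallback  # heuristic value, like the 30 in the example above
    return y_pred
```

&lt;p&gt;Passing &lt;code&gt;fallback=30&lt;/code&gt; gives the heuristic behavior, while leaving it unset refuses to predict. Keeping that choice in one argument makes it easy to revisit when requirements change.&lt;/p&gt;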
&lt;h2&gt;Go forth and test&lt;/h2&gt;
&lt;p&gt;Keep in mind how you can introduce testing throughout your process. Whether it's a quick &lt;code&gt;assert&lt;/code&gt; statement in a Jupyter notebook, a unit test in a Python script, or a check that runs in production, well-written tests are a gift to your future self and your team. Tests make code less error prone, easier to debug, and less vulnerable to decay.&lt;/p&gt;</description><author>Samuel Taylor</author><guid isPermaLink="true">https://www.samueltaylor.org/articles/lightweight-testing-for-data-scientists.html?utm_source=rss&amp;utm_campaign=st-blog-feed</guid><pubDate>Thu, 20 Aug 2020 00:45:51 GMT</pubDate></item><item><title>APChemSolutions review - do not buy</title><link>https://www.samueltaylor.org/articles/apchemsolutions-review.html?utm_source=rss&amp;utm_campaign=st-blog-feed</link><description>&lt;p&gt;&lt;em&gt;Author's note: this is not what I usually write about. If you're an educator
considering purchasing a product from apchemsolutions.com, please read on!
Otherwise, don't worry about it.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;Executive summary&lt;/h2&gt;
&lt;p&gt;An educator I know recently contacted me for some IT help with a product they
had purchased from apchemsolutions.com (AKA AP Chem Solutions). In helping this
person, I was appalled by the level of contempt the creator(s) of this product
have for their users. &lt;strong&gt;I strongly recommend against purchasing anything from AP
Chem Solutions&lt;/strong&gt; (which uses the domain apchemsolutions.com).&lt;/p&gt;
&lt;h2&gt;More detail&lt;/h2&gt;
&lt;p&gt;I completely understand that companies have a right to protect their
intellectual property. However, AP Chem Solutions chooses to use DRM (digital
rights management) software which harms its customers. Here are a few reasons I
think their product is bad:&lt;/p&gt;
&lt;h3&gt;1. Jumping through hoops&lt;/h3&gt;
&lt;p&gt;The educator who reached out to me received an email with instructions on how to
open the PDF files in the provided ZIP file. First, the user must have Adobe
Reader installed. Next comes a set of six to seven steps describing how
to install a certificate on the computer. After that, a further five steps are
needed to get Adobe Reader set up to actually read the PDFs.&lt;/p&gt;
&lt;p&gt;I have a degree in computer science, and I found these steps to be frustrating
and annoying. It is no wonder that this person needed help! While I have immense
respect for teachers, the number of hoops this company expects them to jump
through is beyond anything I would expect a teacher at any level to accomplish
on their own.&lt;/p&gt;
&lt;h3&gt;2. Administrative access&lt;/h3&gt;
&lt;p&gt;Which brings us to the next point: admin access. To import this certificate in
the first place, the customer is expected to enter the admin password for the
computer. This might be all fine and dandy, except that it is rare for teachers
to have admin access to the computers provided them by the school.&lt;/p&gt;
&lt;p&gt;You know how much of a pain it is to interact with your IT department? Well, if
you want to use this product, you're going to have to bug IT. And they might not
want to help you with this. It's a very real possibility that your request will
get denied and you will be completely unable to use the product you purchased
from AP Chem Solutions.&lt;/p&gt;
&lt;h3&gt;3. Small annoyances&lt;/h3&gt;
&lt;p&gt;All of my other concerns are more important than any of the minor things I'm
listing here, but I wanted to list them:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Instructions are at times unclear or poorly written. I can easily see someone
  getting confused trying to follow them.&lt;/li&gt;
&lt;li&gt;Instructions are at times factually inaccurate (e.g. you don't need to set
  your default PDF reader to Adobe Reader)&lt;/li&gt;
&lt;li&gt;You cannot use these materials on operating systems other than Windows or
  macOS (because Adobe Reader isn't available on other platforms)&lt;/li&gt;
&lt;li&gt;They require you to disable some security features of Adobe Reader. I am not a
  cybersecurity buff, but I imagine this exposes your computer to additional
  risk of viruses and/or malware.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;4. Printing restrictions&lt;/h3&gt;
&lt;p&gt;Finally, the reason this company has had you jump through all these hoops in the
first place: to keep you from printing or redistributing certain portions of
the materials they have provided you. Would you like to print a copy of the
slides for your students or a sub? You can't!&lt;/p&gt;
&lt;p&gt;And forget about sharing the PDFs digitally; anyone you send them to will have
to go through the same setup process you did. Good luck explaining that to them.&lt;/p&gt;
&lt;h2&gt;An alternative&lt;/h2&gt;
&lt;p&gt;Rather than use restrictive, draconian DRM software, AP Chem Solutions should
put its customers first. The company should provide its customers what they
believe they have purchased: access to materials that will help them make
students more successful free from restrictive DRM. Teachers, who are already
overburdened, &lt;em&gt;should not need a four-year degree in computers just to
use a product they have purchased&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;To reiterate: &lt;strong&gt;please do not buy anything from AP Chem Solutions&lt;/strong&gt;, which
operates on the domain name apchemsolutions.com. It appears they care more about
lining their pockets than they do about your ability to actually use the product
you purchased.&lt;/p&gt;</description><author>Samuel Taylor</author><guid isPermaLink="true">https://www.samueltaylor.org/articles/apchemsolutions-review.html?utm_source=rss&amp;utm_campaign=st-blog-feed</guid><pubDate>Thu, 13 Aug 2020 02:15:31 GMT</pubDate></item><item><title>Model-Agnostic Uncertainty Estimates through Bootstrapping</title><link>https://www.samueltaylor.org/articles/uncertainty-with-bootstrap.html?utm_source=rss&amp;utm_campaign=st-blog-feed</link><description>&lt;p&gt;A key element of a &lt;a href="trustworthy-models.html"&gt;trustworthy model&lt;/a&gt; is that it can
give an estimate of its confidence in a given prediction. We've already talked
about one way to do this for &lt;a href="trustworthy-linear-models.html"&gt;linear models&lt;/a&gt;,
and today we'll talk about a technique for getting uncertainty estimates for any
model.&lt;/p&gt;
&lt;p&gt;Let's continue using the &lt;a href="https://www.kaggle.com/aungpyaeap/fish-market"&gt;fish dataset&lt;/a&gt;
from last time:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="kn"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;pd&lt;/span&gt;

&lt;span class="n"&gt;fish&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;expanduser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;~/Downloads/Fish.csv&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We build a &lt;code&gt;ColumnTransformer&lt;/code&gt; for convenience:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.compose&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ColumnTransformer&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.preprocessing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StandardScaler&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;OneHotEncoder&lt;/span&gt;

&lt;span class="n"&gt;ct&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ColumnTransformer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;scale&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StandardScaler&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Length1&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;Length2&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;Length3&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;Height&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;Width&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;ohe&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;OneHotEncoder&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Species&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Next we construct a pipeline which uses the &lt;code&gt;ColumnTransformer&lt;/code&gt; from above as
well as &lt;code&gt;scikit-learn&lt;/code&gt;'s implementation of bagging. Specifically, our
&lt;code&gt;BaggingRegressor&lt;/code&gt; will consist of 100 &lt;code&gt;ElasticNetCV&lt;/code&gt; models, each one trained on a
random 25% sample of the dataset (drawn with replacement).&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.ensemble&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaggingRegressor&lt;/span&gt;
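&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;sklearn.linear_model&lt;/span&gt; &lt;span class="kn"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;lm&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.pipeline&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;make_pipeline&lt;/span&gt;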

&lt;span class="n"&gt;pipe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;make_pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;ct&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BaggingRegressor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ElasticNetCV&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;n_estimators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_samples&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_jobs&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fish&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fish&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Weight&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
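&lt;p&gt;For intuition about what bagging does under the hood, here is a minimal pure-Python sketch. It is not scikit-learn's implementation: the toy data, the &lt;code&gt;fit_line&lt;/code&gt; helper, and the one-feature least-squares model are assumptions made for illustration. Each model is fit on a random 25% of the rows drawn with replacement, so the models disagree slightly, and that disagreement is what we read as uncertainty.&lt;/p&gt;

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for one feature; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return slope, mean_y - slope * mean_x


rng = random.Random(42)

# Toy data (made up): weight grows roughly linearly with length, plus noise.
lengths = [20.0 + 0.5 * i for i in range(40)]
weights = [30.0 * length - 400.0 + rng.gauss(0, 25) for length in lengths]

n_models = 100
sample_size = len(lengths) // 4  # 25% of the rows per model

models = []
for _ in range(n_models):
    # Sample row indices with replacement: this is the bootstrap step.
    idx = [rng.randrange(len(lengths)) for _ in range(sample_size)]
    models.append(fit_line([lengths[i] for i in idx], [weights[i] for i in idx]))

# One prediction per model for a new fish of length 30.
predictions = [slope * 30.0 + intercept for slope, intercept in models]
print(min(predictions), max(predictions))
```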

&lt;p&gt;Finally, we can snag those 100 models and make a prediction for a new fish:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="kn"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;plt&lt;/span&gt;

&lt;span class="n"&gt;new_fish&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="s2"&gt;&amp;quot;Species&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;Bream&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s2"&gt;&amp;quot;Weight&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s2"&gt;&amp;quot;Length1&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;31.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s2"&gt;&amp;quot;Length2&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;34&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s2"&gt;&amp;quot;Length3&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;39.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s2"&gt;&amp;quot;Height&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;15.1285&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s2"&gt;&amp;quot;Width&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;5.5695&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

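&lt;span class="c1"&gt;# The bagged models are stored on the last pipeline step and expect transformed features&lt;/span&gt;
&lt;span class="n"&gt;estimators&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;estimators_&lt;/span&gt;
&lt;span class="n"&gt;features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_fish&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;predictions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;estimators&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;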

&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;savefig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;twm1_hist.png&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bbox_inches&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;tight&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Which gives us a nifty histogram of expected weight:&lt;/p&gt;
&lt;p&gt;&lt;img src="/static/img/twm1_hist.png"&gt;&lt;/p&gt;
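&lt;p&gt;One way to turn a histogram like this into a single statement of uncertainty is to report percentiles of the ensemble's predictions. The sketch below is an assumption about how one might summarize them, not something the code above does. The &lt;code&gt;predictions&lt;/code&gt; list is stand-in data, and the resulting range describes disagreement between the bagged models rather than a calibrated prediction interval.&lt;/p&gt;

```python
import statistics

# Stand-in values; in practice this would be the list of per-model predictions.
predictions = [640.0, 655.5, 648.2, 661.0, 652.3, 645.1, 658.7, 650.0, 643.9, 656.4]

# quantiles(n=20) returns 19 cut points at 5%, 10%, ..., 95%.
cuts = statistics.quantiles(predictions, n=20, method="inclusive")
lower, upper = cuts[0], cuts[-1]

print(f"median: {statistics.median(predictions):.1f}")
print(f"middle 90% of predictions: {lower:.1f} to {upper:.1f}")
```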
&lt;p&gt;The cool thing about this approach, though, is that we can swap in any model
within the &lt;code&gt;BaggingRegressor&lt;/code&gt;, and the rest of the code is unaffected. For
instance, here's the distribution of predictions when using decision trees:&lt;/p&gt;
&lt;p&gt;&lt;img src="/static/img/twm1_hist1.png"&gt;&lt;/p&gt;
&lt;p&gt;Interesting idea, right? There are still a few more approaches I want to highlight
in coming posts, but after that I'll be comparing them all to see which
uncertainty estimation technique is best.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;Comments? Questions? Concerns? Please tweet me
&lt;a href="https://twitter.com/SamuelDataT"&gt;@SamuelDataT&lt;/a&gt; or email me. Thanks!&lt;/p&gt;</description><author>Samuel Taylor</author><guid isPermaLink="true">https://www.samueltaylor.org/articles/uncertainty-with-bootstrap.html?utm_source=rss&amp;utm_campaign=st-blog-feed</guid><pubDate>Fri, 17 Jul 2020 00:49:51 GMT</pubDate></item><item><title>Trustworthy Models in Practice: a Simple Linear Approach</title><link>https://www.samueltaylor.org/articles/trustworthy-linear-models.html?utm_source=rss&amp;utm_campaign=st-blog-feed</link><description>&lt;p&gt;Last time, we began to talk about how to build models worthy of our users'
trust. As a refresher, we said that trustworthy models require at least three
things:&lt;/p&gt;
&lt;!-- TODO add CSS for blockquotes --&gt;

&lt;blockquote&gt;
&lt;ol&gt;
&lt;li&gt;Prediction -- An estimate for some unknown value&lt;/li&gt;
&lt;li&gt;Confidence -- A description of how uncertain the model is about the prediction&lt;/li&gt;
&lt;li&gt;Explanation -- The reasoning for which a model made its prediction&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;
&lt;p&gt;Today, we'll take a pass at actually implementing such a model.&lt;/p&gt;
&lt;h2&gt;Dataset&lt;/h2&gt;
&lt;p&gt;For pedagogical reasons, we're using a &lt;a href="https://www.kaggle.com/aungpyaeap/fish-market"&gt;dataset on
fish&lt;/a&gt; that were sold at a fish
market. Here are a few rows from the dataset:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;| Species | Weight | Length1 | Length2 | Length3 | Height  | Width  |
|---------|--------|---------|---------|---------|---------|--------|
| Perch   | 250.0  | 25.9    | 28.0    | 29.4    | 7.8204  | 4.2042 |
| Bream   | 714.0  | 32.7    | 36.0    | 41.5    | 16.517  | 5.8515 |
| Perch   | 145.0  | 22.0    | 24.0    | 25.5    | 6.375   | 3.825  |
| Perch   | 145.0  | 20.7    | 22.7    | 24.2    | 5.9532  | 3.63   |
| Bream   | 975.0  | 37.4    | 41.0    | 45.9    | 18.6354 | 6.7473 |
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The first step, of course, is to load it up!&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="kn"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;pd&lt;/span&gt;

&lt;span class="n"&gt;fish&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;expanduser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;~/Downloads/Fish.csv&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;h2&gt;Building a model&lt;/h2&gt;
&lt;p&gt;For our exercise today, let's see if we can predict &lt;code&gt;Weight&lt;/code&gt; given the values of
the other columns. We're going to use &lt;code&gt;statsmodels&lt;/code&gt; to build a simple linear
model.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;statsmodels.formula.api&lt;/span&gt; &lt;span class="kn"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;smf&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;smf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ols&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;formula&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Weight ~ C(Species) + Length1 + Length2 + Length3 + Height + Width&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;fish&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;If you've never used &lt;code&gt;statsmodels&lt;/code&gt; before, think of this as fitting a linear
model, with &lt;code&gt;Species&lt;/code&gt; being one-hot encoded (minus one reference level, here
Bream, which is absorbed into the intercept). &lt;code&gt;statsmodels&lt;/code&gt; has a nice way of
getting basic information about the model:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;pre&gt;&lt;code&gt;                            OLS Regression Results
==============================================================================
Dep. Variable:                 Weight   R-squared:                       0.936
Model:                            OLS   Adj. R-squared:                  0.931
Method:                 Least Squares   F-statistic:                     195.7
Date:                Sun, 14 Jun 2020   Prob (F-statistic):           6.85e-82
Time:                        15:00:23   Log-Likelihood:                -941.46
No. Observations:                 159   AIC:                             1907.
Df Residuals:                     147   BIC:                             1944.
Df Model:                          11
Covariance Type:            nonrobust
===========================================================================================
                              coef    std err          t      P&amp;gt;|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------
Intercept                -918.3321    127.083     -7.226      0.000   -1169.478    -667.186
C(Species)[T.Parkki]      164.7227     75.699      2.176      0.031      15.123     314.322
C(Species)[T.Perch]       137.9489    120.314      1.147      0.253     -99.819     375.717
C(Species)[T.Pike]       -208.4294    135.306     -1.540      0.126    -475.826      58.968
C(Species)[T.Roach]       103.0400     91.308      1.128      0.261     -77.407     283.487
C(Species)[T.Smelt]       446.0733    119.430      3.735      0.000     210.051     682.095
C(Species)[T.Whitefish]    93.8742     96.658      0.971      0.333     -97.145     284.893
Length1                   -80.3030     36.279     -2.214      0.028    -151.998      -8.608
Length2                    79.8886     45.718      1.747      0.083     -10.461     170.238
Length3                    32.5354     29.300      1.110      0.269     -25.369      90.439
Height                      5.2510     13.056      0.402      0.688     -20.551      31.053
Width                      -0.5154     23.913     -0.022      0.983     -47.773      46.742
==============================================================================
Omnibus:                       43.558   Durbin-Watson:                   0.973
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               97.422
Skew:                           1.184   Prob(JB):                     7.00e-22
Kurtosis:                       6.016   Cond. No.                     2.03e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.03e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;At this point, we can achieve our first objective: to provide a prediction!&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;new_fish&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="s2"&gt;&amp;quot;Species&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;Bream&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s2"&gt;&amp;quot;Weight&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s2"&gt;&amp;quot;Length1&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;31.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s2"&gt;&amp;quot;Length2&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;34&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s2"&gt;&amp;quot;Length3&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;39.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s2"&gt;&amp;quot;Height&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;15.1285&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s2"&gt;&amp;quot;Width&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;5.5695&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_fish&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This model predicts this fish weighs about 646 grams.&lt;/p&gt;
&lt;h2&gt;Providing uncertainty&lt;/h2&gt;
&lt;p&gt;The main reason I've chosen to use statsmodels (rather than scikit-learn) is
that it provides built-in support for &lt;a href="https://en.wikipedia.org/wiki/Prediction_interval"&gt;prediction
intervals&lt;/a&gt;. Take a look:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;frame&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_prediction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_fish&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;summary_frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;frame&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;pre&gt;&lt;code&gt;| mean   | mean_se | mean_ci_lower | mean_ci_upper | obs_ci_lower | obs_ci_upper |
|--------|---------|---------------|---------------|--------------|--------------|
| 646.12 | 18.32   | 644.96        | 647.27        | 640.11       | 652.12      |
&lt;/code&gt;&lt;/pre&gt;
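&lt;p&gt;&lt;code&gt;mean&lt;/code&gt; here is the prediction, and the prediction interval is given by
&lt;code&gt;obs_ci_lower&lt;/code&gt; and &lt;code&gt;obs_ci_upper&lt;/code&gt;. One caveat: &lt;code&gt;alpha&lt;/code&gt; is the significance
level, so an interval has coverage &lt;code&gt;1 - alpha&lt;/code&gt;. The &lt;code&gt;alpha=0.95&lt;/code&gt; call above
therefore produces a 5% interval (640 to 652 grams), which is why the bounds sit so
close to the prediction. For a 95% prediction interval, pass &lt;code&gt;alpha=0.05&lt;/code&gt;; expect
it to be much wider.&lt;/p&gt;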
&lt;p&gt;We're two thirds of the way there!&lt;/p&gt;
&lt;h2&gt;Providing an explanation&lt;/h2&gt;
&lt;p&gt;We can use the structure of the model to provide an explanation. The prediction is equal to:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;  -918          (the intercept)
-   80.3 * 31.3 (Length1)
+   79.9 * 34   (Length2)
+   32.5 * 39.5 (Length3)
+    5.3 * 15.1 (Height)
-    0.5 *  5.6 (Width)
   ------------
   646.12
&lt;/code&gt;&lt;/pre&gt;
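&lt;p&gt;As a quick sanity check, the same sum can be reproduced in a few lines of plain Python. The coefficients are copied from the summary table above, and the dictionary names are just illustrative. Bream is the reference species, so it contributes no separate species term. The result should land very close to the 646.12 reported above; any small gap comes from the coefficients being rounded in the table.&lt;/p&gt;

```python
# Coefficients copied from the OLS summary table (rounded to four decimals).
coefficients = {
    "Intercept": -918.3321,
    "Length1": -80.3030,
    "Length2": 79.8886,
    "Length3": 32.5354,
    "Height": 5.2510,
    "Width": -0.5154,
}

# Feature values for the new fish; the intercept is multiplied by 1.
features = {
    "Intercept": 1.0,
    "Length1": 31.3,
    "Length2": 34.0,
    "Length3": 39.5,
    "Height": 15.1285,
    "Width": 5.5695,
}

# Each term of the linear model is coefficient times feature value.
contributions = {name: coefficients[name] * features[name] for name in coefficients}
prediction = sum(contributions.values())

print(round(prediction, 2))
```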
&lt;p&gt;A way we might display how the various features contribute to the overall
prediction is this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fish_to_feats&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a_fish&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;feats&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a_fish&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;feats&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Intercept&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;species_feat&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;species_feat&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;C(Species)&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;
        &lt;span class="n"&gt;species&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;species_feat&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;.&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;]&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# This is ugly&lt;/span&gt;
        &lt;span class="n"&gt;feats&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;species_feat&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;feats&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Species&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;species&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;del&lt;/span&gt; &lt;span class="n"&gt;feats&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Species&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;feats&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;


&lt;span class="n"&gt;contributions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fish_to_feats&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_fish&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;contributions&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1e-3&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;1e3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;continue&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;{name}: {amount[0]}&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Which provides the following output:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Length2: 2716.2&lt;br /&gt;
Length1: -2513.5&lt;br /&gt;
Length3: 1285.1&lt;br /&gt;
Intercept: -918.3&lt;br /&gt;
Width: -2.9&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This could certainly be made more user-friendly, but it does give some kind of
explanation for why the model believes this fish to weigh 646 grams.&lt;/p&gt;
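&lt;p&gt;The reason this works as an explanation is that a linear model's prediction is just the sum of its per-feature contributions (coefficient times feature value, with the intercept multiplied by one). Here is a minimal sketch of that arithmetic; the coefficients and feature values are made up for illustration and are not the fitted values from the fish model.&lt;/p&gt;

```python
# Hypothetical coefficients and feature values, chosen only for illustration;
# they are NOT the fitted values from the fish model above.
params = {"Intercept": -900.0, "Length1": -50.0, "Length2": 95.0, "Width": -1.0}
features = {"Intercept": 1.0, "Length1": 30.0, "Length2": 32.0, "Width": 5.0}

# Each contribution is coefficient * feature value.
contributions = {name: params[name] * features[name] for name in params}

# For a linear model, the prediction is exactly the sum of the contributions.
prediction = sum(contributions.values())

# Largest absolute contribution first, mirroring the sort in the snippet above.
ranked = sorted(contributions.items(), key=lambda item: -abs(item[1]))
for name, amount in ranked:
    print(f"{name}: {amount:.1f}")
print(f"Prediction: {prediction:.1f}")
```

&lt;p&gt;Because the contributions add up to the prediction, any term that is filtered out of the printed list still counts toward the final number.&lt;/p&gt;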
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;We've built a model that can provide trustworthy predictions. For example:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;My best guess at the weight of this Bream is 646g.&lt;/li&gt;
&lt;li&gt;With 95% probability, the weight is between 640g and 652g.&lt;/li&gt;
&lt;li&gt;The biggest contributors to this prediction are &lt;code&gt;Length2&lt;/code&gt; (pushes the
   prediction higher), &lt;code&gt;Length1&lt;/code&gt; (pushes it lower), and &lt;code&gt;Length3&lt;/code&gt; (pushes it
   higher).&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;I highly recommend attacking machine learning problems by starting with an
incredibly simple model first. Implementing that end-to-end enables focus on the
truly difficult parts of machine learning (i.e. &lt;em&gt;not&lt;/em&gt; the ML bits). For some use
cases, this post provides yet another reason to love linear models: they are
trustworthy by default!&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;Comments? Questions? Concerns? Please tweet me
&lt;a href="https://twitter.com/SamuelDataT"&gt;@SamuelDataT&lt;/a&gt; or email me. Thanks!&lt;/p&gt;</description><author>Samuel Taylor</author><guid isPermaLink="true">https://www.samueltaylor.org/articles/trustworthy-linear-models.html?utm_source=rss&amp;utm_campaign=st-blog-feed</guid><pubDate>Sun, 14 Jun 2020 21:07:07 GMT</pubDate></item><item><title>Building Trustworthy Models</title><link>https://www.samueltaylor.org/articles/trustworthy-models.html?utm_source=rss&amp;utm_campaign=st-blog-feed</link><description>&lt;p&gt;Machine learning has a trust problem. Discussions about the role that algorithms
play in our lives have become national (if not global), with some raising
important and legitimate questions about the biases inherent in these
algorithms. In this environment, we wonder: what would it take for a model to be
worthy of our trust?&lt;/p&gt;
&lt;p&gt;I recently read an illuminating piece by David Spiegelhalter called "Should We
Trust Algorithms?". In it, he identifies the difference between trustworthy
claims about a system and trustworthy claims made by a system. His article
spends more time on the former than the latter, so I've written this article to
elaborate on ways our models can make more trustworthy claims.&lt;/p&gt;
&lt;h2&gt;Motivation&lt;/h2&gt;
&lt;p&gt;Building trust with users is essential for a few reasons. First, we want our
products to be used. If a user doesn't trust the predictions made by my model,
she is less likely to follow its advice. Worse, without adequately communicating
uncertainty, we may actively anger her. Suppose a model predicts this user's
house will sell for $300,000, but it ends up selling for $290,000. It's
difficult to fault her for being upset at this $10,000 difference.&lt;/p&gt;
&lt;p&gt;By contrast, if we predicted a range of possible sale values, the user would
have better expectations going in and a better experience with our product.&lt;/p&gt;
&lt;p&gt;Ethics provide a second reason that building trust is paramount. It is unethical
to present estimates without a sense for their uncertainty. A common machine
learning approach is to build some classification or regression model for a
problem. These models typically output a single predicted value: "this flower is
a setosa", "this house is worth $300K", or "this image has an airplane in it".
These statements imply a level of certainty that may be unwarranted by the data,
and we must be very careful to place them into context to avoid dishonesty.&lt;/p&gt;
&lt;p&gt;Finally, trustworthy models are just good business. If we hide uncertainty with
overly-precise point estimates, we are likely to make bad decisions. Annie Duke
writes:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A great decision is the result of a good process, and that process must
include an attempt to accurately represent our own state of knowledge. That
state of knowledge, in turn, is some variation of “I’m not sure.”&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Trustworthy models in practice&lt;/h2&gt;
&lt;p&gt;But building trust is hard. We cannot simply tell our users to "trust the
algorithm" and expect them to do so. Instead, Spiegelhalter argues, we should
put our efforts into building models that are worthy of trust. When we relate to
other humans, we understand that we must demonstrate that we are worthy of trust
before we will be trusted. The same holds true for models.&lt;/p&gt;
&lt;p&gt;A coworker of mine once asserted that trustworthy models provide at least three
things:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Prediction -- An estimate for some unknown value&lt;/li&gt;
&lt;li&gt;Confidence -- A description of how uncertain the model is about the
   prediction&lt;/li&gt;
&lt;li&gt;Explanation -- The reasoning behind the model's prediction&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;As an example, consider a doctor. If your doctor told you that your arm needed
to be amputated, you'd never let her do it based on that recommendation alone!
You would ask for some justification first. She would explain that an infection
in your arm could be lethal if it spreads. In this way, she builds trust with
you.&lt;/p&gt;
&lt;p&gt;These techniques which come as second nature to humans are not as automatic for
machines. We often stop short of developing models that are truly worthy of our
users' trust.&lt;/p&gt;
&lt;p&gt;Here's an example of what the output of a trustworthy model could look like:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Prediction: My best guess at the sale price of this home is $324,000&lt;/li&gt;
&lt;li&gt;Confidence: A 95% confidence interval on that number is ($315K, $333K)&lt;/li&gt;
&lt;li&gt;Explanation: The home is large, but the fact that it's on a corner brings its
   value down.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is a massive improvement over a single number! Not only is this a more
honest statement to make, it helps users understand why the model gave its
prediction. In turn, this helps users gain trust in the system and leads to
better outcomes.&lt;/p&gt;
&lt;p&gt;We have a responsibility as model builders to represent our work with integrity.
Shipping a model that is implicitly overconfident is bad for our users and our
businesses. Instead, we should develop models that are truly worthy of trust.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Sources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Duke, A. (2019). &lt;em&gt;Thinking in Bets&lt;/em&gt;. New York, NY: Portfolio/Penguin.&lt;/li&gt;
&lt;li&gt;Spiegelhalter, D. (2020). Should We Trust Algorithms? &lt;em&gt;Harvard Data Science
  Review&lt;/em&gt;, 2(1). https://doi.org/10.1162/99608f92.cb91a35a&lt;/li&gt;
&lt;/ul&gt;</description><author>Samuel Taylor</author><guid isPermaLink="true">https://www.samueltaylor.org/articles/trustworthy-models.html?utm_source=rss&amp;utm_campaign=st-blog-feed</guid><pubDate>Fri, 15 May 2020 22:26:30 GMT</pubDate></item><item><title>Your brain loves exercise, even if you don't</title><link>https://code2college.org/wellness-wednesday-building-strength-why-your-brain-loves-exercise-even-if-you-hate-it/</link><description></description><author>Samuel Taylor</author><guid isPermaLink="true">https://code2college.org/wellness-wednesday-building-strength-why-your-brain-loves-exercise-even-if-you-hate-it/</guid><pubDate>Wed, 15 Apr 2020 05:00:00 GMT</pubDate></item><item><title>Effective Learning (a review of Ultralearning)</title><link>https://www.samueltaylor.org/articles/better-learning.html?utm_source=rss&amp;utm_campaign=st-blog-feed</link><description>&lt;p&gt;Everyone advises being a "lifelong learner," but not all learning is created
equal. Many effective techniques are underutilized, and many common techniques
are useless.&lt;/p&gt;
&lt;p&gt;We buy a book or attend a conference. If we're really dedicated, we may even jot
a few notes down in the process. But rarely do we take a step back and ask what
the most effective way is to develop the skills we care about. Jeopardy great
Roger Craig says, "You can practice haphazardly, or you can practice
efficiently"
(&lt;a href="https://www.npr.org/2011/11/20/142569472/how-one-man-played-moneyball-with-jeopardy"&gt;NPR&lt;/a&gt;).
Unfortunately, most of us are practicing haphazardly.&lt;/p&gt;
&lt;p&gt;Fortunately, skill development is well studied. Two recent reads cover it well:
&lt;em&gt;Peak&lt;/em&gt; (by Anders Ericsson and Robert Pool; also covered in my &lt;a href="/articles/how-to-train-employees.html"&gt;last
post&lt;/a&gt;) and &lt;em&gt;Ultralearning&lt;/em&gt; (by Scott
Young). Three key pieces of advice from these books are to develop intuition,
focus on doing, and integrate feedback.&lt;/p&gt;
&lt;h2&gt;Develop intuition&lt;/h2&gt;
&lt;p&gt;If you've not watched &lt;a href="https://www.youtube.com/playlist?list=PLKtIunYVkv_RwB_yx1SZrZC-ddhxyXanh"&gt;&lt;em&gt;Gourmet
Makes&lt;/em&gt;&lt;/a&gt;
yet, you're missing out! The show's seen wild popularity for many reasons, not
the least of which is its host's intuition. Claire Saffitz has a wide range of
experiences that she draws on to recreate classic foods. Her explanations of why
she's swapping an ingredient or trying a particular technique reveal her mastery
of the subject and are incredibly interesting.&lt;/p&gt;
&lt;p&gt;This kind of intuition is explicitly identified in Young's book as one of the
principles of what he calls "ultralearning":&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In a famous study, advanced PhDs and undergraduate physics students were given
sets of physics problems and asked to sort them into categories.  Immediately,
a stark difference became apparent. Whereas beginners tended to look at
superficial features of the problem—such as whether the problem was about
pulleys or inclined planes—experts focused on the deeper principles at work.
“Ah, so it’s a conservation of energy problem,” you can almost hear them
saying as they categorized the problem by what principles of physics they
represented. This approach is more successful in solving problems because it
gets to the core of how the problems work.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The experts in this story have better mental representations of physics problems
than do beginners. It's not so much that they see past the intricate details of
each problem as that they are able to identify the details that matter most.&lt;/p&gt;
&lt;p&gt;Ericsson and Pool go so far as saying that the "main purpose of deliberate
practice is to develop effective mental representations". With focused study, we
exploit the wonderful adaptivity of the human brain, quite literally reshaping
it to be better at the new task. In an effort to minimize energy expenditure,
our brains pick up on patterns and encode those in structures to increase our
future effectiveness.&lt;/p&gt;
&lt;p&gt;Intuition is often the outcome of a long career, but we can develop it more
quickly. If we can get access to an expert, we can often gain intuition by
understanding how they think about things.&lt;/p&gt;
&lt;p&gt;You're not out of luck if you don't know such an expert! There's probably
somebody writing about your field online that you can learn from. For data
science, the winners' interviews on Kaggle's blog are an incredible resource.
For software engineering, I find High Scalability has a great roundup of
articles that can lead to a lot of insight about good design. Even Reddit is
sometimes a good resource.&lt;/p&gt;
&lt;h2&gt;Focus on doing&lt;/h2&gt;
&lt;p&gt;When undertaking a learning project, be very clear about what you want to do at
the end of it. Specific goals focus projects and ensure better outcomes. For
example, if a data scientist wants to understand deep learning techniques
better, she or he may decide to build a system for reading the sign language
alphabet from a user's webcam. Without a specific project, it's easy to spend
lots of time watching lectures or reading books that feel like productive uses
of time yet don't contribute to real skill development.&lt;/p&gt;
&lt;p&gt;I am explicitly not saying that books and lectures are unhelpful; on the
contrary, they are often the richest sources of knowledge. But without
something concrete to guide our reading, we can waste time unwittingly.&lt;/p&gt;
&lt;p&gt;I have always loved learning. I collect information like some people collect
baseball cards. I find joy in having relevant tidbits of information to share
with people. One of the things I'm learning, though, is that taking in
information isn't an end unto itself. Ultimately, the thing that matters is what
that information enables me to create, be, or do. Explicitly choosing a desired
outcome for my learning projects helps me learn better.&lt;/p&gt;
&lt;h2&gt;Integrate feedback&lt;/h2&gt;
&lt;p&gt;Experimentation is key to mastery. We've got to try things, understand what went
well (and what didn't), and integrate those learnings into another attempt. It's
a feedback loop! Not all feedback is created equal, though. Young identifies
three types:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;outcome feedback, like receiving a grade on an exam,&lt;/li&gt;
&lt;li&gt;informational feedback, where you're told what you're doing wrong (but not how
  to fix it), and&lt;/li&gt;
&lt;li&gt;corrective feedback, which goes beyond mistakes you're making and includes
  ways to fix them.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Of course the last type is the most useful, but it is also the most difficult to
get. In &lt;em&gt;Peak&lt;/em&gt;, the authors advocate strongly for the value of a coach/mentor
largely due to their ability to provide feedback. YouTube is a great way to
start learning guitar, but at a certain point you need a human being to provide
specific, individual feedback.&lt;/p&gt;
&lt;p&gt;But all is not lost if we have no coach! We can use a number of techniques to
gather feedback on our own. One I find interesting is the Feynman technique. To
start, write a problem down on a piece of paper. Then, explain the solution as
though you were teaching someone. Walk through not just the steps for solving
it, but the rationale behind doing so. The most valuable feedback in this
process comes when you get stuck; the parts that are hard to explain illuminate
where your learning can go deeper.&lt;/p&gt;
&lt;h2&gt;Closing recommendations&lt;/h2&gt;
&lt;p&gt;If you're curious about this stuff, I recommend both &lt;em&gt;Peak&lt;/em&gt; (&lt;a href="/articles/how-to-train-employees.html"&gt;here's my
review&lt;/a&gt;) and &lt;em&gt;Ultralearning&lt;/em&gt;. While they
overlap significantly, the former has more insight on organization-level
training and the latter is better for individuals structuring their own learning
programs.&lt;/p&gt;
&lt;p&gt;Life's too short for easy learning. Spend time doing the hard work of learning
difficult things well. Do so by developing intuition, focusing on doing, and
integrating feedback.&lt;/p&gt;</description><author>Samuel Taylor</author><guid isPermaLink="true">https://www.samueltaylor.org/articles/better-learning.html?utm_source=rss&amp;utm_campaign=st-blog-feed</guid><pubDate>Sun, 29 Mar 2020 12:25:19 GMT</pubDate></item><item><title>How to train data scientists and engineers: a review of Peak</title><link>https://www.samueltaylor.org/articles/how-to-train-employees.html?utm_source=rss&amp;utm_campaign=st-blog-feed</link><description>&lt;p&gt;Most companies with personal development budgets are wasting their money. If the
goal is to help employees master valuable skills, then we are misallocating
funds to books and conferences. Instead, we should look at what the research on
skill development says so that we can make decisions informed by data.&lt;/p&gt;
&lt;p&gt;Great people are hard to find, especially in software engineering and data
science. The people who you want to hire probably work for your competitors and
make more than you can afford to pay them. If instead of finding employees that
are already great you could help employees become great, then the process of
doing that would become a huge competitive advantage.&lt;/p&gt;
&lt;p&gt;I recently read the book &lt;em&gt;Peak&lt;/em&gt; by Anders Ericsson and Robert Pool. In it, the
authors espouse the value of deliberate practice and offer research-backed
insight into effective training practices. How might a company go about creating
top-tier talent?&lt;/p&gt;
&lt;p&gt;Before we talk about what works, let's think about what doesn't:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;attending lectures, minicourses, and the like offers little or no feedback and
little or no chance to try something new, make mistakes, correct the mistakes,
and gradually develop a new skill&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Unfortunately, many corporate training budgets are set up to send people to
lectures and minicourses rather than building good training programs. The
authors offer advice on how we might train doctors, though we could adapt their
recommendations to the tech industry.&lt;/p&gt;
&lt;p&gt;The first step is to find the experts. Ideally we could do this on a global
scale, determining the greatest data scientists in the world. Unfortunately,
even determining a methodology to find these people sounds prohibitively complex
and difficult. Fortunately, we can settle for an approximation. Bringing an
average engineer to the level of elite talent on the world stage would be
incredible, don't get me wrong, but even getting that person to perform like the
best engineer at the company would be a huge win.&lt;/p&gt;
&lt;p&gt;Finding the top talent at your company may involve: asking various individuals
who they hold in particularly high regard, examining performance review data,
and/or determining which individuals have had the greatest positive impact on
the business. If you're fortunate enough to have access to brilliant individuals
outside your company, all the better!&lt;/p&gt;
&lt;p&gt;Once we've found these experts, we move on to step two: understanding how they
think about problems. Ericsson and Pool go so far as saying that the "main
purpose of deliberate practice is to develop effective mental representations."&lt;/p&gt;
&lt;p&gt;Having done this crucial work of understanding highly effective individuals, our
third step is to build a "Top Gun" school. Modeled after an &lt;a href="https://en.wikipedia.org/wiki/United_States_Navy_Strike_Fighter_Tactics_Instructor_program"&gt;effective strategy
for training fighter
pilots&lt;/a&gt;,
&lt;em&gt;Peak&lt;/em&gt;'s authors recommend creating training programs that simulate the real thing
as well as possible while dramatically lowering the cost of failure. In our
industry, this might mean identifying a few JIRA tickets representative of a
team's work and having trainees work them under the watchful eye of high-quality
instructors. These instructors should point out failures to their students and
help the students to develop the thought processes (i.e.  mental
representations) of high performers.&lt;/p&gt;
&lt;p&gt;Note that we're not developing coursework on design patterns, data warehouse
design, deep learning, or a certain JavaScript framework. Instead, we focus on
doing the work:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;One of the implicit themes of the Top Gun approach to training, whether it is
for shooting down enemy planes or interpreting mammograms [or developing a
model predicting widget manufacturing capacity], is the emphasis on doing. The
bottom line is what you are able to do, not what you know&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Building great teams isn't easy, but it is incredibly valuable. More than just
spending $50 on eBooks, companies should create training programs that use
data-driven insights into skill development to bring each team member to the
level of our best performers. Great products come from great teams, and great
teams are formed of great individuals. Great individuals are formed through
deliberate practice.&lt;/p&gt;</description><author>Samuel Taylor</author><guid isPermaLink="true">https://www.samueltaylor.org/articles/how-to-train-employees.html?utm_source=rss&amp;utm_campaign=st-blog-feed</guid><pubDate>Thu, 12 Mar 2020 17:19:13 GMT</pubDate></item><item><title>How to Handle Class Imbalance</title><link>https://www.samueltaylor.org/articles/how-to-handle-class-imbalance.html?utm_source=rss&amp;utm_campaign=st-blog-feed</link><description>&lt;p&gt;Alternate title: &lt;em&gt;Help! My Classes are Imbalanced!&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Delivered at:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://odsc.com/training/portfolio/help-my-classes-are-imbalanced/"&gt;ODSC West 2019&lt;/a&gt;. Slides available &lt;a href="/static/pdf/class_imbalance.pdf"&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developerweekaustin2019.sched.com/event/XC6I"&gt;DeveloperWeek Austin 2019&lt;/a&gt;. Audio available &lt;a href="https://files.samueltaylor.org/imbalance_devweek.mp3"&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.meetup.com/Aspiring-Data-Scientist-Community/events/266436999/"&gt;Aspiring Data Scientist Community&lt;/a&gt;. Audio available &lt;a href="https://files.samueltaylor.org/imbalance_adsc.mp3"&gt;here&lt;/a&gt;. Transcript below.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://anacondacon.io/samuel-taylor-bio"&gt;AnacondaCON 2020&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Find me on Twitter &lt;a href="https://twitter.com/SamuelDataT"&gt;@SamuelDataT&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Transcript&lt;/h2&gt;
&lt;p&gt;The start of this story is that at one point I was in undergrad, and I was in this machine learning class (the first
machine learning class I'd ever taken). And I was working on this Last.FM dataset trying to build a recommender system,
because I've always thought it would be really cool to build a computer algorithm that can help you overcome information
overload. And I thought this would be a cool way to try that. I started working on this little handwritten thing, and I
was feeling pretty good about it. And then I walked into my professor's office and I said, "Hey, you're not gonna
believe this. I have a 99% accuracy rating already, and I've barely even started."&lt;/p&gt;
&lt;p&gt;And he doesn't react like I wanted to. I wanted him to be like, "Wow, you are a prodigy. This is amazing!" But what he
actually says is, "OK, well, tell me a little bit more. What's the base rate of your problem?" And I say, "What was
that?" And he says, "What would happen if you just predicted the most common class for everything? What if you just said
nobody listens to anything?" So I tried that, and it turned out that that's exactly what the algorithm was doing. It was
just saying nobody listens to anything. And we're getting 99% accuracy because there's so many artists in the world and
there are so many people in the world that the intersection of that is going to be pretty small. So I was very sad about
this. And eventually, I found a solution.&lt;/p&gt;
&lt;p&gt;But the frustrating part about it was I kept running into this over and over and over again, this problem of class
imbalance. So this is a talk that I wish I had when I was in undergraduate school before I walked into my professor's
office and made myself look really stupid. I wish that I could have seen this.&lt;/p&gt;
&lt;p&gt;Before we get started, I do work for Indeed.
Every Indeed presentation has this slide in it that says "We help people get jobs." If you like the idea of
helping people get jobs, come talk to me. Today we're going to talk about class imbalance. This is the way in which
we're going to do that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;We will start off with what it is&lt;/li&gt;
&lt;li&gt;then we'll move on to how to figure out what is happening&lt;/li&gt;
&lt;li&gt;then talk about some solutions for it.&lt;/li&gt;
&lt;li&gt;And then at the end, I have some recommendations that sum everything up and try to tie a nice, intellectually tasty
  bow on this package, because there's gonna be a bunch of stuff.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So let's start off with what is class imbalance. Class imbalance happens when you have certain values of your target
variable that are way more common than other values. For instance, this is a wine classification dataset. You may
notice that there are orange points and there are blue points on this graph. And the orange points are way outnumbered
by the blue points. We can say that this dataset exhibits class imbalance because there are way more blue points
than there are orange points. There's a lot of things that can cause imbalance, and we're just going to walk through
them because understanding them will help us understand the solutions there.&lt;/p&gt;
&lt;p&gt;The first thing is there's just a lack of data. So this is a made up graph I have drawn. And let's say you have two
features. One of them is on the x axis and one is on the y axis here. And then you have some points that are orange and
some points that are blue. It's difficult to know -- where is the true blue region? Is it defined by an ellipse that
covers both of these? Is it defined by a rectangle that covers that? Are there two separate ellipses? I don't know,
there's just not a lot of data to really be able to infer that from what we have here. So that's part of the problem
with this.&lt;/p&gt;
&lt;p&gt;Another thing that can be problematic is overlapping. This is from a paper, I believe by Batista, where he and his
co-authors talk about the fact that sometimes, even with a heavy degree of class imbalance, the problem isn't actually
that bad. If you look at the bottom right of this screen, there is an imbalance here -- you have way more blue points
than orange points -- but you can still draw a linear separator between these two things, so you can still separate
them just fine. The problem only becomes worse when the classes start overlapping, and that's what you start to see
more toward the top left. These points would be difficult to distinguish from one another.&lt;/p&gt;
&lt;p&gt;Noise is another important factor in why this happens. You can imagine that we have some blue region that is some set of
points. And in the real world, you don't observe these regions, but for pedagogical reasons, go with me. So we have the
blue region and orange region, and we've got some points, way more blue points than orange points. And just by the sad law
of large numbers, or our instruments being off or something, we measure these ones noisily. We accidentally think that
these points are way higher than they actually are. And that means that we are going to incorrectly think that this is
part of the majority class region (the blue region), when in truth it is part of the orange region. Because there were
so many more blue points that we saw. We had a much better shot of reading some points. And they just got overwhelmed.&lt;/p&gt;
&lt;p&gt;The one that I think is the most well theoretically justified is this idea of biased estimators. This comes from a paper
that I have linked at the end of the slides by Wallace, Small, Brodley, and Trikalinos. It's called &lt;em&gt;Class Imbalance,
Redux&lt;/em&gt;, and it's just this really beautiful theory that they present. The crux of their argument is in this
figure. Here they display a binary classification problem in two dimensions. Along the x axis here, we'll say
that's our one feature. If it's an X, it's a minority class point; if it's a square, it's a majority class
point. We can see there are a lot more squares than Xs. What they find is that you don't actually have one
distribution that you're drawing all the points from; you have two different distributions that you're drawing from, and
you happen to see points from the majority class a lot more often. So you're sampling a lot more out of this orange
distribution than you are out of the blue one. Ideally, in the greatest world, where you can see where these
distributions live, we could draw this beautiful, ideal separating line that would, to the best of our ability,
separate these two areas. But because we have a class imbalance problem, the line gets biased toward the minority
class: it gets pushed in the direction of the minority class. That means that we're going to accidentally cut off some of
the region that should be part of the minority class and incorrectly allocate it to the majority class.&lt;/p&gt;
&lt;p&gt;So those are the causes. Okay, how do you recognize it? The first thing: just look for it. Call
&lt;code&gt;value_counts&lt;/code&gt; and just know that it's happening, because the first time I ever ran into this, I didn't think
to check for it; I just assumed stuff. So don't make assumptions. Bad idea.&lt;/p&gt;
&lt;p&gt;The next thing you should do is compare stuff. This is a trick question, everybody: 97% accuracy. Is that good? What
does anybody think? Anybody? Yeah, I mean, when you compare it to this, 97% for our fancy classifier but 94% for a
really dumb classifier, then it looks a little bit less impressive. It's like, "Oh, we actually had an improvement of
3%." So one thing that you can do, and that scikit-learn provides, is this API for a dummy classifier. You can have it
do a bunch of stuff: you can have it predict the most common class or just predict something at random. I highly,
highly recommend that when you run into a class imbalance problem, or really whenever you run into a problem, you try a
stupid classifier that you can use to compare metrics and just sort of sanity check yourself. Just look at them next to
each other and you'll realize, "Oh, 97 isn't good, because 94 you can get by guessing."&lt;/p&gt;
&lt;p&gt;This is the code for how to do that.&lt;/p&gt;
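&lt;p&gt;The code from the slide is not part of this transcript, so what follows is only a minimal sketch of the same
comparison, assuming scikit-learn and NumPy are installed. The data is synthetic and purely illustrative (roughly 6%
minority class), and &lt;code&gt;LogisticRegression&lt;/code&gt; is a stand-in for whatever "fancy" classifier is being checked.&lt;/p&gt;

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, illustrative data: roughly 6% of points are the minority class (label 1).
rng = np.random.default_rng(0)
n_samples = 2000
y = (rng.random(n_samples) > 0.94).astype(int)
X = rng.normal(size=(n_samples, 2)) + 1.5 * y[:, None]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Baseline that always predicts the most common class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
# Stand-in for the real model being sanity-checked.
model = LogisticRegression().fit(X_train, y_train)

baseline_accuracy = accuracy_score(y_test, baseline.predict(X_test))
model_accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"dummy accuracy: {baseline_accuracy:.3f}")
print(f"model accuracy: {model_accuracy:.3f}")
```

&lt;p&gt;On data this skewed, the dummy baseline should score close to the majority-class share (about 94% here), which is
the number the real model has to beat.&lt;/p&gt;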
&lt;p&gt;Basically, you just import this stuff, and then you call fit and predict just like anything else, and you can look at
the numbers and see for yourself how you're doing. So, part of the problem with class imbalance: so far, all we've really
talked about is accuracy. And that is a huge part of the problem, because accuracy just assumes that every error is the
same, which is often not true. So let's use a medical example to really make this concrete. You can imagine that we
might be scanning someone's brain for cancer, and we have images of malignant tumors and benign tumors. And we can make
a mistake on either one. If we make a mistake on a benign tumor, that means we tell the patient, "Hey, we think your
tumor might be cancerous, we're going to need you to come in for some additional tests." And there is a cost associated
with that: they are going to be worried, and their family's probably going to be worried, and they're going to have to
pay more for the extra tests, and your staff is going to have to run the extra tests. There is a real cost there. But it
is very different from the cost of making a mistake in the malignant case. If we see a malignant tumor and we
accidentally say that it is benign, we're going to send someone with cancer home and say, "Oh, you're good, don't worry
about it." And then they're probably going to die because they didn't get additional screening. That's obviously
terrible. So what this means is that if you use accuracy, you are implicitly saying that the death of a human person is
exactly as bad as calling someone in for additional tests. And that's obviously absurd. Did you have a question? Yeah,
yeah. You can be incorrect in both ways. Totally. Yep. So one set of metrics that people often like to use is precision and
recall. So on this little diagram here, we have some false negatives and true negatives. What this means in our medical
example is that for precision, what we're saying is, of all the tumors that I say are malignant, how many of those
actually turned out to be malignant? And then recall is saying, of all the malignant tumors in the world, how many am I
actually correctly identifying? Recall is sort of a way of knowing, am I pulling back out, am I recalling the points
that I particularly care about? Another set of metrics that you should use rather than accuracy is the receiver
operating characteristic curve. The gist of how you do this is that most classifiers can give you some sort of
probability or decision function. You can vary a threshold from zero to one, calculate true and false positive rates,
and put them on this curve. This is nice because it lets you think about how good this model is for various levels of a
threshold, and it sort of tells you implicitly about how well your model is ranking points against each other. This is a
method recommended by Sinha and May in their paper about ROC curves, and they find that when you don't know how bad the
two alternatives are, it's really good to use an ROC curve because you aren't required to know those things in advance.
So for instance, if you don't know, is it worse for a user to see a job that they don't want to apply to? Or is it worse
for a user to not see a job that they do want to apply to? It's kind of hard to figure out whether one of those is twice
as bad as the other, and when you don't know, that is often a good place to use the area under the ROC curve. And of
course, the last thing I'll note on metrics... Is there a way to fix that if you do know the costs? Yes, we will talk
about that in a little bit. If you do know the cost, there are techniques that you can use for sure, and we'll talk
about those in a second. So the last thing I'll mention on metrics is to be really careful with the way that you do your
training and testing splits. There's this really interesting paper that I didn't have time to put into this because it
came out this year. The lead author is a person named Luque, L-U-Q-U-E, and it's in the slides. They do some really
rigorous research on the way that various metrics are affected by imbalance and changes in the balance.&lt;/p&gt;
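&lt;p&gt;The precision and recall definitions above can be written out directly. This is a minimal sketch in plain Python.
The function name &lt;code&gt;precision_recall&lt;/code&gt; and the labels are made up for illustration and do not come from the
talk.&lt;/p&gt;

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == positive and p == positive)
    false_pos = sum(1 for t, p in pairs if t != positive and p == positive)
    false_neg = sum(1 for t, p in pairs if t == positive and p != positive)
    predicted_pos = true_pos + false_pos
    actual_pos = true_pos + false_neg
    # Of everything flagged positive, how much really was positive?
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    # Of everything that really was positive, how much did we flag?
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall


# Hypothetical screening results: 1 = malignant, 0 = benign.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(precision, recall, accuracy)
```

&lt;p&gt;Here accuracy is 70% even though half of the malignant cases are missed, which is the gap these metrics are meant to
expose. In practice, scikit-learn's &lt;code&gt;precision_score&lt;/code&gt;, &lt;code&gt;recall_score&lt;/code&gt;, and
&lt;code&gt;roc_auc_score&lt;/code&gt; cover the same ground.&lt;/p&gt;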
&lt;p&gt;The gist of that paper is that a lot of the metrics that you care about are probably going to be very different just
based on the prevalence of the minority class. So if you just by random happenstance happen to get a test split where
there's 5% of the minority class versus one with 10% of the minority class, you're going to see dramatically different
error numbers for those two different things. And that's not necessarily reflective of some sort of underlying truth.
It's more of a reflection of the bias inherent in certain metrics. So what I would highly recommend you do is, when
you're doing training and testing splits, do a stratified split, where you make sure that you have a very similar
prevalence of the minority class in each split. Okay, a lot of stuff. Let's talk about how to solve this problem. There's a
lot of different things you can do. The first thing you can do is kind of like eating your vegetables. Everyone knows
it's a good idea to eat your vegetables. Like, I was eating tamales, you know, and I know that I need to eat more
spinach. I know that it's good for me, it's gonna make me last longer in life. And that's kind of what this is:
gathering more data is the eat-your-vegetables option. This is going to help you. It's good, but it kind of sucks. So
if you can't do that, or if you don't want to do that, there's a bunch of other techniques that will actually be the
bulk of this discussion. This is sort of a taxonomy that is described in this really good survey paper by Branco,
Torgo, and Ribeiro, where they talk about three different ways that the research has thought about addressing this
problem. Those three being preprocessing, special purpose learners, and prediction post-processing. So we'll go through
each of those in this little chunk here. First, preprocessing. When we talk about preprocessing, we are basically
talking about taking our data set and either making more points or making fewer points, like changing the distribution
of these things versus each other.&lt;/p&gt;
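&lt;p&gt;Before getting into resampling, the stratified split recommended a moment ago can be sketched in plain Python. The
helper &lt;code&gt;stratified_split&lt;/code&gt; and the 5% prevalence are hypothetical, chosen only to show the idea of splitting
each class separately so both halves keep a similar share of the minority class.&lt;/p&gt;

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_indices, test_indices) with similar class prevalence."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_indices, test_indices = [], []
    # Split each class on its own so neither side ends up short on the minority class.
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_indices.extend(indices[:n_test])
        train_indices.extend(indices[n_test:])
    return sorted(train_indices), sorted(test_indices)


# Hypothetical labels: 5% minority class (1), 95% majority class (0).
labels = [1] * 50 + [0] * 950
train_idx, test_idx = stratified_split(labels)

train_prevalence = sum(labels[i] for i in train_idx) / len(train_idx)
test_prevalence = sum(labels[i] for i in test_idx) / len(test_idx)
print(train_prevalence, test_prevalence)
```

&lt;p&gt;With scikit-learn, passing &lt;code&gt;stratify=y&lt;/code&gt; to &lt;code&gt;train_test_split&lt;/code&gt; does the same job.&lt;/p&gt;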
&lt;p&gt;So we're going to talk about oversampling first. In oversampling, what you do is you take the data that you have, and
you make more minority class points. You can do that in a number of ways. The first way that I have up here is random:
you just take some points and duplicate them in your data set. It's difficult to see that that's happening, because
these points are all on top of each other. And this is in two dimensions. We don't have our fancy 3D glasses to, you
know, zoom into this. But in the other examples, you can see more clearly what's going on, where in SMOTE, for instance,
they're creating new minority class points.&lt;/p&gt;
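&lt;p&gt;Random oversampling as described above amounts to duplicating minority points. A minimal sketch, assuming a binary
problem and balancing all the way to equal class sizes (both of which are choices made here, not something the talk
prescribes). The name &lt;code&gt;random_oversample&lt;/code&gt; and the points are illustrative.&lt;/p&gt;

```python
import random


def random_oversample(points, labels, minority_label, seed=0):
    """Duplicate random minority points until both classes are the same size."""
    rng = random.Random(seed)
    minority = [p for p, label in zip(points, labels) if label == minority_label]
    n_majority = len(points) - len(minority)
    n_extra = max(0, n_majority - len(minority))
    # Each extra point is an exact copy of an existing minority point.
    extra = [rng.choice(minority) for _ in range(n_extra)]
    return points + extra, labels + [minority_label] * n_extra


# Hypothetical 2-D data: 2 minority points (label 1), 8 majority points (label 0).
points = [(0.1, 0.2), (0.3, 0.1)] + [(float(i), 1.0) for i in range(8)]
labels = [1, 1] + [0] * 8

new_points, new_labels = random_oversample(points, labels, minority_label=1)
print(new_labels.count(1), new_labels.count(0))
```

&lt;p&gt;Because the copies sit exactly on top of the originals, a scatter plot of the result looks unchanged, which is why
the duplication is hard to see on the slide.&lt;/p&gt;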
&lt;p&gt;So SMOTE is a technique that is used for oversampling of the minority class. And this is sort of the algorithm of how
you do that. What you do is you take some minority class point, and then you find its k nearest neighbors, where you
can, you know, do a hyperparameter search or just figure out what the optimal value for k is. And then you pick a point
that is some percentage of the way between those points. So you're interpolating between points to make new points, is
the idea. You keep doing this until you reach the level of balance that you want.&lt;/p&gt;
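&lt;p&gt;The interpolation step can be sketched in a few lines of plain Python. This is a simplified, SMOTE-like illustration
under the assumptions stated in the comments, not the reference implementation. The function name
&lt;code&gt;smote_like_oversample&lt;/code&gt; and the points are hypothetical.&lt;/p&gt;

```python
import math
import random


def smote_like_oversample(minority_points, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating toward near neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority_points)
        # The k nearest minority neighbors of the base point (excluding itself).
        others = [p for p in minority_points if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        # Assumption: the neighbor is chosen uniformly at random.
        neighbor = rng.choice(neighbors)
        # Pick a spot some random fraction of the way from base to neighbor.
        fraction = rng.random()
        synthetic.append(tuple(
            b + fraction * (n - b) for b, n in zip(base, neighbor)
        ))
    return synthetic


# Hypothetical minority class points in two dimensions.
minority = [(1.0, 1.0), (1.5, 1.2), (0.8, 1.9), (1.2, 0.7)]
new_points = smote_like_oversample(minority, n_new=6)
print(new_points)
```

&lt;p&gt;Every synthetic point lies on a line segment between two real minority points, so it always falls inside the region
the minority class already occupies.&lt;/p&gt;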
&lt;p&gt;The way that you choose which of your neighbors to interpolate between matters, and that's an area that has been
researched further. The original paper just did it randomly, and that's what this SMOTE diagram is here. But there's
been updates and further research on this where people tend to find that ADASYN is a really good alternative, where
instead of just picking randomly, you try to pick points that are closer to the decision boundary. And so you can kind
of see that there's a lot of points on this SMOTE diagram towards the bottom. And, like, we already know that the bottom
is blue on this diagram. Where we might need more help is when you're getting closer to those orange points up toward
the top, and that's what ADASYN tries to do. You can also go the other way. So that's the oversampling idea. Did you
have a question? Yeah. So when you oversample, you're just adding more of the minority class. How is that really going
to affect the decision, if it's just more points for the minority class? Yeah, I have some diagrams that I can use to
explain this in a sec. The gist is, remember that thing earlier with the two lines, and the line getting pushed toward
the minority class? When you have more of those minority class points, they are able to sort of fight back and push that
separating boundary away. And there are some diagrams I can show you in a second.&lt;/p&gt;
&lt;p&gt;So, undersampling is the other way you can go.&lt;/p&gt;
&lt;p&gt;And coming back to our idea of noise here, I want to remind you that if we zoom in on this little section, we're going
to see these two points next to each other. And the insight of this particular technique is to say that if we have two
points right next to each other that are different classes, we probably measured one with noise. And this, I'm gonna be
honest with you, is a total heuristic. It's not a guaranteed thing. It's just, like, probably, you know, probably. So
in Tomek links, when you're using them as an undersampling technique, you just take the one from the majority class and
throw it away. You're just like, yeah, that one's probably not legit, let's just not care about it. And so that's an
idea that you can do. The other thing you can do is way simpler: just randomly throw points out of the majority class
until you get to some closer approximation of balance. And that's actually what the Wallace paper argues for. They find
that these green lines here are sort of what happens when you undersample to get to a balanced point. And you can do
that in a number of different ways. And you can see that you get a lot of different planes. Like, if you happen to throw
a certain point out, you could end up drawing your plane in a different spot. And so that can lead to different amounts
of error for each plane. But the nice thing is that all of these planes are less biased than the original one that we
inferred. So this purple line that I had here was our original biased estimator, and all these green ones are at least
less biased than that. Now it
does suck that the error metrics are going to be different for all of them. And in the face of this variance, the
authors just suck it up. And they're like, all right, here's what we're going to do: we're going to bag things together,
because bagging is a great way to trade off bias and variance. So we're just going to, you know, sell some bias and gain
a little bit on our variance. So you can take your data set and undersample it in a bunch of different ways, and then
bag the classifiers together and get a more performant model. If you're gonna do any of this, I highly recommend this.
And if you are using Python, then use imbalanced-learn, because they implement a lot of this stuff for you. Okay, now,
deep breath. Preprocessing is great, and it's not so great. So let's
talk about when it's good and when it's bad. It's great because the libraries already exist for a lot of this stuff,
which is nice. It also kind of gets the model closer to what you're looking for: it undoes some of that bias that we
see from the Wallace paper. Now, this isn't really a good or bad thing, but it does change the cost of training your
model. So if you imagine that you have so much data that you can barely fit it on your computer, and you start
oversampling that data, you're going to have too much data and you won't be able to train your model, or it's just
going to make your model train longer. By contrast, if you are undersampling, you're probably going to have faster
model train times because you have less data. So that's not really a good or bad thing, just something to be aware of
when you're doing this technique. Now, it can be kind of difficult to apply this because you don't always know what
level of balance you want to get to. Like, should I go to, you know, where instead of 1% it's some higher percentage?
Should I try to go straight to balance? It's not always clear. And you kind of have to explore and experiment with that
to figure out what the right thing is. A second way this can be difficult to apply: if you think back to that SMOTE
example, that was dealing with real-valued or floating point numbers. You can imagine if we had categorical data, it
would be kind of difficult to do that. Like, if you're doing word count vectors or something, what does it even mean to
have 0.7 of the word "apple" in a document? What does that even mean? So there are adaptations of the algorithm to deal
with categorical data. But they're less straightforward, I'll say. Okay,
next up in terms of various solutions to this problem: special purpose learners. You've probably already seen this if
you have looked at the documentation for any of your libraries. I just went through and found some of my favorite
algorithms and copied the docstring or the documentation, and I highlighted that a bunch of these have a way to specify
a class weight. What this is doing depends on the model; various models do different things. So in tree models, what
this is going to affect is, first of all, the impurity calculations. When you're going through and trying to figure out
what feature you should split on, the impurity calculations get weighted based on the weighting. Nice, that's fine. And
the other thing it'll affect is voting time at prediction. So if, for instance, you say that our minority class should
count for twice the majority class, then if you get to the bottom of your tree, and you get to a leaf, and there is one
majority class point and one minority class point, the minority class wins, because it has two votes and the majority
class one only has one vote. You know, it's different in different places. For SVMs, what this kind of does is push the
hyperplane that it learns away from the minority class, and this is cool because it kind of does the same thing of
undoing that bias that the Wallace paper talks about. And they find, actually, that doing this weight-based minimization
is very similar to doing an oversampling technique. It does the same thing in logistic regression, where it just sort of
wins back some points for the minority class. In k nearest neighbors, it changes
the distance metric. And then, like, whatever else. For every different algorithm, weighting the minority class is going
to do something different. And that's cool because there's all sorts of interesting things you can go off and learn
about, but it also sucks because it means there's all sorts of interesting things that you could go off and learn about.
So instead of talking about every single different thing that class weighting does, let's talk about when this works and
when it doesn't. That way, if you run into this situation, you can think, okay, should I do class weighting? So, the
research that I have read finds that when you have really high imbalance, weighting is less effective. This is from that
Wallace paper, and they basically just say that when we have a lot more imbalance, the chance that this weighting
technique is going to work is just lower, and it's going to be more effective with more data. This is where the whole
eat-your-spinach thing comes back in, because more data is always going to make this better. And they have an actual,
really good, equation-backed theory of imbalance that they draw this reasoning from. I highly recommend that if you're
gonna read one of these papers, it's the Wallace one, in my opinion, because they have really interesting stuff to say.
So good things,
bad things on special purpose learners. We're sort of two thirds of the way through these right now. Did you have a
question? Yes. So with more data, if the data is still imbalanced, is weighting effective? Right. Yeah. So the question
sort of being, if I have a lot of data and I have a high imbalance, is it going to be good or not? And I think Wallace
doesn't really make an argument for what happens in both cases. They say that for a fixed value of imbalance, so let's
say you have a dataset where only 1% is in the minority class, if you just get more points, class weighting will start
to become more effective. And then they also say that if you have a given dataset size, and one dataset happens to have
10% versus 1%, the 10% dataset is going to have better results from using class weighting. Okay, good things, bad
things. Good things: it directly addresses the issue. Bad things: you still have to know what your cost-benefit is
here. Which points do you care about more, and how much do you care about certain points over other points? It's nice
that this works when you have a lot of data or when you are a little bit closer to balance. But if you don't already
have this class weighting idea in the algorithm, it's going to suck. You're going to have to really get a deep
understanding of the algorithm, maybe write your own implementation of it, and figure out, okay, what part of this
algorithm is suffering from bias, and how can I undo that? And that's difficult to do sometimes. Okay, so the last kind
of group of techniques that they talk about is prediction post-processing. And there's two things I want to talk about
here. First is threshold selection. Oftentimes, when I run into a class imbalance problem, my first thought is, can I
just turn this into a ranking problem and not have to worry about this so much? You might think of something like, I
want to send an email that has 10 jobs to a user, the 10 jobs I think they are most likely to like. In that case, you
know, the user probably only likes 0.1% of the jobs that you know about, because they work in one industry. But you
don't really need to worry about that too much if you can rank the jobs against each other. If one has a, you know,
0.05% chance of them liking it, and that ranks above a lot of the other stuff, you can pick the top 10 in terms of
whatever criteria you have. If you can't do that, though, it is foolish to just use the default threshold that the
library gives you. You should very specifically choose the threshold that you want to use to optimize your metrics. So
if you have to specifically pick that number, like, what percentage chance do I need to choose where, above that point,
I'm going to call someone back for their cancer screening, you need to pick that number
really carefully. The gist of how you would do this is you would get the probability output of your model, and then
figure out what metrics you care about. I put precision and recall on the screen because we talked about them earlier,
but you can use whatever metric you happen to care about. You want to vary that threshold and measure the various
metrics that you care about. And you can then look at this and make an informed human decision on what you want to do.
So we imagine, in this brain cancer case, we are probably willing to sacrifice precision for recall. We are willing to
accidentally call some people back that are just fine, because we would rather do that than let someone who has cancer
go home. But there's other cases where it's exactly the opposite. Like, if I am trying to fingerprint into my phone,
the cost of me missing it is, like, I get a little annoyed. But the cost of someone getting into my phone, and that not
actually being me, is a lot, because they're going to get in there and they're going to read my memes and tell me
they're not funny, and that's gonna hurt my feelings. I'm gonna have to go to therapy. One last note on this: don't use
your test set to do this. Do some sort of cross-validation thingy over your training set, but don't use your test set,
or you'll overfit your test set, and that'll be really sad. And the other post-processing technique that people talk
about is the idea of cost-based classification. This
more directly gets to this idea of there being direct costs for a false positive and a false negative. There's a couple
of papers here. We're just going to talk about the Sinha and May paper, where what they essentially find is that when
you have an ROC curve, each point on it will correspond with some sort of threshold. Like, that's the threshold for,
you know, 0.64: if there's a 64% chance of this person having cancer, we'll call them back, right? And we have a true
positive rate and a false positive rate. We can use that to calculate cost with this formula. It's not as scary as it
looks. The gist is we take the probability of a point being in the negative class, multiply that by the cost of a false
positive, and then multiply that by a number we can read off of our ROC curve.&lt;/p&gt;
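&lt;p&gt;The formula on the slide is not reproduced in the transcript. One common form that is consistent with the spoken
description, and which is assumed here, is: cost = P(negative) * cost(FP) * FPR + P(positive) * cost(FN) * (1 - TPR).
The second term is the positive-class counterpart of the step just described. In the sketch below, the ROC points are
invented for illustration, while the 10% prevalence and the five-to-one cost ratio mirror the example used in the
talk.&lt;/p&gt;

```python
def expected_cost(fpr, tpr, p_positive, cost_fp, cost_fn):
    """Expected per-example cost at one ROC point (assumed form, see lead-in)."""
    p_negative = 1.0 - p_positive
    # Negative-class term: how often a negative shows up and gets flagged.
    negative_term = p_negative * cost_fp * fpr
    # Positive-class term: how often a positive shows up and gets missed.
    positive_term = p_positive * cost_fn * (1.0 - tpr)
    return negative_term + positive_term


# Hypothetical ROC points: (threshold, false positive rate, true positive rate).
roc_points = [
    (0.2, 0.40, 0.95),
    (0.4, 0.20, 0.85),
    (0.6, 0.05, 0.60),
    (0.8, 0.01, 0.30),
]

# 10% minority (positive) class; a false positive is five times as bad as a false negative.
costs = {
    threshold: expected_cost(fpr, tpr, p_positive=0.1, cost_fp=5.0, cost_fn=1.0)
    for threshold, fpr, tpr in roc_points
}
best_threshold = min(costs, key=costs.get)
print(costs)
print(best_threshold)
```

&lt;p&gt;Only the ratio between the two costs matters for which threshold wins, so relative costs work as well as real
dollar amounts.&lt;/p&gt;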
&lt;p&gt;And then basically we do the same thing, but for the positive class. So we figure out how big the positive class is
and what the cost of a false negative is, and then read this number off of the ROC curve. Then we can add all this
stuff together. These orange numbers here I have put in to signify that we have a case where the minority class is 10%
of the data set. And then I have put these blue numbers in here to say that a false positive is five times as bad as a
false negative. If I happen to know that a priori, I can plug these numbers in. We can calculate these values of cost
for different thresholds and then choose the one that is the best one, in this case 3.6. So going back to that equation
really quick, am I correct in understanding that the costs can be expressed in relative terms, any set of integers or
floats that have that relationship? So you don't necessarily need to express the costs in real-life terms. You can just
say a false positive versus a false negative is x times more or less costly? Yes, yes, you're exactly correct. So in
this formula, these don't need to be real-life costs. They could be: if you happen to know that a false positive costs
your company $7 and a false negative costs $2, you could use that. But if you don't, you can also just come up with
something. I'm gonna say this one is five times as bad, or whatever. And those can be whatever numbers you want. Yeah.
Anyway, so then you pick the one with the lowest cost. And that's the threshold to use.
I want to point out this is different from the idea of special purpose learners. The first couple of times I did this
talk, this was a little confusing, I realized, because I didn't talk about it. So they're different. The idea of special
purpose learners is that you're modifying the algorithm to have this idea of weighting built into it. In cost-based
classification, when we're trying to choose the right threshold, we're doing this after we've already trained the
algorithm, after we already have a trained model. And because of that, we can use it out of the box with almost any
model, as long as it provides some sort of probability output or some sort of decision function, which is very nice and
means you don't have to go fiddling about with the interior of every single model that doesn't have this baked in. So,
good things, bad things for prediction post-processing. It's pretty straightforward. It's just, like, pick the threshold
that's the best one. It's a simple idea to explain. And it kind of gets at what we're looking for, which is to try to
optimize our metrics for some value. It is also nice that you can use it with almost anything, because most models
provide some sort of decision function. A problem is that this is not really studied a whole lot in specifically
imbalanced domains. Like, the survey paper that I am referencing for a lot of this, they only found two papers that even
talked about this, and neither of those were specifically about class imbalance issues. So, a lot
of things, right? Let's talk about just some highlights here, hit some recommendations. Based on what I've read and
what I have done in my life, here's some things I think about. I kind of think about this like Maslow's hierarchy of
needs: you need to have clean air and water before you start worrying about food. And then once you have food, you can
worry about shelter, and eventually you get a few rungs up and you start caring about, you know, your therapist calming
you down from having bad memes. So the first level here, I would say, is just establish some sort of baseline. Train
that dummy classifier, train something stupid, and compare it to your actual metrics. Unless you have a good reason
otherwise, which you very well might, if you don't know what metric to use, I would recommend using the area under the
ROC curve, because it is unbiased in these situations, for a very specific meaning of unbiased. The next thing up from
that is, if you can, try using class weights. Just try saying that your minority class is a certain amount more
important, if your algorithm supports that. And then from there, I would recommend picking your threshold smartly.
Don't just use 0.5. That's probably not the right answer. It might be, but it probably isn't. Once you get to that, if
you're trying to eke more performance out, or you're trying to address something else, I would say at this point, start
using a random sampling technique. We talked about fancy methods: we talked about SMOTE, we talked about Tomek links.
The research did not bear those out as being super great.&lt;/p&gt;
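&lt;p&gt;The random undersampling plus bagging idea from the preprocessing discussion can be sketched in plain Python. This
is a toy illustration only: the one-dimensional "midpoint" classifier, the helper names, and the synthetic data are all
made up, and each bag member is trained on a balanced undersample rather than a true bootstrap sample. For real work,
imbalanced-learn ships a &lt;code&gt;BalancedBaggingClassifier&lt;/code&gt; that may be a better starting point.&lt;/p&gt;

```python
import random
from statistics import mean


def balanced_sample(points, labels, rng):
    """Keep every minority point and an equally sized random majority subset."""
    minority = [p for p, label in zip(points, labels) if label == 1]
    majority = [p for p, label in zip(points, labels) if label == 0]
    return minority, rng.sample(majority, len(minority))


def fit_midpoint(minority, majority):
    """Toy 1-D classifier: a cut halfway between the two class means."""
    return (mean(minority) + mean(majority)) / 2.0


def bagged_predict(cuts, x):
    """Majority vote over the bag; this toy assumes the minority class sits above each cut."""
    votes = sum(1 for cut in cuts if x > cut)
    return 1 if 2 * votes > len(cuts) else 0


# Synthetic 1-D data: 200 majority points near 0, 10 minority points near 3.
rng = random.Random(0)
points = [rng.gauss(0.0, 1.0) for _ in range(200)]
points += [rng.gauss(3.0, 1.0) for _ in range(10)]
labels = [0] * 200 + [1] * 10

# Each bag member is trained on a different balanced undersample.
cuts = [fit_midpoint(*balanced_sample(points, labels, rng)) for _ in range(25)]
print(bagged_predict(cuts, 4.0), bagged_predict(cuts, -1.0))
```

&lt;p&gt;Each undersample throws away different majority points, so the individual cuts vary, and voting across the bag is
what smooths that variance out.&lt;/p&gt;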
&lt;p&gt;So the Wallace paper actually makes a really strong recommendation that in almost all imbalanced scenarios,
practitioners should bag classifiers induced over balanced bootstrap samples. And then there's another paper by
Batista, Prati, and Monard, where they find that random oversampling is really competitive with these more complex
oversampling techniques like SMOTE. These kind of seem to disagree, because one says always undersample and the other
says always oversample. My way of justifying this to myself is that the Batista paper doesn't take into account this
bagging element. They just do undersampling. I think there's a lot of variance in what they're seeing, because they're
not doing this bagging. Yeah. But either way, just try the random methods first, because they're probably good enough.
And only once you've tried doing that do I think it makes any sense to start worrying about SMOTE or any of these
really complicated techniques. Because really, at that point, you probably have a bigger problem than can be
solved by just some fancy algorithm. It's, it's you need to go find more data. You need to figure something else out.&lt;/p&gt;</description><author>Samuel Taylor</author><guid isPermaLink="true">https://www.samueltaylor.org/articles/how-to-handle-class-imbalance.html?utm_source=rss&amp;utm_campaign=st-blog-feed</guid><pubDate>Wed, 30 Oct 2019 14:04:12 GMT</pubDate></item><item><title>Thinking in Bets for Data Scientists</title><link>https://www.samueltaylor.org/articles/thinking-in-bets-data-scientists.html?utm_source=rss&amp;utm_campaign=st-blog-feed</link><description>&lt;p&gt;Data scientists are uniquely positioned to provide leadership on their teams
around risk and uncertainty. We are trusted by our coworkers to have an
understanding of experimentation and data-driven decision making. This trust can
be leveraged to improve processes, decisions, and (ultimately) the output of our
teams.&lt;/p&gt;
&lt;p&gt;In Thinking in Bets, Annie Duke describes how to make good decisions, informed
by her time as a professional poker player. Life (she argues) and data science
(I argue) are like a game of poker. In some games, like chess, each player has
perfect information. They know where all the pieces are, what moves those pieces
can make, and the conditions for winning. But poker is a game of imperfect
information. Each player knows only the cards in their own hand. While they can
intuit things from the body language or play style of other players, that
intuition is not perfect. Some players are really good at bluffing. Some
behaviors are easy to misread.&lt;/p&gt;
&lt;p&gt;Poker, thus, requires players to make decisions in a system with imperfect
information (often with high dollar amounts on the line). Doesn't this sound
like life? The stakes are high, and we don't know what the future holds, but we
have to make some decision. By Duke's definition, that's what a bet is: "a
decision about an uncertain future".&lt;/p&gt;
&lt;p&gt;In the face of uncertainty, we data scientists resort to experimentation. We can
determine the best color for a button or the best copy on a page or any number
of other things by running a well-designed experiment. While this is valuable
work, it only scratches the surface in terms of where we can apply good decision
making.&lt;/p&gt;
&lt;p&gt;Running experiments is only useful insofar as they help us unlock real value
(i.e., moving our OKRs or KPIs). Think about it like driving a car. When you
press the accelerator, the tachometer shows an increase in the RPM at which your
engine is turning. This turning is then put through a system of gears and
eventually spins the wheels &lt;a href="#fn0"&gt;[0]&lt;/a&gt;. The ability to run experiments quickly
is akin to being able to turn the engine quickly. Without a good system for
choosing the right experiments and leveraging their results, we limit our
ability to impact the business. When we view decision making as a process of
choosing the right bet, we can choose strategies that help us make better
decisions and have greater impact.&lt;/p&gt;
&lt;h2&gt;Red Teams&lt;/h2&gt;
&lt;p&gt;The strategy from Duke's book that I found most directly applicable to
my work as a data scientist is Red Teaming. Established after 9/11, red teams
have as their express goal "arguing against the intelligence community's
conventional wisdom" &lt;a href="#fn1"&gt;[1]&lt;/a&gt;.
By "spotting flaws in logic and analysis," red teams help drive intelligence
agencies closer to both the truth and a proper understanding of the uncertainty
in analyses &lt;a href="#fn2"&gt;[2]&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Within weeks of reading this book, my team and I happened to be working on
understanding our KPIs better, which involved some new analysis. This seemed
like a perfect time to apply a "red team" strategy -- as someone would posit a
result, I would see if I could disprove it. Whether I could or couldn't, I
reported on both. And then other members of the team would try to prove or
disprove my result! By this collaborative process, we came to understand the
truth of the situation where we could have easily misled ourselves.&lt;/p&gt;
&lt;p&gt;If you want to try this out, here are a few techniques I've found useful:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Explicitly try to disprove an analysis. If you have sufficient time, working
   to show how an analysis is wrong can be really valuable, even if it
   withstands scrutiny. You will likely find some small issues in the way
   something is calculated, some ambiguous terms or metrics that could be
   misunderstood, or an invalid assumption. These findings can lead to further
   quantification of their impact on the original analysis. In this way, we can
   gain a better understanding of how confident we should be in said analysis.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Try to reproduce an analysis. Without looking at the code for the original
   analysis, try to get the same result via a slightly different pathway. If the
   original author used raw log data, see if you can answer the question using
   the data warehouse. Come up with new metrics that should move in the same
   direction as those used in the first analysis. If two people come to the same
   conclusion independently, we gain confidence in that conclusion.&lt;br /&gt;
   You've got to be careful with this one! Knowing the hypothesis that is being
   tested can skew the analysis you do (even unconsciously) &lt;a href="#fn3"&gt;[3]&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Answer an adjacent question. Sometimes there just isn't enough time to fully
   reproduce or disprove an analysis. In these cases, we can test an upstream
   cause or a downstream effect instead.&lt;br /&gt;
   For instance, if our analysis finds that sales of trucks decrease when gas
   prices are high, we could look up years in which gas prices were high and see
   if truck sales were down.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;While these techniques are most effective when applied by a separate person who
hasn't been influenced by the same process/data as the original author, I have
found value in explicitly shifting my perspective to "red team" myself. Working
specifically to disprove my own analysis, I end up understanding the results in
greater depth.&lt;/p&gt;
&lt;h2&gt;Be humble&lt;/h2&gt;
&lt;p&gt;Humility is a key element of truth-seeking. We must remember that the point of
our work is not to prove to our teammates that we are geniuses; we're trying to
produce some positive result for our employer. We are more likely to find the
truth when we seek it rather than pursuing our own glory.&lt;/p&gt;
&lt;p&gt;I believe that sometimes our drive to compete gets in the way of our humility. I
am not a particularly competitive person by nature, but if you're reading this
and you are competitive, Duke has some advice for you:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Keep the reward of feeling like we are doing well compared to our peers, but
change the features by which we compare ourselves: be a better credit-giver than
your peers, more willing than others to admit mistakes, more willing to explore
possible reasons for an outcome with an open mind, even, and especially, if that
might cast you in a bad light or shine a good light on someone else. In this way
we can feel that we are doing well by comparison because we are doing something
unusual and hard that most people don’t do.  That makes us feel exceptional.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;When red-teaming my own analysis, I sometimes find things that cast doubt on it.
Sharing my results, I'm tempted to leave these observations out. The desire to
present my findings in the best light possible is (I believe) a natural one, yet
one I must work against. Duke writes that "if we have an urge to leave out a
detail because it makes us uncomfortable… [it is] exactly the detail we must
share."&lt;/p&gt;
&lt;p&gt;Bringing reasons to doubt to the table along with the analysis itself helps us know
what information we need to get. Sometimes a little bit of additional analysis
can alleviate the doubt. Sometimes the concern will turn out to reflect a small
enough risk that we don't need to address it. And sometimes the only way to get
more information is through an experiment. The important thing is bringing
uncertainty to the table so we can address it directly.&lt;/p&gt;
&lt;h2&gt;Communicating uncertainty&lt;/h2&gt;
&lt;p&gt;We communicate about uncertainty all the time. When asked if we're going to an
after-work social event, for instance, we say that we "might go" (which
typically means we are definitely not going) or that we will "probably go" (it's
a bit of a tossup). These phrases are examples of words of estimative
probability, or WEPs. In colloquial usage, these casual WEPs are just fine,
but they are less helpful when we're trying to make a good decision.&lt;/p&gt;
&lt;p&gt;For one, different words mean different things to different people. Andrew
Mauboussin's research shows that words like "maybe", "probably", and "usually"
are interpreted to correspond with wide ranges of probabilities depending on the
audience. For instance, when someone says that an event "might happen", her or
his audience could interpret that as an event with probability between 25% and
55%. That's a huge range!&lt;/p&gt;
&lt;p&gt;By using WEPs in communication, we run the risk that our audience will
misinterpret the likelihood we think a certain event has. But there are ways to
overcome this. Mauboussin advocates for explicitly giving a percentage alongside
words of estimative probability. This approach is used in the medical research
field, where institutional review boards require researchers to inform people of
the risks in treatments using WEPs &lt;a href="#fn4"&gt;[4]&lt;/a&gt;. These words should be
accompanied by a percentage; for instance, a researcher might inform a
participant using language like, "This side effect is rare (will happen to less
than 1% of subjects)".&lt;/p&gt;
&lt;h2&gt;Parting words&lt;/h2&gt;
&lt;p&gt;Certainty is alluring. But as data scientists, we should know better! The world
is filled with uncertainty, and only by defining and quantifying it can we drive
toward an accurate understanding of reality. This understanding, then, enables
us to make higher quality decisions. And this improvement in decision-making
doesn't have to stop at the individual; we can bring this idea to our teams,
departments, and companies.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;Footnotes:&lt;/p&gt;
&lt;ol start="0"&gt;
&lt;li id="fn0"&gt;I think this is how it works; I'm really not much of a car person.&lt;/li&gt;
&lt;li id="fn1"&gt;Neal K. Katyal. 1 July 2016. "Washington Needs More Dissent Channels", &lt;a href="https://www.nytimes.com/2016/07/02/opinion/washington-needs-more-dissent-channels.html"&gt;The New York Times&lt;/a&gt;&lt;/li&gt;
&lt;li id="fn2"&gt;Ibid.&lt;/li&gt;
&lt;li id="fn3"&gt;Duke here references Richard Feynman, but I can't find a direct citation. Still, this seems to jibe with my own experience.&lt;/li&gt;
&lt;li id="fn4"&gt;&lt;a href="https://web.archive.org/web/20130527211344/http://www.utc.edu/Administration/InstitutionalReviewBoard/faq.php"&gt;University of Tennessee, Chattanooga.&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;</description><author>Samuel Taylor</author><guid isPermaLink="true">https://www.samueltaylor.org/articles/thinking-in-bets-data-scientists.html?utm_source=rss&amp;utm_campaign=st-blog-feed</guid><pubDate>Sun, 20 Oct 2019 14:03:36 GMT</pubDate></item><item><title>Using Open Source Tools for Machine Learning</title><link>https://www.samueltaylor.org/articles/open-source-machine-learning.html?utm_source=rss&amp;utm_campaign=st-blog-feed</link><description>&lt;div class="embed-responsive"&gt;&lt;iframe width="560" height="315" src="https://www.youtube.com/embed/V1Czgw_vxj8" frameborder="0" allowfullscreen&gt;&lt;/iframe&gt;&lt;/div&gt;

&lt;p&gt;Delivered at:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://allthingsopen.org/talk/using-open-source-tools-for-machine-learning/"&gt;All Things Open 2019&lt;/a&gt;.
  Slides available &lt;a href="/static/pdf/open_source_ml.pdf"&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pydata.org/austin2019/schedule/presentation/2/machine-learning-crash-course/"&gt;PyData Austin 2019&lt;/a&gt;.
  Recording available &lt;a href="https://www.youtube.com/watch?v=pRX1sLG_6cw"&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Find me on Twitter &lt;a href="https://twitter.com/SamuelDataT"&gt;@SamuelDataT&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Transcript&lt;/h2&gt;
&lt;p&gt;Have you ever applied for a credit card? I know I have. A few weeks ago, I was
up late at night. And I suppose that my idea of a fun time is to try to get some
credit card rewards. So all my friends are out partying, and I was like, "Man,
I'm going to get these airline miles!" I start applying for this card. And I get
to a certain point, and it has me fill out all this really personal
information. And then I click a button that says submit, and the page loads and
within a split second, it's telling me whether or not I got this credit card,
which blows my mind, because first, I'm sure they don't have anybody reading this
application at 2am. And secondly, even if they did, that person clearly can't
make a good decision about this within a split second.&lt;/p&gt;
&lt;p&gt;So I'm like, how did they do this? And the secret is that they are using machine
learning. In other words, what they're doing is looking at past information and
using that to come up with some mathematical formula for determining whether
they should extend me a line of credit. Because math is really fast, that page
can load really fast.&lt;/p&gt;
&lt;p&gt;I work for a company called Indeed, as a data scientist. This is a slide that is
in just about every presentation that is ever given at Indeed. People are really
serious about this mission that we have as a company: we help people get jobs.
So I feel very obligated to have this in the slide deck. And it would not be a
true Indeed presentation if I did not mention that we help people get jobs. If
you're interested in any of this stuff, please come talk to me afterward. I
would love to hear what you are into.&lt;/p&gt;
&lt;p&gt;For those of you
in this room, hopefully you have some idea of what this talk is, but just to
clear the air a little bit: we're going to talk about machine learning today.
This is an introductory-level talk that is extremely friendly to newcomers. So
if you know nothing about machine learning, you are welcome here and I'm
incredibly glad you're here. You're going to learn a lot today, and it's going
to be really fun. At the same time, if you do know a little bit more about
machine learning, I hope that this is helpful to you. I know that in my own
experience, I find it really helpful to see what other people are doing and the
different ways that they're applying machine learning techniques. So I hope that
by the use case approach that we're going to be taking today, you'll be able to
see some problems that I've run into and maybe think differently about your own
problems. We will be going through machine learning in an applications kind of
way. I have found that I learn best by doing, and while we can't necessarily all
do in this room, I think stepping through real-world problems can help us
understand why we need certain things in machine learning a lot better than a
strictly theoretical approach. At the same time, I respect the theory of machine
learning quite a bit. There has been a lot of really good science and research
that has gone into the theory of machine learning, which can help us do this job
better and apply machine learning a lot better. So we want to make sure
that even though we focus on application, we respect the theory.&lt;/p&gt;
&lt;p&gt;There's a lot of things this talk isn't. First of all, I don't have a PhD, so
you don't need a PhD to do this. There's not a specific credential that's going
to make you great at machine learning. And because we only have 45 minutes
today, this is going to have some code examples in it for sure, but it's not a
tutorial-style, hands-on thing. So this is not an end-all-be-all reference. My
goal is that by the end of this, you'll have some idea of what machine learning
is, have your appetite whetted, and want to go learn more about this. Come up
and talk to me afterward. I'm happy to send more resources along or talk to you
about what good next steps are.&lt;/p&gt;
&lt;p&gt;Here is the way that we're going to be doing this today. We will start off with
some stuff that we just need to know, some groundwork on what machine learning
is, and then we're going to be walking through a set of use cases. In each of
these use cases, we will discover something about machine learning, maybe some
different techniques that we have to apply.&lt;/p&gt;
&lt;p&gt;Let us start off with just what even is machine learning. Okay, if you're in
this room, and you have heard the phrase machine learning before, can I get you
to raise your hand? Okay, it looks like we don't have any liars in here, which
is great. A lot of times when I've asked this question, there will be people who
won't raise their hand. I'm like, "You're lying. I've said it already in this
talk. Come on." If you feel like you've used machine learning, definitely in
production, or if you've used it in some way that you found interesting or fun,
could you raise your hand? Okay, awesome. We've got a great mix of people in
here. So if you are one of those intro people and you're a little bit shy of
coming up to me, one of those people who just raised their hands I'm sure would
also be very happy to help you. Cool.&lt;/p&gt;
&lt;p&gt;So let's talk about machine learning. If you're new to this, I have found that
getting into a new topic can kind of feel like riding the subway in a foreign
city, where you'll walk up to someone and say, "Hey, I'm trying to get to this
place, how do I get there?" and they'll say, "Oh, just get on the red line to
this stop, and then take that to the green line, and then you're there, it's
fine." But if you don't know what the subway system looks like, it's going to be
difficult for you to put that into your own mental map and remember it and get
through. So to provide a little bit of a map: this is a sort of hierarchy or
taxonomy of machine learning that a lot of people use, where we talk about
supervised problems and unsupervised problems. And then there's a lot of other
stuff in the field that we won't really talk a whole lot about today.
We'll start with supervised machine learning, and that's what the bulk of this
talk is about. Supervised machine learning is machine learning where you have
defined inputs and defined outputs. We break this further down into
classification problems and regression problems.&lt;/p&gt;
&lt;p&gt;We'll talk about classification problems first. This is the example that
I gave at the beginning of whether or not we're going to give somebody a credit
card. The input data is the stuff I have highlighted in yellow here. So someone
might come to us and say, "Hey, I am 50 years old, and my net worth is $250,000."
And from that we make a decision on whether or not we want to give them credit.
Obviously, this is a simplified example. And as you can tell by this hand-drawn
illustration, I made this data up. I don't know, there probably is a super
rich 12-year-old out there who just has $500,000. But that just ended up
happening because of the way I drew these things to be easy to explain.&lt;/p&gt;
&lt;p&gt;So what do we do in classification problems? For all of the hype and
excitement and joy that there is around machine learning, the dirty secret is
that all we're doing is drawing a line. That's the whole thing that we're
going to be talking about today. It's just fancy line drawing. And the real
magic of machine learning, which is really just math, is figuring out good ways
to draw those lines that end up being helpful to us in the real world. So we
draw this line, and that's our classifier, as one thing you'd call it; you
could also call it a model. And when we then want to understand, for a new
point, is this person someone we should give credit to, we can just put the
point on the graph and look: okay, is this on the approve or deny side? In this
case it happens to be on the approve side. So we say that we would approve this
person for a credit card.&lt;/p&gt;
&lt;p&gt;Regression is another kind of supervised machine learning. It is very similar to
classification in that we have input data and we have output data. In this
example, the input is on the x axis, someone's net worth, and the output is on
the y axis, the size of loan we are willing to give this person. In this case,
we only have one input value, whereas in the last example we had two different
input values. You can have as many as you want, or as few as you want, as long
as you have at least one, because if you have no input information, you're just
rolling the dice anyway. Again, what we're doing here is we're going to draw a
beautiful, fantastic line. This is the line that we end up drawing; it seems
like it gets close to a lot of where these little x's are on our graph. And we
say that this is our model. Then, when we want to use it, say a new customer
comes in and says, "I have $500,000." We will draw a line from where they are
on the x axis up to the line, then draw over to the y axis, and we can determine
the line of credit we're willing to extend to this person.&lt;/p&gt;
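&lt;p&gt;To make the line drawing concrete, here is a minimal sketch in plain Python that fits a straight line by ordinary least squares and then reads a prediction off it. The net worth and loan numbers are made up for illustration (they are not the data from the slide), and least squares is just one common way to choose the line.&lt;/p&gt;

```python
# Made-up training data: net worth (in thousands of dollars) and the
# size of loan (also in thousands) that was extended to each customer.
net_worth = [100, 200, 300, 400, 600]
loan_size = [10, 22, 29, 41, 58]


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(net_worth, loan_size)

# Go up from 500 on the x axis to the line, then over to the y axis.
predicted_loan = slope * 500 + intercept
print(f"Suggested line of credit: about ${predicted_loan:.1f}k")
```

&lt;p&gt;The fitted slope and intercept are the whole model here: using it on a new customer is one multiplication and one addition.&lt;/p&gt;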
&lt;p&gt;So that is all supervised machine learning, when we have a defined input and
defined output. Unsupervised machine learning, as you might guess from the
name, is different from supervised machine learning. A key algorithm here that
you will probably run into is called clustering. And it's kind of weird, because
in the last examples, we had inputs and outputs. And in this example, we just
have data, we have some shapes with colors. And we're like, cool, I have this
data, I want to understand something better about it. One thing we might do is
walk up to a computer and say, "Hey, turn this into three groups for me." And it
might say, "Here, look, I made these great groups for you, these are amazing."
Or it might say, "Here are these groups, they're grouped by shape, look at how
great this is." And you can say, "Thank you, computer. This is very kind of you
to help me understand the underlying structure of this data." But we don't have
a defined input and output that we're trying to figure out.&lt;/p&gt;
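&lt;p&gt;As a rough sketch of asking a computer for three groups, the snippet below uses k-means from scikit-learn. The talk only says "clustering", so the choice of k-means and the nine made-up points are assumptions for illustration.&lt;/p&gt;

```python
from sklearn.cluster import KMeans

# Made-up 2D points that fall into three visually obvious clumps.
points = [
    [1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
    [8.0, 8.0], [8.2, 7.9], [7.8, 8.1],
    [1.0, 8.0], [1.1, 8.2], [0.8, 7.9],
]

# Ask for three groups. There are no labels to learn from, only the points.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
groups = kmeans.fit_predict(points)

for point, group in zip(points, groups):
    print(point, "is in group", group)
```

&lt;p&gt;The group numbers themselves are arbitrary; what matters is which points end up sharing a group.&lt;/p&gt;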
&lt;p&gt;There is a lot of other stuff in the field of machine learning; there is a lot
of really active research going on. This includes things like reinforcement
learning and active learning techniques. I will say, this looks like a
presentation slide, but for me this is actually an insurance policy against
people being mad at me on Twitter. So if you think your favorite algorithm isn't
mentioned, it is. Look, it's right on this slide. Yay.&lt;/p&gt;
&lt;p&gt;So to summarize, in machine learning, we are trying to use data to approximate
some function that we care about. We have some &lt;em&gt;f(x)&lt;/em&gt; that takes an input of
x. In our classification example, that would be someone's age and
their net worth, and we want to predict whether or not we should give them
credit. The problem is that we don't necessarily know what that function is.
For instance, if we're trying to determine whether somebody will
default on a loan, they might have some weird random medical expense that causes
them to default on the loan. And there's no way we could have known that. The
thing that gives us hope is that we can gather some data that is measured from
this f of x, but it has some noise associated with it. So there's going to be
those situations that we don't know about and can't figure out, and
these machine learning techniques all try to get around that noise and try to
understand what the truth underlying the noise is.&lt;/p&gt;
&lt;p&gt;So in summary, the way I think about machine learning is that it is a set of
algorithms which attempt to find a g of x, which is a good approximation for f
of x.&lt;/p&gt;
&lt;p&gt;With all that said, let us begin with our first use case of the day. This
will be the credit card application stuff. On each of these, we will be going
through these five questions. So if you get this flow, that's what we're
going to be doing all day. The first thing we'll do is talk about what the
problem is. In this case, we're wondering if we as the bank should issue a
certain consumer a credit card. The data looks something like this. This is the
same data that I showed earlier; it's just a pretty good table because I figured
out how to use Google Slides better later on. And what we see is we have
the inputs of age and net worth, and the output is whether or not to give them credit.&lt;/p&gt;
&lt;p&gt;The most fun part of this entire presentation is going to be this
question. This is the audience participation part. If you have some pent-up
anger, or you feel like you really need to have input into this discussion, this
is your time. Shout at me. What kind of machine learning problem is this?
Classification? Does anyone else think it's something else? Wow. Very, very hive
mind, but very correct. Good job. Yes, this is classification.&lt;/p&gt;
&lt;p&gt;Okay, and the next thing we'll talk about is the solution. There are so many
good machine learning libraries out there. No matter what language you're using,
I am pretty sure that it has a great open source library for machine learning.
In Python, there's one called scikit-learn that is incredible. R has just a
whole set of things that you can do. Java has a library called Weka. Today,
we're going to be using Python because it's basically executable pseudocode,
and I believe that everyone in the audience will be able to follow along. So
let's do that. This will get you familiar with the way of using scikit-learn
specifically, but other libraries have similar thoughts and similar patterns.&lt;/p&gt;
&lt;p&gt;The first thing we do, obviously, is just import the class that we need to use.
We can then set up the data, and I've drawn that graphic on the right here,
so you can see that it's that same data. Then we instantiate the object. And now
we get to the critical part for scikit-learn. There are always these two methods
when you're building a model. You always fit first, and that will actually do
the work of figuring out where this line is. And then when we
have a question about a new point, we call predict, and that will tell us
whether this point is supposed to be approved for credit or rejected. Fit and
then predict.&lt;/p&gt;
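&lt;p&gt;The steps just described (import, set up data, instantiate, fit, predict) might look roughly like the sketch below. The applicants are made up, and the scaling step is an addition in this sketch rather than something shown in the talk; it keeps age and net worth on comparable scales so the classifier behaves well on such tiny data.&lt;/p&gt;

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Made-up applicants: [age in years, net worth in thousands of dollars].
X = [[25, 20], [30, 40], [35, 30], [45, 300], [50, 250], [60, 400]]
y = ["deny", "deny", "deny", "approve", "approve", "approve"]

# Instantiate the model. Scaling the inputs first is an assumption of this
# sketch, added so both features are on comparable scales.
model = make_pipeline(StandardScaler(), LinearSVC(random_state=0))

# fit does the work of figuring out where the line goes.
model.fit(X, y)

# predict tells us which side of the line each new point falls on.
new_applicants = [[55, 350], [22, 10]]
decisions = model.predict(new_applicants)
print(decisions)
```

&lt;p&gt;Most scikit-learn estimators follow this same fit-then-predict shape, so swapping in a different classifier usually means changing only the import and the line that instantiates the model.&lt;/p&gt;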
&lt;p&gt;So we might now wonder, how accurate is this? Is this a good model? In this
case, because I drew this data to be purposely easy, this is an incredible model
on this data. It is 100% accurate, it is beautiful. But in the general case,
it's not always going to be that easy to have a situation like
this. Let me tell you how I made these graphs. What I did was I made up an
f of x, I just invented a function f of x. And then I drew a bunch of points
from f of x and added some noise. So that's why you see this sort of scattershot
thing around probably where the true function is, right? By show of hands,
who thinks that Model A here, this blue line that's sort of sloping down and to
the right, is doing better in terms of error than Model B does? Does anyone think
that Model A is closer to the true function than Model B? Right, zero hands up.
This is not a trick question. You are right: Model B is much closer to the true
function, which is this green line going up and to the right. And this is easy for
humans to figure out.&lt;/p&gt;
&lt;p&gt;But we need to come up with a way of helping computers figure this out. The way
that we generally do this is by randomly splitting our data into testing data
and training data. This can be done with a function in scikit-learn that's
called train_test_split. But the essence of what we do here is take our data and
just randomly assign 20% of it to be testing data. And when we do that,
because that assignment has been made completely at random, when we train our
model on the training data, we can then call predict for each of the testing data
points, and figure out whether our model was right or not. So then, when we have
that done, we see these little colored regions on this graph. And you'll notice
that on each graph, the colored regions are the same, but the points are
different. The reason the colored regions are the same is that you should
only train your model on the training data; that's why it's called the training
data. And then when we test it on the testing data, we can get some estimate for
how good our model is going to be on real-world data. So you can see that we made
some errors here. Even in the training data, there are errors. And then we can
see that there are some errors in the testing data. And we might see that out of
our, you know, 20 points we have here, we got two wrong. So we might expect that
for new points that come in, we're going to be wrong on roughly 10% of
them. And that might be acceptable and might not be acceptable, depending on
your problem.&lt;/p&gt;
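&lt;p&gt;Here is a hedged sketch of that split using scikit-learn's train_test_split. The data is synthetic (generated with make_blobs as a stand-in for the points on the slide), and the classifier choice is only for illustration.&lt;/p&gt;

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in data: 100 noisy points belonging to two classes.
X, y = make_blobs(n_samples=100, centers=2, random_state=0)

# Randomly hold out 20% of the points as testing data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Train only on the training data.
model = LinearSVC(random_state=0)
model.fit(X_train, y_train)

# Predict each held-out point and count the mistakes.
predictions = model.predict(X_test)
mistakes = sum(1 for guess, truth in zip(predictions, y_test) if guess != truth)
error_rate = mistakes / len(y_test)
print(f"{mistakes} wrong out of {len(y_test)} test points ({error_rate:.0%})")
```

&lt;p&gt;Fixing random_state makes the split repeatable, which helps when comparing runs; leaving it unset gives a different random 20% each time.&lt;/p&gt;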
&lt;p&gt;Ideally, at this point, you would just calculate the real cost of each of these
kinds of errors. So what I mean by that is, if we're trying to predict from
radar data whether there's a warhead coming at the United States, and we say
that there's a warhead coming at the United States, and there actually isn't
one, that is a huge mistake, and a bunch of people are going to die,
probably everybody, which is really bad. By contrast, if somebody is able to put
their finger on my phone and get into my phone, they're going to go to my
gallery and they're going to read all of my memes that aren't funny and they're
going to make fun of me and that's going to hurt my feelings very deeply. Which
is, you know, that's going to be hours in therapy after that, for sure. And that's
a much higher cost than if I have to tap the back of my phone with my finger
again; then I'm just going to get a little annoyed. And in real problems, you
see this happen often, where different errors have different levels of
cost. So ideally, at this point, you would be able to figure out what the cost
of your model is, and then figure out which one is the best.&lt;/p&gt;
&lt;p&gt;In real life, we don't always know what the real cost is. So we use these
error functions to help us figure out what a good model is when we don't know
what the real cost is. Mean squared error is a really common function that we use
to determine whether a regression model is doing well. I have a graph here of the
true values and the predicted values for a certain data set. We take the
predicted value and the true value, subtract them, and then
square that difference to get a positive number. And then we add them all
together and divide by the number of points to get a mean. And now we say this
model has an error of 18. And then if we were comparing it to another model, and
the other model had an error of 17, we can know that the other model is better
than this one. For classification problems, we could say that if we have four
points, and we have determined that in real life half of them are blue and
half of them are orange, but we predicted that three of them are blue and one
of them is orange, we can see that we've made an error in that one case, and
then say that our classification error is 25%: roughly 25% of the time we get it
wrong.&lt;/p&gt;
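&lt;p&gt;Both error functions are short enough to write by hand. The sketch below uses made-up numbers (not the values from the slide) for the regression case, and the four-point blue and orange example from above for the classification case.&lt;/p&gt;

```python
def mean_squared_error(true_values, predicted_values):
    """Average of the squared differences between predictions and truth."""
    squared_diffs = [
        (predicted - true) ** 2
        for true, predicted in zip(true_values, predicted_values)
    ]
    return sum(squared_diffs) / len(squared_diffs)


def classification_error(true_labels, predicted_labels):
    """Fraction of points whose predicted label differs from the true label."""
    wrong = sum(
        1 for true, predicted in zip(true_labels, predicted_labels)
        if true != predicted
    )
    return wrong / len(true_labels)


# Regression: made-up true and predicted values.
mse = mean_squared_error([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 9.0, 8.0])

# Classification: truly two blue and two orange, predicted three blue.
error = classification_error(
    ["blue", "blue", "orange", "orange"],
    ["blue", "blue", "blue", "orange"],
)

print("Mean squared error:", mse)
print("Classification error:", error)
```

&lt;p&gt;Squaring keeps every term positive and punishes large misses more than small ones, which is one reason this function is so common for regression.&lt;/p&gt;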
&lt;p&gt;So lessons learned is the last part of each use case. In this case: this
stuff is pretty neat. It's not hard to do this; you can pip install
something and get up and running in a matter of minutes. It's not as
intimidating as it might sound. Another important lesson that we learned is that
when we split our data into training data and testing data, we can
get an estimate for how good our model is.&lt;/p&gt;
&lt;p&gt;Let us move on to our next use case. This will be talking about teaching a
computer sign language. So what's the problem? The problem is, I don't know sign
language. But there are deaf people who only communicate in sign language that
I would love to communicate with, and I don't have a way to do that. I was
trying to come up with a way that I could solve this problem. And I had recently
gotten this little kind of toy that sits on your desk, and it has a little set
of infrared LEDs in it, and you can plug it into your computer. It looks up at
your hand, it has a little camera in it and it shines IR at your hand, and it
can figure out what positions your hands are in. So this
is an example of someone who's sort of holding their hands up like this
over the little sensor. And you can see that each of these little balls on the
screen is a point in three-dimensional space. And they're generally joints on
your hand. If you look at where these line up, there are the ends of your
fingertips and there's one in the middle of the wrist; it gives you certain
points on the hand. So we were trying to figure out, can we use this thing to
teach a computer sign language, to have a computer translate sign language for
somebody? That'd be really cool. And we were going to this hackathon that was
happening at Texas A&amp;amp;M. And we thought, okay, sign language is actually really
hard. I don't think we're going to be able to do all of that in 24 hours. So we
figured, what if we just do American Sign Language? And then even further, what
if we just do the alphabet? And we thought, okay, maybe we can get something
that will be able to tell us what letter of the alphabet we're holding over this
sensor.&lt;/p&gt;
&lt;p&gt;So now we move into what the data looks like. We plugged the thing in, and we
held our hand over the sensor. And we said, "Hey, this is an A," and we just
clicked A on our keyboard a bunch of times and moved our hand around and got some
training data. Then we made a B and held it over the thing and hit B on our
keyboard a bunch of times and moved our hand around to get some training data. So
what you see here is we have X, Y, and Z points for each joint in the human hand
(it gives you 20 points), and then we have as our output the value of the sign,
so that's an A, a B, a C, etc. So there's 26 of those in total.&lt;/p&gt;
&lt;p&gt;Here we go. Are y'all ready? What kind of machine learning problem is this?
Supervised learning? Yes. Classification? Both correct; classification is a
certain kind of supervised machine learning. Great job. Yeah. And it's
classification because there are only 26 values. Great work.&lt;/p&gt;
&lt;p&gt;So when we started to solve this, the first thing we had to do was pick a model.
And there are a lot of different algorithms out there. The one that we showed
earlier was called a linear SVC. But there's a lot of machine learning
algorithms, and we didn't know which one was going to work best. So we started
off by splitting our data into training data and testing data. Then we just got
a bunch of models together, and we trained them all on the training data. And
then we evaluated them all on the testing data. And we picked the one that did
best on the testing data. And we didn't really know what we were doing at this
point. But as I discovered later in life, this is not the worst way to do this.
As long as you don't repeatedly do this, you will end up having a pretty good
model by doing this. You can run into a problem where if you do this over and
over and over again, what you end up selecting for is models that are really
good at your testing data. And that testing data might in some ways differ from
real life. So as long as you're not doing this too much, you'll be okay.&lt;/p&gt;
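&lt;p&gt;As a rough sketch of that selection loop (not the code from the hackathon), here is a standard-library-only version. The two toy models, the made-up points, and the 70/30 split are all assumptions for illustration; in practice the candidates would be real scikit-learn estimators.&lt;/p&gt;

```python
import random


class MajorityClass:
    """Always predicts the most common training label."""

    def fit(self, X, y):
        self.label = max(sorted(set(y)), key=y.count)
        return self

    def predict(self, X):
        return [self.label for _ in X]


class NearestNeighbor:
    """Predicts the label of the closest training point."""

    def fit(self, X, y):
        self.X, self.y = list(X), list(y)
        return self

    def predict(self, X):
        def closest_label(p):
            dists = [sum((a - b) ** 2 for a, b in zip(p, q)) for q in self.X]
            return self.y[dists.index(min(dists))]

        return [closest_label(p) for p in X]


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up data: two well-separated clusters labeled "A" and "B".
points = [(i * 0.1, i * 0.05) for i in range(10)]
points += [(5 + i * 0.1, 5 + i * 0.05) for i in range(10)]
labels = ["A"] * 10 + ["B"] * 10

# Shuffle once, then hold out 30% of the rows as testing data.
rows = list(zip(points, labels))
random.Random(0).shuffle(rows)
cut = len(rows) * 7 // 10
train, test = rows[:cut], rows[cut:]
X_train, y_train = [p for p, _ in train], [label for _, label in train]
X_test, y_test = [p for p, _ in test], [label for _, label in test]

# Train every candidate on the training data, score it on the testing data.
candidates = {"majority_class": MajorityClass(), "nearest_neighbor": NearestNeighbor()}
scores = {
    name: accuracy(y_test, model.fit(X_train, y_train).predict(X_test))
    for name, model in candidates.items()
}
best = max(scores, key=scores.get)
print(scores, "best:", best)
```

&lt;p&gt;The testing data is only touched once per candidate here. Re-running this loop many times with tweaked candidates is what starts to select for models that happen to fit the testing data.&lt;/p&gt;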
&lt;p&gt;So once we had this model, we figured, okay, it's cool, we have a
model, but we can't just tell people it's a model for sign language; we
have to build some sort of application. And the first thing we tried to do was
make a keyboard. And that did not work very well at all. We could not figure out
a good way to tell when the hand was changing between signs, and the accuracy
of the model was actually not as good as we wanted it to be. So sometimes we
would make, like, a J, and J is actually a sign that moves. And we were doing
everything static, so there are certain signs that we just didn't have a good
way to characterize and we weren't doing super well on. So we tried making a
keyboard; it did not work. But the interesting thing was, it was good enough to
make a little Rosetta Stone for sign language kind of thing. And I think this
was one of the things that we learned was really important: it's not just about
the model.&lt;/p&gt;
&lt;p&gt;So here's a little demo that I'll show you, assuming the Wi-Fi works. It does.
Yeah. So this is what the game looked like; you can see my messy desk. But it
would sort of show you a sign and say, hey, make this letter, it's a B, or J, or
whatever, and give you some amount of time. And you would go through and make
the letter. And once you got it right, it would, you know, give you points
and reward you with some place in the leaderboard at the end. So I made a little
Rosetta Stone kind of thing that we called Sign Language Tutor.&lt;/p&gt;
&lt;p&gt;The code for this is available. If you just want to see how this stuff kind of
works, feel free to go look at this. And there's a link to the slides at the end
of this. Also, if you want to just go there, you can click on this link and
it'll load. Leveraging a lot of open source tooling was a really helpful way to
get through this in 24 hours; scikit-learn was obviously a big plus. And then we
also used Redis and Flask as ways to make this possible.&lt;/p&gt;
&lt;p&gt;So let's talk about some lessons here. It's really important that you come up
with a good way to define the problem that you're working on. We originally
started with saying sign language, this is what we wanted to tackle. But what we
realized was you have to scope it down much smaller than that. Oftentimes, the
best way to solve a big problem is to break it into smaller problems, and then
solve each of those individual smaller problems. This is something that I didn't
realize how practical it was going to be in real life. This is something that I
run into all the time at work: we want to solve this really large problem, and
we don't quite know how to do that until we can find a smaller subset of it that
we can solve. And that's the important thing about limiting scope: figuring
out what we can actually achieve in a reasonable amount of time to prove that
this is a valuable thing to do. We also talked a little bit about how to select
models; it's important to do that, and this is a reasonable way of doing that.
Critically, though, and this is another thing that I wasn't expecting would be a
lifelong lesson out of the hackathon, the model isn't the only thing
that matters. You could have a model that's not good enough to be a keyboard,
but is good enough to be a language learning game. And this is something that if
you're working in a corporate setting, you'll want to work with your product
people and you'll want to go talk to your customers and really understand what
they need out of this and figure out, well, even if it doesn't solve
this use case entirely, maybe we can reduce their workload by 50% or something
like that. And that can still be a really valuable way to apply machine
learning.&lt;/p&gt;
&lt;p&gt;Next use case: let's talk about forecasting energy load. We're going
to go through those same five questions. So the first thing, what is the
problem? The problem here is that we need to know when to schedule energy
production, by which I mean, if we pretend for a second that we operate an
energy grid and we're trying to deliver power to a lot of residential and
commercial customers, we need to know when they're going to want energy, when
they're going to use energy, because we don't have good ways of storing it for
very long. I'm not a hardware person at all. I know nothing about engineering,
like real engineering. But I don't think batteries are very good right now.
Like, that's the sense I have. And so we often have to schedule when we spin up
our power plants in order to get that to be close to the time when people need
the energy so that it can get to them, or something like that. This is not an
entirely hypothetical problem. There's this agency called the Electric
Reliability Council of Texas (ERCOT), and I live in Texas, so this is why
this is a relevant example for me. But for those of you who are not familiar,
because you would have no reason to be familiar with the energy system in Texas,
we have this deregulated energy market where you can buy power from, like,
whoever you want, and those people selling you power then are turning around and
buying it from other people, and it's ERCOT's job to manage that grid and make
sure that things are happening at the right times. They sort of divide it into
these zones that you see up here by weather. Because in Texas, as I'm sure also
here in North Carolina, we love our air conditioners in the summer; we like to
not sweat. So the weather is the driving factor for energy in most of
Texas.&lt;/p&gt;
&lt;p&gt;So let's talk about what the data looks like here. We have for each of these
weather zones, some amount of power being used on an hourly basis.  And what we
kind of have is the input data is the day and the hour. So just like the time
that energy is being used, and then the output data is we could pick any one of
these weather zones and say, okay, we want to build a model that can predict,
you know, overall usage, or we want to build a model that can predict usage just
in the south region or whatever. I just think this graph is also kind of fun. So
this is a graph of like energy usage over time. And you can see the seasonality
in here you can see when it's summer because there's these big spikes where
people are using their air conditioners a lot more. And then you can also kind
of see where winters were colder because you'll see people suddenly using their
heaters more, which is interesting. So even at this point, this is kind of cool.&lt;/p&gt;
&lt;p&gt;But at this point, we are now to the question of what kind of machine learning
problem this is. We're trying to predict how much energy is going to be used at
a certain hour of the day. Regression? Does anyone think it's anything else?
Regression. Okay. Yes, it is regression. Yes, thank you all for your
participation, I really appreciate that. So, it is a regression problem.&lt;/p&gt;
&lt;p&gt;This is kind of an interesting, different regression problem than the earlier
example we had of trying to predict how much of a credit line you should extend
somebody. And the reason is that time series data exhibits
seasonality. So this is looking at the overall system load by week, and you can
see that we have these ups and downs. And these correspond with the seasons of
the year, because human behavior often maps pretty well to the seasons. You see
this a lot in time series data. And by time series data I simply mean
data where you have some time component to it.&lt;/p&gt;
&lt;p&gt;If you're using time series data, and you're doing what I already told you to
do, which is randomly split training data from testing data, you're going to
leak information and that will hurt you bad. So let's sort of look closely at
this orange point here. You'll see that it's kind of surrounded on both sides;
both earlier and later, there are blue points. And if this orange point is
a testing data point, and we try to predict what the energy value is going to be
for that specific day, the blue points around it give us a lot of information
about what that orange point is. And so, what you see happening is
that if we split the data randomly, our model ends up knowing the future with
respect to some of our testing data points, which is not a good thing, and it
won't happen in real life. So you can trick yourself into thinking you have a
really good model when you actually don't.&lt;/p&gt;
&lt;p&gt;When you're using time series data, it's important that you split based on the
time. So instead of doing that random thing, you do kind of what I've drawn up
here, where I've done six different splits of the data. And what
you would do ideally is split it up this way, where you have, okay, up to a
certain day, this is training data, and after that is testing data. And that
more closely mimics what will happen in real life, where in real life, we have
everything that we've seen in the past, and that's our training data. And what
we're going to be testing on is everything that's happening in the future. Now,
another critical thing to know about time series data is that some models don't
do a good job of picking up on the different seasonal trends; they're not able
to figure that out.&lt;/p&gt;
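&lt;p&gt;One way to sketch a time-based split with only the standard library is below. The helper name &lt;code&gt;time_based_splits&lt;/code&gt;, the made-up daily readings, and the equal-sized folds are assumptions for illustration rather than the exact splits drawn on the slide.&lt;/p&gt;

```python
from datetime import date, timedelta


def time_based_splits(rows, n_splits):
    """Yield (train, test) pairs where every test row comes after every train row."""
    rows = sorted(rows, key=lambda row: row[0])
    fold = len(rows) // (n_splits + 1)
    for i in range(1, n_splits + 1):
        # Training data is everything up to a cutoff; testing data is the next fold.
        yield rows[: i * fold], rows[i * fold : (i + 1) * fold]


# Hypothetical daily load readings: (day, megawatts).
start = date(2019, 1, 1)
readings = [(start + timedelta(days=d), 100 + d % 7) for d in range(70)]

splits = list(time_based_splits(readings, n_splits=6))
for train, test in splits:
    print("train through", train[-1][0], "then test from", test[0][0], "to", test[-1][0])
```

&lt;p&gt;Each split trains on a longer stretch of the past and tests on the period right after it, so the model never sees a day later than the ones it is asked to predict.&lt;/p&gt;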
&lt;p&gt;So our savior here is this open source library called Prophet, which has
integrated a lot of the learnings about time series data into a nice,
easy-to-use package. And I'll sort of walk you through: on each of those
training/testing splits, it sort of figures out what the seasonality pattern is.
So you can see with only a year of data, it doesn't really figure out what
exactly is happening; it can't figure out that this is sort of a sine, just a
little wave. But then as you get more and more data, it starts to become more
and more confident and know better and better what the seasonal trends
are.&lt;/p&gt;
&lt;p&gt;So we have learned some things today. The first thing that you should take
away is if you run into something that has a time component to it, you need to
be extra careful, because there's a lot of things about it that are special and
you can lead yourself astray. Seasonality is the biggest one. And if you do a
random train/test split, you will know the future when you're
doing training, which will mess you up.&lt;/p&gt;
&lt;p&gt;Alright, last one: we're going to talk about using machine learning to find your
next job. The problem here was that a few years ago, I was not, like, actively
job hunting, but I was just interested in seeing what's out there. You know, I
was just sort of passively looking around. And when I signed up for job
newsletters, they were way too noisy. I got way more jobs than I ever wanted to
look at, and I was wasting time reading through these emails. I was like, I
don't want to do this; I want to get an email that has, like, three jobs in it
that might be cool, right? So what I started to do was I would go look at job
search listings, and I would copy and paste the title and the company and then a
link to the job description. I did this in a Google Sheet. And then when I was
bored, like when I was at the bus station or waiting in line for something, I
would go and I'd read the job descriptions and come back to my spreadsheet and
mark whether or not it sounded cool to me. And I will have to admit to you here
that I definitely spent way more time reading job descriptions for this than I
would have if I just bit the bullet and dealt with the noisy emails. But, I
mean, you know, we're at an open source software conference today. So if
there's not a safe place where I can be among over-engineering nerds here,
then there's no good place for me. So cut me a little slack on the
over-engineering here.&lt;/p&gt;
&lt;p&gt;So the next question is what kind of machine learning problem this is. We're
trying to figure out whether or not a job sounds cool. Classification? Anyone
think it's anything other than classification? Okay, shaking heads. Yes. Great
job. Yeah. Classification. Wonderful.&lt;/p&gt;
&lt;p&gt;So at this point, we want to build a model. And we've seen in the past ways to
build models. But the ways that we've seen in the past are all numerically
based. Someone's age can be represented as a number, their net worth can be
represented as a number, we can represent the day of the year as a number, we
can represent the hour of the day as a number. But how do we represent a job
title as a number? This is kind of confusing. And so when I first ran into this
problem, I did what I highly recommend any of you do when you don't know how to
do something: Google it. And so I searched, like, text representations and
machine learning. What I got back was this thing, which I'll explain to you.
This is what is often called a word count vector, or some people will call it a
vector space model of language. It is not perfect, but it's a good start. Let me
explain how this works. You put all of your job titles down the rows of the
matrix, and you have all of the words that occur in any job title
along the columns of the matrix. And then what you do is you have a zero or a
one in each little spot (or you can have more than that if the same word occurs
multiple times). So for the first example, we have Senior Web Applications
Developer, Data Analytics. And so we look through each of our
columns and say, does "engineer" occur in this job title? It does not, so we
say it's zero. "Web" does occur in the job title, so we'll put a one there. Then
"applications" occurs, so we'll put a one there, etc., etc.&lt;/p&gt;
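&lt;p&gt;The construction above can be sketched by hand in a few lines. The three job titles are made up for illustration, and splitting on whitespace is a simplifying assumption; real tokenizers also handle punctuation and other details.&lt;/p&gt;

```python
from collections import Counter


def word_count_vectors(titles):
    """Return (vocabulary, matrix): one row per title, one column per word."""
    tokenized = [title.lower().split() for title in titles]
    # Columns: every word that occurs in any title, in alphabetical order.
    vocabulary = sorted({word for words in tokenized for word in words})
    matrix = []
    for words in tokenized:
        counts = Counter(words)  # missing words count as 0
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


# Made-up job titles for illustration.
titles = [
    "Senior Web Applications Developer",
    "Data Engineer",
    "Data Scientist",
]
vocabulary, matrix = word_count_vectors(titles)
print(vocabulary)
for title, row in zip(titles, matrix):
    print(row, title)
```

&lt;p&gt;Every title becomes a row of numbers of the same length, which is what lets an ordinary numeric model consume text.&lt;/p&gt;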
&lt;p&gt;And this is really boring to do by hand. So we should use a scikit-learn tool
that we'll talk about in just a second. But what we're able to do with this is
turn this data scientist job title into this set of numbers. Now we have numbers,
and we can take that "sounds cool" thing and turn it into a number, just a one
or a zero. Now we have numbers, and we can just use the models that we already
know about to learn something from these numbers.&lt;/p&gt;
&lt;p&gt;This is the way that we can kind of do that. It's just some example code
using scikit-learn. And so the first thing we do is gather together our data. So
we take our titles and whether or not each one sounds cool and turn them into a
matrix. And then we use this thing called a &lt;code&gt;CountVectorizer&lt;/code&gt;, which is the
thing that does this word count vector creation. And then we call this method
that's &lt;code&gt;fit_transform&lt;/code&gt;. What this
does is it causes the &lt;code&gt;CountVectorizer&lt;/code&gt; to figure out what all
the words are, and then make the word count vectors. We then turn it into an
array just because it's more convenient to do that. Then we have another model
that we haven't spoken about today. But there's a model called logistic
regression that we can use and fit on this data that has now been transformed
into vectors, as well as the ratings. Then we can take some new jobs and
predict whether or not they sound cool. And then we get this array at the bottom
here that's sort of commented out, which says the first job did not sound cool.
The second job did not sound cool. But then that fourth job there sounds
cool.&lt;/p&gt;
&lt;p&gt;So I did this, I ran this, and I used that error metric that I told you earlier
was a good error metric, called the classification error. It came out and it
said I had a 0.197 classification error, which means I am right about 80% of the
time. And I'm thinking, this is amazing. I am the greatest data scientist in the
world. This is awesome. But it turns out, I realized what was happening was, in
all of the job listings that I'd read, only about 20%
of them sounded cool. And it turned out that what my model had figured out was,
if it just said nothing sounded cool, it would only get 20% error. And it's
like, that's pretty good. Let me just do that; this is way easier.&lt;/p&gt;
&lt;p&gt;So this is a self-portrait I drew after I realized what had happened and was
very disappointed and very, very sad about the fact that my model had realized
that it could exploit this sort of imbalance in the data. One way that we can
combat this is by using another tool for evaluating error called a confusion
matrix. And in this, we put the predicted labels on one axis and actual
labels on another. And then we count up the number of examples. For instance,
this top left has the number of jobs which I said
weren't cool, where they actually weren't cool. And then to the right of that we
have the number of jobs that actually were cool, that I said weren't cool. And
we can see from this, oh, my model is just predicting zero for
everything, and we can catch the error a lot more easily.&lt;/p&gt;
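&lt;p&gt;Here is a minimal sketch of that check, using made-up ratings where 2 of 10 jobs sounded cool and a lazy model that always answers "not cool". The layout (rows are predicted labels, columns are actual labels) follows the description above; other tools may transpose it.&lt;/p&gt;

```python
def confusion_matrix(actual, predicted, labels=(0, 1)):
    """Rows are predicted labels, columns are actual labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[p]][index[a]] += 1
    return matrix


# Made-up ratings: 1 means the job sounded cool, 0 means it did not.
actual = [0, 0, 0, 1, 0, 0, 0, 0, 1, 0]
# A lazy model that says "not cool" for everything.
predicted = [0] * len(actual)

error = sum(a != p for a, p in zip(actual, predicted)) / len(actual)
matrix = confusion_matrix(actual, predicted)
print("classification error:", error)  # 0.2, which looks fine on its own
print(matrix)  # [[8, 2], [0, 0]]: the "predicted cool" row is empty
```

&lt;p&gt;The error alone looks respectable, but the all-zero second row makes it obvious that the model never predicts "cool" at all.&lt;/p&gt;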
&lt;p&gt;Another way we can do this is through metrics like precision and recall.
Precision gives us a number which quantifies, for all of the jobs that my model
says are cool, how many of them actually turned out being
cool? And then recall tells me, for all of the jobs that are actually cool, how
many of them am I bringing back to the surface? How many of them am I
recalling and saying are cool? If I was using these error metrics, I would
see that I had a recall of zero because I wasn't bringing back any jobs
that were cool.&lt;/p&gt;
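&lt;p&gt;Both metrics are short to compute by hand. In this sketch the ratings and the two sets of predictions are made up, and reporting 0.0 when a denominator is zero is a convention chosen here for illustration.&lt;/p&gt;

```python
def precision_recall(actual, predicted, positive=1):
    """Return (precision, recall) for the positive label."""
    true_pos = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    predicted_pos = sum(p == positive for p in predicted)
    actual_pos = sum(a == positive for a in actual)
    # Assumption for this sketch: report 0.0 when there is nothing to divide by.
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall


# Made-up ratings: 1 means the job sounded cool, 0 means it did not.
actual = [0, 0, 0, 1, 0, 0, 0, 0, 1, 0]
lazy = [0] * len(actual)                 # never says "cool"
picky = [0, 0, 1, 1, 0, 0, 0, 0, 0, 0]   # says "cool" twice, right once

print("lazy: ", precision_recall(actual, lazy))   # (0.0, 0.0)
print("picky:", precision_recall(actual, picky))  # (0.5, 0.5)
```

&lt;p&gt;The lazy model has the lower classification error of the two (0.2 versus 0.3), yet its recall of zero shows it is useless for surfacing cool jobs.&lt;/p&gt;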
&lt;p&gt;Some other techniques that are common when you're dealing with unbalanced data
are oversampling and undersampling. In undersampling, what you do is you
just take that majority class (so we had 80% of our data sounding not cool), and
we would just throw away a bunch of it until we got to an even split. And then
at that point, we would train a model on that synthetic data set, on that data
set that had half cool and half not cool. And our model would do a little bit
better than just always saying not cool.&lt;/p&gt;
&lt;p&gt;Another way we can approach this is through a technique called oversampling,
where we just duplicate points in the data set that are cool, and that
way we can get back up to an even split. The reason we have to do this
is that some models kind of assume that you have an even split between classes,
and that's not always necessarily the case. And there's a lot of detail that
goes on with this imbalance stuff, and you can do a lot more research about this,
and I would be happy to talk with you more about this. But just so you have an
idea of some ways to approach it, oversampling and undersampling are both good
ways of addressing this problem.&lt;/p&gt;
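&lt;p&gt;Both resampling ideas can be sketched with the standard library. The rows below are made up (8 not-cool jobs, 2 cool ones), the helper names are hypothetical, and the fixed random seeds are only there to make the example repeatable.&lt;/p&gt;

```python
import random


def split_by_label(rows):
    """Group (title, label) rows by their label."""
    groups = {}
    for row in rows:
        groups.setdefault(row[1], []).append(row)
    return groups


def undersample(rows, rng):
    """Throw away majority-class rows until every class is the same size."""
    groups = split_by_label(rows)
    smallest = min(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, smallest))
    return balanced


def oversample(rows, rng):
    """Duplicate minority-class rows until every class is the same size."""
    groups = split_by_label(rows)
    largest = max(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group)
        balanced.extend(rng.choices(group, k=largest - len(group)))
    return balanced


# Made-up data: (title, sounds_cool) with an 80/20 imbalance.
rows = [("boring job %d" % i, 0) for i in range(8)] + [("cool job %d" % i, 1) for i in range(2)]

under = undersample(rows, random.Random(0))
over = oversample(rows, random.Random(0))
print(len(under), "rows after undersampling,", sum(label for _, label in under), "cool")
print(len(over), "rows after oversampling,", sum(label for _, label in over), "cool")
```

&lt;p&gt;Undersampling keeps the data honest but discards examples; oversampling keeps everything but repeats the rare class, so any resampling should happen on the training data only, never on the testing data.&lt;/p&gt;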
&lt;p&gt;What I ended up doing here was using undersampling. And then what I did was I
made myself a little email thing that would automatically go look at my
spreadsheet and find the 10 jobs that sounded the very coolest, and then I'd get
this much shorter email. And my beautiful over-engineered result would come
back. There are some things that we have learned here. Let's talk about
that. The first thing is to understand the base rate: to know if your model is
actually doing well or not, you need to know what would happen if you just did
the stupidest thing possible. If I just predicted the most common case, what
would happen? Another thing is that simple doesn't mean ineffective. I started
this hoping that I would end up using deep learning or something. And
then it was good enough with, like, the simplest model that you learn on, like,
day two of class. And that kind of made me sad because I really wanted to try
out something cooler, but it ended up working and that was great.&lt;/p&gt;
&lt;p&gt;The reason for this is something that's called the approximation-generalization
trade-off. And I will explain what that is right now. What this
means, as you might guess from the name, is that there's a trade-off between the
level of approximation that we can do and the level of generalization we can do.
Approximation means, for that training data that we have, how good is our model
at representing that training data? So we can see here, our red model, which is
a nearest neighbor model, which is capable of memorizing data sets, is doing an
incredible job at memorizing the data set. It approximates the training data
perfectly; it has the exact right answer for everything. On the other
hand, our green model is not really approximating the input data set very well;
it's kind of a little bit far away in certain places. And that could be
problematic. Here's where the simple model can do really well, though. When we
start to wonder about the generalization, our simple model does a lot better,
because it has given up on some approximation to get itself closer to the true
function. You can see that this red model is really far away from a lot of the
points in the testing data set, whereas the green model is doing a lot
better on those points. It's a lot closer to those points. So the question is,
do we want to get really good at knowing the training data? Or do we want to
generalize and learn some more broad pattern? And when we use simple models, we
usually win in terms of the approximation-generalization trade-off.&lt;/p&gt;
&lt;p&gt;The other thing that's nice about simple models is that they're just easier. And
it's been a little while since I've tried to set up TensorFlow and PyTorch on my
laptop, but it was not very fun last time. And scikit-learn is really easy to
set up. So I would highly recommend just trying something simple and starting
out with that.&lt;/p&gt;
&lt;p&gt;Okay, y'all, take a deep breath. I'm going to take some water.
There's a lot of stuff there.  We're going to we're going to summarize real
quick after this, but just everyone take a deep breath.&lt;/p&gt;
&lt;p&gt;Alright. We talked about supervised learning, where we use past examples to
predict a continuous value in the case of regression, or a discrete value in the
case of classification. We also said that to measure the performance of our
models, it's really smart. It's a really good idea to split the data into
subsets of training and testing data. Another thing that we mentioned was that
it's important to keep it simple, stupid, the simplest thing that could possibly
work might actually work. And then you don't have to do anything more
complicated. And you will probably learn something valuable along the way. The
last takeaway I have for you is to test and iterate, build a model.  Try it out,
see if it's good. If it's not, you can always build another model.  And if it
is, you're done. That's great.&lt;/p&gt;
&lt;p&gt;Thank you guys so much for coming. And I want to give a quick shout out to my
employer I work for Indeed, and we like data and we like helping people get
jobs. If you're interested in any of that or you just want to talk about machine
learning, I would love to chat with you about anything data related. My Twitter
handle is up there -- I talk about data online.  Or if you have another session
to get to and you do have a question but you don't want to come up after feel
free to email me. I love getting email from people who care about data stuff,
that's my personal email and it will get to me. But yeah, I hope that from this
you're able to take and do some really cool machine learning stuff. Thank you.
And if you do have questions, feel free to come up. I'd love to talk to you.&lt;/p&gt;</description><author>Samuel Taylor</author><guid isPermaLink="true">https://www.samueltaylor.org/articles/open-source-machine-learning.html?utm_source=rss&amp;utm_campaign=st-blog-feed</guid><pubDate>Sun, 13 Oct 2019 21:07:55 GMT</pubDate></item><item><title>Don't try to sound smart when giving a presentation</title><link>https://www.samueltaylor.org/articles/dont-try-to-sound-smart.html?utm_source=rss&amp;utm_campaign=st-blog-feed</link><description>&lt;p&gt;Think about the last time you gave a presentation. Were you nervous? Excited?
Scared to look stupid in front of your boss? Confident in your ability to wow
the others in the room? These emotions reveal that at our core, our biggest
desire when presenting is to make ourselves look good. This desire is
antithetical to a good presentation and harmful if not addressed directly.&lt;/p&gt;
&lt;p&gt;Our first priority must be the audience. They're giving us their attention, and
we have the responsibility to use that precious resource wisely. If we waste it,
they'll be less likely to give us this attention in the future. While we may
think we're being clever with our sales pitch, people are smart enough to see
right through us (even if what we're selling is our own self image).&lt;/p&gt;
&lt;p&gt;Prioritizing the audience requires us to adapt our message to them. This doesn't
mean dumbing it down. As an example, consider presenting some interesting result
to two audiences: a coworker and the CEO. The CEO of your company will require
more context than your coworker. This does not mean that the CEO is somehow
dumb; it means that she isn't as familiar with your work (for obvious reasons).&lt;/p&gt;
&lt;p&gt;Instead of prioritizing the audience, we often prioritize ourselves (typically
without even realizing it). We want to appear smart, competent, and confident in
front of others. Subconsciously, this leads us to present in ways that obfuscate
the truth in order to make ourselves appear intelligent. If our audience leaves
a presentation thinking that we are a genius, we have probably failed to explain
our idea in a way that they can understand.&lt;/p&gt;
&lt;p&gt;We should go into presentations hoping our audience finds what we say to be
"common sense". If they agree with what we've said without much questioning, we
have likely guided them through our ideas in a way that is easy to understand.&lt;/p&gt;
&lt;p&gt;If you've never given thought to the specific people that are in your room, try
it out. Do research to understand what they know and how they best digest
information. If you already do this, consider doubling the amount of time you
spend here.&lt;/p&gt;
&lt;p&gt;When we go into presentations seeking to look like a genius, we end up confusing
rather than enlightening. Those who try to look impressive ultimately fail to do
so, but (paradoxically) those who give up on impressing do.&lt;/p&gt;</description><author>Samuel Taylor</author><guid isPermaLink="true">https://www.samueltaylor.org/articles/dont-try-to-sound-smart.html?utm_source=rss&amp;utm_campaign=st-blog-feed</guid><pubDate>Thu, 29 Aug 2019 14:15:15 GMT</pubDate></item><item><title>A five minute meeting shouldn't ruin your productivity</title><link>https://www.samueltaylor.org/articles/a-5min-meeting-shouldnt-ruin-productivity.html?utm_source=rss&amp;utm_campaign=st-blog-feed</link><description>&lt;p&gt;As developers, we deplore being interrupted. A stream of Twitter posts lament
that a five minute meeting can harm productivity for hours. But it doesn't have
to be like this: we can dramatically decrease the cost of being interrupted with
a simple technique.  What's more, implementing this technique makes us more
effective and confident in our work.&lt;/p&gt;
&lt;h2&gt;Problems with this narrative&lt;/h2&gt;
&lt;p&gt;I can't talk about this without first critiquing it. The framing of these
complaints is often adversarial.&lt;/p&gt;
&lt;blockquote class="twitter-tweet"&gt;&lt;p lang="en" dir="ltr"&gt;That five minute meeting with a developer... &lt;a href="https://t.co/xPgYwDzCnY"&gt;pic.twitter.com/xPgYwDzCnY&lt;/a&gt;&lt;/p&gt;&amp;mdash; Richard Campbell (@richcampbell) &lt;a href="https://twitter.com/richcampbell/status/1142130583998533633?ref_src=twsrc%5Etfw"&gt;June 21, 2019&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src="https://platform.twitter.com/widgets.js" charset="utf-8"&gt;&lt;/script&gt;

&lt;p&gt;When we create a "developers vs. others" mentality, we decrease the level of
trust and psychological safety on our teams. Instead, we should seek to
understand the "other" in our midst; why does our PM have to ask a dev a
question so often? Is our documentation bad? Are we changing direction too
quickly?&lt;/p&gt;
&lt;p&gt;A good working relationship allows both sides to understand each other better.
If we want to decrease the number of interruptions we experience, posting
divisive tweets is a strategy that may actually have the opposite effect in the
long term. &lt;/p&gt;
&lt;p&gt;Working alongside every member of our team is a much better strategy for
creating change than spreading insular, us vs them ideas.&lt;/p&gt;
&lt;h2&gt;Taking matters into our hands&lt;/h2&gt;
&lt;p&gt;We aren't helpless here; we can change our own behavior to address this issue.
Here's what's been most effective for me: taking notes. It's a simple but
powerful technique. A solid set of notes enables us to change contexts with
confidence, knowing that returning to a previous context will be easy.&lt;/p&gt;
&lt;p&gt;Because of the time/effort it takes to gain, well, context on a certain task,
context switching is hard. Human brains have limited capacity for remembering
things like "What line was it that was causing that exception?" or "Which
features were most important in the last training of this model?".
Conveniently, computers are &lt;em&gt;great&lt;/em&gt; at this kind of thing.&lt;/p&gt;
&lt;p&gt;While I started taking notes to avoid keeping too much state in my brain, I have
realized a couple other benefits:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Record of results:&lt;/strong&gt; as a Data Scientist, I do a lot of experiments. For new
  problems, I'll often try several different things. Seldom do they all work!
  My notes here function as a sort of "lab notebook", enabling me to review the
  experiments I've run and reflect on their implications (either by myself or
  with colleagues).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multiple projects:&lt;/strong&gt; when I'm blocked on something, it's become much easier
  to switch to work on another project. Having a few projects at once is much
  easier to manage when the current state of each project is written down rather
  than in my head (or on JIRA).&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Writing good notes&lt;/h2&gt;
&lt;p&gt;So how do we take good notes? Here's what works for me. Each project gets a
document. While working on the project, I append things:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Surprising discoveries (for example textual descriptions, drawings, data
  visualizations)&lt;/li&gt;
&lt;li&gt;Helpful bits of documentation (StackOverflow questions, company wiki)&lt;/li&gt;
&lt;li&gt;Descriptions of what I'm trying to do and how I'm trying to do that. Even a
  few words or a file name and line number are enough to jog my memory.&lt;/li&gt;
&lt;li&gt;Results of things I've tried. Did a certain test fail? What was the MSE when I
  included this feature?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is a fractal process, so I find it useful to use bulleted lists with
multiple levels.&lt;/p&gt;
&lt;p&gt;The particular medium we use for this isn't important. I use OneNote (because
its free tier offers everything I need), but there are a slew of other products
you can use. Digital note apps are convenient because they sync across devices,
but you could also use plaintext documents, paper notebooks, or even a
whiteboard if you prefer.&lt;/p&gt;
&lt;h2&gt;You can do this!&lt;/h2&gt;
&lt;p&gt;You may think that creating this set of notes is more trouble than it's worth.
What surprised me when I started doing this was that writing down my progress
ended up being fairly easy. All in all, it beats the heck out of losing your
place because someone interrupts you.&lt;/p&gt;
&lt;p&gt;Of course, this isn't a panacea. Being interrupted still can cause us to lose
track of things we haven't quite written down yet. And we do have to build the
habit of making good notes. But the costs are worth it.&lt;/p&gt;
&lt;h2&gt;You &lt;em&gt;should&lt;/em&gt; do this&lt;/h2&gt;
&lt;p&gt;A written log of work gives us confidence that we can execute on our projects.
Because we track state for each project, we can stop worrying about forgetting
something valuable. Every time we step up to a task, we have a helpful record of
what our goal is and the ways we've tried to achieve it. We free up space in our
brain for more useful tasks.&lt;/p&gt;
&lt;p&gt;A five-minute meeting doesn't have to ruin your productivity. By letting
computers do what they're good at, we free our brains to do the difficult,
valuable intellectual work that most fulfills us. Take notes as you work: it's
your new superpower.&lt;/p&gt;</description><author>Samuel Taylor</author><guid isPermaLink="true">https://www.samueltaylor.org/articles/a-5min-meeting-shouldnt-ruin-productivity.html?utm_source=rss&amp;utm_campaign=st-blog-feed</guid><pubDate>Thu, 22 Aug 2019 11:53:02 GMT</pubDate></item><item><title>Recognize Class Imbalance with Baselines and Better Metrics</title><link>https://engineering.indeedblog.com/blog/2019/08/recognizing-class-imbalance/</link><description></description><author>Samuel Taylor</author><guid isPermaLink="true">https://engineering.indeedblog.com/blog/2019/08/recognizing-class-imbalance/</guid><pubDate>Fri, 02 Aug 2019 05:00:00 GMT</pubDate></item><item><title>Model-agnostic feature importance through ablation</title><link>https://www.samueltaylor.org/articles/feature-importance-for-any-model.html?utm_source=rss&amp;utm_campaign=st-blog-feed</link><description>&lt;p&gt;&lt;img src="/static/img/selecting_a_record_grande.jpg"&gt;&lt;/img&gt;&lt;/p&gt;
&lt;p&gt;Feature importances are, well, important. We can use them to provide a
rudimentary level of interpretability: the higher a feature's importance, the
more the model relies on it to predict the target variable. Some machine
learning models have an innate way of calculating feature importance (decision
trees, for instance). Others don't (for example, support vector machines
using an RBF kernel). Further, some models produce a set of coefficients (like
linear regression) that are easy to misinterpret (e.g. if you have two features
with dramatically different scales).&lt;/p&gt;
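&lt;p&gt;To make the scale problem concrete, here is a minimal sketch in plain Python. The data and the helper &lt;code&gt;ols_slope&lt;/code&gt; are made up for illustration; the point is only that the same relationship produces very different coefficients depending on the unit of the feature.&lt;/p&gt;

```python
def ols_slope(xs, ys):
    """Slope of the one-feature least-squares line through (xs, ys)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


# Hypothetical data: a length in meters and a target that is exactly 2 * length.
length_m = [1.0, 2.0, 3.0, 4.0]
target = [2.0, 4.0, 6.0, 8.0]

# The very same feature, expressed in millimeters instead.
length_mm = [1000.0 * x for x in length_m]

slope_m = ols_slope(length_m, target)    # coefficient per meter
slope_mm = ols_slope(length_mm, target)  # coefficient per millimeter
print(slope_m, slope_mm)
```

&lt;p&gt;The two coefficients differ by a factor of 1000 even though the feature carries exactly the same information in both cases, so raw coefficient size is a poor stand-in for importance unless the features are on comparable scales.&lt;/p&gt;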
&lt;p&gt;Feature ablation is a technique for calculating feature importances that works
for all machine learning models. Given a dataset of &lt;em&gt;n&lt;/em&gt; rows and &lt;em&gt;m&lt;/em&gt; features, the
procedure goes like this:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Train the model on your train set and calculate a score on the test set.  You
   can pick whatever scoring metric you like.&lt;/li&gt;
&lt;li&gt;For each of the &lt;em&gt;m&lt;/em&gt; features, remove it from both the training and test data and
   retrain the model. Then, calculate the score on the test set.&lt;/li&gt;
&lt;li&gt;Rank the features by the difference between the original score (from the
   model with all features) and the score for the model using all features but
   one.&lt;/li&gt;
&lt;/ol&gt;
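&lt;p&gt;The three steps above can be sketched as a small, model-agnostic helper. This is only an outline under some assumptions: &lt;code&gt;ablation_importances&lt;/code&gt; and &lt;code&gt;score_fn&lt;/code&gt; are hypothetical names, and the toy scoring function stands in for "train a model on these features and score it on the test set".&lt;/p&gt;

```python
def ablation_importances(features, score_fn):
    """Rank features by how much the score drops when each one is removed.

    `score_fn` is a hypothetical callback: given a list of feature names, it
    should train a model on only those features and return a test-set score
    where higher is better.
    """
    base_score = score_fn(features)  # step 1: score with all features
    drops = {}
    for feature in features:         # step 2: leave one feature out at a time
        remaining = [f for f in features if f != feature]
        drops[feature] = base_score - score_fn(remaining)
    # step 3: the biggest drop in score marks the most important feature
    return sorted(drops.items(), key=lambda item: item[1], reverse=True)


# Toy stand-in for "train and score": each feature adds a fixed amount.
toy_contribution = {"age": 0.30, "income": 0.10, "zip_code": 0.00}


def toy_score(features):
    return 0.5 + sum(toy_contribution[f] for f in features)


ranking = ablation_importances(list(toy_contribution), toy_score)
print([name for name, _ in ranking])
```

&lt;p&gt;Because the helper only ever calls &lt;code&gt;score_fn&lt;/code&gt;, it does not care which model or metric sits behind it.&lt;/p&gt;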
&lt;h2&gt;Example code&lt;/h2&gt;
&lt;p&gt;Here's an example of how we could actually perform this procedure in Python
using scikit-learn.&lt;/p&gt;
&lt;p&gt;First, we import some things we'll need: &lt;code&gt;load_digits&lt;/code&gt; will load in the digits
dataset. &lt;code&gt;SVC&lt;/code&gt; is the model we'll use. &lt;code&gt;train_test_split&lt;/code&gt; is a utility function that
splits the dataset into training and testing portions. &lt;code&gt;sklearn.metrics&lt;/code&gt; has a lot
of pre-defined metrics in it.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.datasets&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_digits&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.svm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SVC&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;mx&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We load and split the data.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;load_digits&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Now we define a function which will train and score a model for us. Given the
data, it creates and trains a support vector machine, then returns the accuracy.
We then store the score of the model trained on all features in &lt;code&gt;base_score&lt;/code&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;score_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;clf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SVC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gamma&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;scale&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kernel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;rbf&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;clf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;y_pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;clf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;mx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;accuracy_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;base_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;score_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Then, we iterate through all features, building a boolean mask &lt;code&gt;use_column&lt;/code&gt; which we use
to select all columns except for the one which we're currently scoring. We append
the score of each model to the list &lt;code&gt;scores&lt;/code&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
    &lt;span class="n"&gt;use_column&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ndx&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ndx&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])]&lt;/span&gt;
    &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;score_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="n"&gt;use_column&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                              &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="n"&gt;use_column&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                              &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                              &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

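&lt;p&gt;Finally, we rank the features by how much the score dropped when each was removed and take the top 10. Each entry in the output is a (feature index, drop in accuracy) pair.&lt;/p&gt;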
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nb"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;base_score&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
       &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;ndx_score&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ndx_score&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
       &lt;span class="n"&gt;reverse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)[:&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
&lt;span class="sd"&gt;[(12, 0.005555555555555647),&lt;/span&gt;
&lt;span class="sd"&gt; (21, 0.005555555555555647),&lt;/span&gt;
&lt;span class="sd"&gt; (5, 0.002777777777777879),&lt;/span&gt;
&lt;span class="sd"&gt; (10, 0.002777777777777879),&lt;/span&gt;
&lt;span class="sd"&gt; (17, 0.002777777777777879),&lt;/span&gt;
&lt;span class="sd"&gt; (18, 0.002777777777777879),&lt;/span&gt;
&lt;span class="sd"&gt; (20, 0.002777777777777879),&lt;/span&gt;
&lt;span class="sd"&gt; (34, 0.002777777777777879),&lt;/span&gt;
&lt;span class="sd"&gt; (37, 0.002777777777777879),&lt;/span&gt;
&lt;span class="sd"&gt; (46, 0.002777777777777879)]&lt;/span&gt;
&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;h2&gt;Relation to stepwise regression&lt;/h2&gt;
&lt;p&gt;You may recognize this idea as being similar to &lt;a href="https://en.wikipedia.org/wiki/Stepwise_regression#Main_approaches"&gt;backward stepwise
regression&lt;/a&gt;.
Wasserman (2005) describes this technique for model selection as "we start with
the biggest model and drop one variable at a time" (p. 221). We drop variables
until the score has decreased beyond some acceptable level or until we have
reached the desired number of features. He notes that this is a greedy search
and is not "guaranteed to find the model with the best score." If we were to use
&lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html"&gt;scikit-learn's recursive feature
elimination&lt;/a&gt;
in combination with this feature ablation technique, we would be using backward
stepwise regression.&lt;/p&gt;
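&lt;p&gt;The greedy loop described above might be sketched roughly as follows. Several things here are assumptions made for illustration: the names &lt;code&gt;backward_stepwise&lt;/code&gt; and &lt;code&gt;score_fn&lt;/code&gt;, the default thresholds, and the choice to drop whichever feature costs the least score at each step.&lt;/p&gt;

```python
def backward_stepwise(features, score_fn, min_features=1, max_drop=0.01):
    """Greedy backward elimination.

    Repeatedly removes the feature whose removal hurts the score least, and
    stops once `min_features` remain or the score would fall more than
    `max_drop` below the all-features score. Both thresholds are illustrative
    defaults, and `score_fn` is a hypothetical train-and-score callback.
    """
    selected = list(features)
    full_score = score_fn(selected)
    while len(selected) > min_features:
        # Score every candidate model that omits exactly one remaining feature.
        candidates = [
            (score_fn([f for f in selected if f != feature]), feature)
            for feature in selected
        ]
        best_score, least_useful = max(candidates)
        if full_score - best_score > max_drop:
            break  # every possible removal costs too much
        selected.remove(least_useful)
    return selected


# Toy stand-in for "train and score": each feature adds a fixed amount.
toy_contribution = {"age": 0.30, "income": 0.10, "zip_code": 0.00}


def toy_score(features):
    return 0.5 + sum(toy_contribution[f] for f in features)


kept = backward_stepwise(list(toy_contribution), toy_score)
print(kept)
```

&lt;p&gt;Being greedy, the loop never revisits a feature once it is dropped, which is why it is not guaranteed to find the best-scoring subset.&lt;/p&gt;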
&lt;p&gt;If you do decide to apply stepwise regression, be careful with the test set used
to evaluate the features. If you choose features that optimize the score on the
test set, you are overfitting to the test set, and any metrics calculated on it
will be optimistically biased. Instead, I would recommend splitting the training
set into 5 folds and performing cross validation to select features. After that
process, metrics calculated on the test set remain valid (because it was not
used during feature selection or training).&lt;/p&gt;
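&lt;p&gt;As a rough illustration of the fold bookkeeping, here is a plain-Python sketch. The helper names are hypothetical, and &lt;code&gt;score_fold&lt;/code&gt; stands in for "fit on these training rows, score on these validation rows". In practice, scikit-learn's &lt;code&gt;cross_val_score&lt;/code&gt; does this work for you.&lt;/p&gt;

```python
def k_fold_splits(n_rows, k=5):
    """Yield (train_indices, validation_indices) pairs for k interleaved folds."""
    indices = list(range(n_rows))
    for fold in range(k):
        validation = indices[fold::k]
        held_out = set(validation)
        train = [i for i in indices if i not in held_out]
        yield train, validation


def cross_val_mean(n_rows, score_fold, k=5):
    """Average a score over k folds of the training set.

    `score_fold(train_idx, val_idx)` is a hypothetical callback that fits a
    model on the training rows and returns its score on the validation rows.
    """
    scores = [score_fold(train, val) for train, val in k_fold_splits(n_rows, k)]
    return sum(scores) / len(scores)


# Each row of the training set lands in exactly one validation fold;
# the held-out test set is never involved.
for train, validation in k_fold_splits(10, k=5):
    print(validation, len(train))
```

&lt;p&gt;Passing a cross-validated score like this to the feature selection step keeps the test set out of the loop entirely. Real data should usually be shuffled before it is split into folds.&lt;/p&gt;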
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;This technique provides a general way to calculate feature importances for any
classification or regression model (even those that don't natively support
them). It's also an element of a feature selection technique called stepwise
regression.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;Comments? Questions? Concerns? Please tweet me
&lt;a href="https://twitter.com/SamuelDataT"&gt;@SamuelDataT&lt;/a&gt; or email me (sgt at this
domain). Thanks!&lt;/p&gt;
&lt;h2&gt;References&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Wasserman, L. (2005). &lt;em&gt;All of statistics: A concise course in statistical
  inference&lt;/em&gt;. New York: Springer.&lt;/li&gt;
&lt;li&gt;Grande, E. &lt;em&gt;Browsing record store shelves&lt;/em&gt;.
  &lt;a href="https://unsplash.com/photos/0vY082Un2pk"&gt;Unsplash&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;</description><author>Samuel Taylor</author><guid isPermaLink="true">https://www.samueltaylor.org/articles/feature-importance-for-any-model.html?utm_source=rss&amp;utm_campaign=st-blog-feed</guid><pubDate>Tue, 23 Jul 2019 20:50:22 GMT</pubDate></item><item><title>Linear interpolation in Postgres using generate_series</title><link>https://www.samueltaylor.org/articles/postgres-linear-interpolation.html?utm_source=rss&amp;utm_campaign=st-blog-feed</link><description>&lt;p&gt;I like to keep track of how many miles I'm driving in my car. One conceivable way of doing this is to create a table in
a Postgres database in which I can track this information.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;mileage&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;observed_date&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;observed_mileage&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Unfortunately, I'm not always the most regular data collector. I often collect this data with gaps of days or months
between each reading.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;mileage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;observed_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;observed_mileage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;2018-05-21&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;84088&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;2018-05-26&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;84201&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;2018-06-13&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;84910&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I want to get some sense for how much I'm driving each day, and one reasonable way I might do that is to linearly
interpolate the mileage between readings. For instance, if I see a reading of 10,000 on August 1st and a reading of
11,000 on August 11th, I want to see that on average I drove 100 miles each day 1-10 August.&lt;/p&gt;
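&lt;p&gt;As a quick check of the arithmetic outside the database, here is a small Python sketch using the first two readings inserted above (the variable names are just for illustration):&lt;/p&gt;

```python
from datetime import date

# The first two readings from the INSERT statement above.
earlier_date, earlier_mileage = date(2018, 5, 21), 84088
later_date, later_mileage = date(2018, 5, 26), 84201

# Subtracting two dates gives a timedelta; .days is the whole-day gap.
days_between = (later_date - earlier_date).days
miles_per_day = (later_mileage - earlier_mileage) / days_between
print(days_between, miles_per_day)
```

&lt;p&gt;The 113 miles between the two readings are spread evenly over the 5 days that separate them.&lt;/p&gt;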
&lt;p&gt;How can we do this in Postgres? First, we pair the data up:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;LAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;observed_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;observed_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;lag_date&lt;/span&gt;
  &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;LAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;observed_mileage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;observed_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;lag_mi&lt;/span&gt;
  &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;observed_date&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;obs_date&lt;/span&gt;
  &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;observed_mileage&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;obs_mi&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;mileage&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This yields:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;  lag_date  | lag_mi |  obs_date  | obs_mi
------------+--------+------------+--------
            |        | 2018-05-21 |  84088
 2018-05-21 |  84088 | 2018-05-26 |  84201
 2018-05-26 |  84201 | 2018-06-13 |  84910
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then, we generate a series between each pair of dates:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;paired_dates&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;LAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;observed_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;observed_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;lag_date&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;LAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;observed_mileage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;observed_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;lag_mi&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;observed_date&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;obs_date&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;observed_mileage&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;obs_mi&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;mileage&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;paired_dates&lt;/span&gt;
  &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lag_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;obs_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;1 day&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;days&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;driven_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mf"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This yields:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;  lag_date  | lag_mi |  obs_date  | obs_mi |      driven_date
------------+--------+------------+--------+------------------------
 2018-05-21 |  84088 | 2018-05-26 |  84201 | 2018-05-21 00:00:00+00
 2018-05-21 |  84088 | 2018-05-26 |  84201 | 2018-05-22 00:00:00+00
 2018-05-21 |  84088 | 2018-05-26 |  84201 | 2018-05-23 00:00:00+00
 2018-05-21 |  84088 | 2018-05-26 |  84201 | 2018-05-24 00:00:00+00
 2018-05-21 |  84088 | 2018-05-26 |  84201 | 2018-05-25 00:00:00+00
 2018-05-21 |  84088 | 2018-05-26 |  84201 | 2018-05-26 00:00:00+00
 2018-05-26 |  84201 | 2018-06-13 |  84910 | 2018-05-26 00:00:00+00
 2018-05-26 |  84201 | 2018-06-13 |  84910 | 2018-05-27 00:00:00+00
 2018-05-26 |  84201 | 2018-06-13 |  84910 | 2018-05-28 00:00:00+00
 2018-05-26 |  84201 | 2018-06-13 |  84910 | 2018-05-29 00:00:00+00
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note that &lt;code&gt;2018-05-26&lt;/code&gt; occurs twice in the &lt;code&gt;driven_date&lt;/code&gt; column. We can fix that by stopping our series just before
getting to the later date:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;paired_dates&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;LAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;observed_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;observed_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;lag_date&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;LAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;observed_mileage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;observed_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;lag_mi&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;observed_date&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;obs_date&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;observed_mileage&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;obs_mi&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;mileage&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;paired_dates&lt;/span&gt;
  &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lag_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;obs_date&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nb"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;1 minute&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;1 day&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;days&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;driven_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mf"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Now we need to calculate the estimated number of miles driven on each &lt;code&gt;driven_date&lt;/code&gt;: the difference in mileage between the paired readings divided by the number of days between them.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;paired_dates&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;LAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;observed_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;observed_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;lag_date&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;LAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;observed_mileage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;observed_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;lag_mi&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;observed_date&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;obs_date&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;observed_mileage&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;obs_mi&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;mileage&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;driven_date&lt;/span&gt;
  &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obs_mi&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;lag_mi&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;NUMERIC&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obs_date&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;lag_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;miles_driven&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;paired_dates&lt;/span&gt;
  &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lag_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;obs_date&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nb"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;1 minute&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;1 day&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;days&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;driven_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mf"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This yields:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;     driven_date     |    miles_driven
---------------------+---------------------
 2018-05-21 00:00:00 | 22.6000000000000000
 2018-05-22 00:00:00 | 22.6000000000000000
 2018-05-23 00:00:00 | 22.6000000000000000
 2018-05-24 00:00:00 | 22.6000000000000000
 2018-05-25 00:00:00 | 22.6000000000000000
 2018-05-26 00:00:00 | 39.3888888888888889
 2018-05-27 00:00:00 | 39.3888888888888889
 2018-05-28 00:00:00 | 39.3888888888888889
 2018-05-29 00:00:00 | 39.3888888888888889
 2018-05-30 00:00:00 | 39.3888888888888889
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And thus I have achieved the desired result.&lt;/p&gt;</description><author>Samuel Taylor</author><guid isPermaLink="true">https://www.samueltaylor.org/articles/postgres-linear-interpolation.html?utm_source=rss&amp;utm_campaign=st-blog-feed</guid><pubDate>Fri, 07 Sep 2018 20:21:04 GMT</pubDate></item><item><title>Machine Learning Crash Course</title><link>https://www.samueltaylor.org/articles/machine-learning-crash-course.html?utm_source=rss&amp;utm_campaign=st-blog-feed</link><description>&lt;div class="embed-responsive"&gt;&lt;iframe width="560" height="315" src="https://www.youtube.com/embed/n1BMtnBTjGY" frameborder="0" allowfullscreen&gt;&lt;/iframe&gt;&lt;/div&gt;

&lt;p&gt;Delivered at:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://anacondacon.io/"&gt;AnacondaCON&lt;/a&gt; on 10 Apr 2018. Slides available &lt;a href="/static/pdf/machine_learning_crash_course_anacondacon_2018.pdf"&gt;here&lt;/a&gt;. Video embedded above.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://windycity.devfest.io/"&gt;Windy City DevFest&lt;/a&gt; on 1 Feb 2019. Slides available &lt;a href="/static/pdf/mlcc_wcd2019.pdf"&gt;here&lt;/a&gt;. Video available &lt;a href="https://www.youtube.com/watch?v=7pOy5bSdXX8"&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Transcript&lt;/h2&gt;
&lt;p&gt;Have you ever applied for a credit card? A few weeks ago I was up late at night and apparently my idea of a good time is
to try to churn some credit card rewards. So I was on a bank's website entering information about myself, and I got to a
point where I had to hit a submit button to submit my application and I clicked submit and within a split second the
page loaded and they told me whether I got the credit card, and my mind was just blown by this because I thought,
"Surely they don't have a squadron of people sitting around reviewing applications. It's 2:00 a.m. And even if they did,
those people couldn't possibly make a decision that quickly." And I was just very confused by this. But the secret here
to how they're able to do this lies in machine learning. In other words, people can take past examples from history and
use them to create a formula that allows them to predict future outcomes. I am Samuel Taylor, as introduced, and today we
are going to be going through a crash course in machine learning. There are a lot of things that this talk is and
aspires to be. And chief among them is to introduce y'all to machine learning concepts. If you don't know anything about
any of this stuff I hope that you are able to come in here and feel like you took something away. &lt;/p&gt;
&lt;p&gt;But if you do have a little bit more experience I hope that this is going to be helpful for you still because you'll be
able to see some different approaches that you might not have considered before and hopefully that will be beneficial to
you. I definitely want to orient this toward application because I find that sometimes when we talk about machine
learning especially at an introductory level it becomes a really weird and hypothetical scenario where you're saying,
"Oh well if you have this situation then you might want to do this" and it just gets weird to talk about and it's a lot
more interesting to talk about an application level. That said it is still very respectful of the theory in this field
because there's a lot of really talented people who've put a lot of really good effort into understanding the way that
these models work. And if we are able to gain some of that understanding, we'll do better at the application. However,
you're not going to be a data scientist after this. We have about 45 minutes. This is not going to be an entirely
comprehensive event here. There is a lot of stuff that I won't be able to cover and a lot of things that we're going to
have to gloss over because again there's only 45 minutes here. I think that tutorials and detailed code examples are
really interesting and there's going to be a few code snippets but those are, I think, less applicable in a forum like
this. &lt;/p&gt;
&lt;p&gt;I'm happy to talk to any of you afterward about specifics and give you links to GitHub repos and stuff if you're
interested. But for the purposes of this talk we will try to stay a little higher level and talk about what the models
used were and some of the techniques and try to focus less on the specifics. This is how we're going to be doing all of
this stuff. We'll start off with a few minutes just on what even is machine learning and then we have sort of a warm up
use case that is this credit card application example I'm talking about. And then we'll dive into 3 use cases. Basically
three problems that I ran into and the ways I ended up solving them including what I learned from those. Finally we'll
just wrap up everything and hopefully tie a nice little intellectually tasty bow on this package. So for those of you
here in this room, if you've heard the phrase machine learning before, raise your hand. OK, some of y'all are liars because I
have said it at least three times already, but now I know which ones of you aren't telling the truth. If you feel
like you've done something with machine learning that you found significant or just interesting, I'd love to see a hand.
Awesome. OK. For all of you, I hope you get something out of this. For the rest of you, we're so glad you're here, and I
think that by the end of this you'll be able to try to do something really cool. &lt;/p&gt;
&lt;p&gt;All right. There is a lot of stuff that is machine learning really and a lot of times people break it up into supervised
and unsupervised problems. Again this isn't comprehensive. Supervised learning is what we're going to talk about for
most of today and what we'll talk about first. In supervised learning we have a set of input data and a set of output data, and
the input data maps to the output data. So we could have, for instance, credit card application data that maps to
whether or not someone should get the credit card. So let's talk about classification first. In classification we have
data that looks something like this. This is again for credit card applications, and this is entirely hypothetical data
that was generated by me drawing some stuff on a graph and then reading the graph. So it's completely fake but it gives
you an idea of what's happening here. In supervised machine learning we have our features and our output, and here the
output is whether or not someone is given credit. That's the thing we're trying to predict, and the input (or the
features) are the age and the net worth. &lt;/p&gt;
&lt;p&gt;So we have some input and we're trying to produce some output. You could take all this data and put it on a graph and we
say we have seen these six credit card applications in the past from people of various ages of various net worths and we
want to know whether to give them a credit card, right? And for all of the hype and excitement around machine learning
this is pretty much just drawing lines. That's all we're going to end up actually doing here. So we're going to do some
really fancy line drawing today. And we can draw a line through here that separates these data points and then when a
new data point comes in, we can ask the model, "Hey, should we approve this credit card?" and it can look and say,
"Well, it's on the 'approve' side of the line so we'll approve this person." &lt;/p&gt;
&lt;p&gt;Regression is very similar to classification in that it still has input data and output data. But instead of predicting
what kind of thing something is (trying to predict a discrete value with a certain number of outputs), it's trying to
predict a continuous value. So we're trying to answer the question "how much" of something. To continue in the bank,
let's say we are trying to give people loans, and someone comes in and they say "My net worth is this, I want a loan.
How big of a loan are you willing to give me?" And you could have a bunch of data that looks something like this. And we
could see well we gave this person with a million dollars a very large loan and we gave a person with less money less
money. And then again all this is is just fancy line drawing. So we draw this line and that's what we call "fitting" the
model, and then we'll want to predict for a new customer who walks into our bank and says, "I have $500,000; how much
of a loan are you willing to give me?" And you just jog up to the line and then jog over to the Y-axis and say we're
going to give them sixty thousand dollars. So pretty simple stuff with supervised learning. &lt;/p&gt;
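&lt;p&gt;As a rough illustration of that "fit a line, then jog up to it" idea, here is a minimal sketch in plain Python. The net worth and loan figures are made up for this example (they are not the data from the slide), and &lt;code&gt;fit_line&lt;/code&gt; and &lt;code&gt;predict&lt;/code&gt; are hypothetical helper names.&lt;/p&gt;

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(line, x):
    """Read the fitted line at x (the "jog up, then jog over" step)."""
    slope, intercept = line
    return slope * x + intercept


# Hypothetical past loans: net worth and loan size, both in dollars.
net_worths = [100_000, 250_000, 1_000_000]
loan_sizes = [12_000, 30_000, 120_000]

line = fit_line(net_worths, loan_sizes)   # "fitting" the model
print(round(predict(line, 500_000)))      # loan size for a new customer
```

&lt;p&gt;With these invented numbers the fitted line happens to give a loan of about sixty thousand dollars for a net worth of $500,000; real data would not sit so neatly on a line.&lt;/p&gt;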
&lt;p&gt;Unsupervised learning: also interesting. One of the biggest areas in it is called clustering. It's probably one of the
key algorithms to know in this space. With clustering, we don't have any outputs. We just have inputs, pretty much.
We see we have nine data points, and here each data point is a different shape, and we might have a table that has the
number of sides and the color of each shape, but we don't really know what we're looking for here. With
clustering we're just trying to uncover some underlying structure in the data. So we might hand this to a computer and
say hey split this into three groups for me. And it seems like you can split this into three groups a number of ways but
the computer might just say here's three groups of these things. And it will provide you with those 3 clusters. &lt;/p&gt;
&lt;p&gt;There's a lot of other stuff and this goes back to the non comprehensive aspect of this talk. This looks like a slide.
It's actually secretly an insurance policy against angry people on Twitter saying that I didn't cover their favorite
subject. &lt;/p&gt;
&lt;p&gt;So all in all the way that I sort of think about machine learning is that we want to find some function in the world.
There is a mathematical function that describes whether someone should be approved for a loan, or whether they are going to
default on a loan, and that function exists out in the world. The problem is that we can't possibly know it because there are so
many factors going into it. I mean, if someone comes into an unexpected medical expense they might default on a loan and
there's all sorts of different factors that affect this. And so it's impossible for us to know all of these things. The
interesting thing and the thing that gives us hope is that we're able to measure some points from this. We can look at
the past and see: well, we know that these points aren't a perfect representation of the actual function but we can see
kind of some area around it and then build something from there. &lt;/p&gt;
&lt;p&gt;So the goal in machine learning especially supervised learning is that we are trying to find some algorithm which will
give us a G of X to approximate F of X. So we're not going to be able to find the true function, but we want to find
something close to it. So all that said we're going to be walking through three use cases today and I wanted to just
have a warm up one so that we can all get used to this. We're going to be walking through five questions for each of
these cases: What's the problem? What does the data look like? What kind of machine learning problem is this? And then
we'll dive into some details on the solution, and we'll talk about lessons learned. So in this case for our credit card
application data the problem is we're trying to decide as a bank whether we should give a consumer a credit card and the
data we've already seen today. It kind of looks like this. We have some input features we have output. So here is the
time for all of your beautiful faces to shout at me if you're angry. Get all of your aggression out and tell me, "What
kind of machine learning problem is this?" Supervised, yes. Yes. Any more specific than supervised? You're going to be
like a little. Yeah. Classification yes. Y'all are brilliant, this is amazing. Yes, so this is a classification problem,
and we could solve this by drawing a line because that is what we're doing today. &lt;/p&gt;
&lt;p&gt;There's a library in Python called scikit-learn that implements a lot of these algorithms for you so you don't have to
waste the time doing that yourself. And basically what you can do is import a model and say, "fit this model to the data
that I have," and that will do the step of drawing a line and then you say, "predict" on a new data point, and you give
it the input data, and it will give you the output that it thinks is true. That's sort of what you do. But then we kind
of ask ourselves, what did we do here? Did we actually accomplish anything? Is this any good? So we want to know how
accurate this thing is. In this case, obviously this is very contrived data and it's deliberately drawn to be easy. And
obviously it's going to get 100 percent accuracy on this. But what does that mean, and how can we think about this more
generally? &lt;/p&gt;
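&lt;p&gt;The fit-then-predict workflow can be sketched without installing anything. scikit-learn estimators expose &lt;code&gt;fit&lt;/code&gt; and &lt;code&gt;predict&lt;/code&gt; methods; the class below is not scikit-learn, just a hand-rolled stand-in with the same shape, and the six applications are invented for this example. It classifies by nearest class centroid, which amounts to drawing a straight boundary between the two groups.&lt;/p&gt;

```python
import math


class NearestCentroidClassifier:
    """Tiny stand-in with a fit/predict interface shaped like scikit-learn's."""

    def fit(self, features, labels):
        grouped = {}
        for row, label in zip(features, labels):
            grouped.setdefault(label, []).append(row)
        # One centroid (the column-wise mean) per class.
        self.centroids_ = {
            label: tuple(sum(col) / len(col) for col in zip(*rows))
            for label, rows in grouped.items()
        }
        return self

    def predict(self, features):
        # Each row gets the label of the closest centroid.
        return [
            min(self.centroids_, key=lambda label: math.dist(row, self.centroids_[label]))
            for row in features
        ]


# Hypothetical past applications: (age, net worth in dollars) and the outcome.
applications = [(45, 400_000), (52, 650_000), (38, 500_000),
                (22, 10_000), (30, 40_000), (27, 25_000)]
outcomes = ["approve", "approve", "approve", "deny", "deny", "deny"]

model = NearestCentroidClassifier().fit(applications, outcomes)
print(model.predict([(41, 450_000), (24, 30_000)]))  # ['approve', 'deny']
```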
&lt;p&gt;To give you an example I have taken a true function that I made up, and I put it on a graph. You can't see it yet. But
then I drew some observations from that function to sort of model what happens in the real world and that's all these
little blue points that you see. And then I fit two different models to them. You can see they do different things.
Basically one of these is better than the other. I think you might be able to tell. Does anyone think Model B is better
than Model A at approximating the true function underlying this? OK. OK, is anyone going to be brave and say that Model A is
doing a better job here? &lt;/p&gt;
&lt;p&gt;OK. OK. Some brave souls out there. So in this case the true function is actually definitely modeled better by Model B.
In usual situations we don't actually know what this green line that I've drawn here (which represents the true function) is;
we just have observations. And so as human beings we can look at this and say, "OK yeah, clearly Model B is better," but
there are caveats to being able to tell a computer how to do this. So what we end up doing is holding out some data for
testing and it looks like this is a little bit small on the screen up there. So I will describe that about 20 percent of
these points are red and the rest of them are blue. And what we've done here is we had some data that came in to us and
we said, "Some of this we'll call training data and some of this we'll call testing data." And what we can do is
basically train our model on the training data and then predict the values for the testing data and compare the two.
Then after that, ideally what we could do is calculate the actual cost of making an error. So if I was designing a
system that could tell from radar data whether a nuclear warhead was headed toward the United States and I said there
was one and there actually wasn't one that would be a huge mistake. Probably a lot of people would die. Maybe everybody.
However if someone is able to fingerprint into my phone that is a much lower cost than literally the destruction of the
human race. &lt;/p&gt;
&lt;p&gt;It's basically going to be some dank memes are going to get out there and that's not really going to change the world
that much. So ideally you'd be able to determine what kind of error you're making and how expensive it is and optimize
your model for that. In the real world it's not always as clear. And so there are some error metrics that you can use to
try to help you understand how your model is performing. One common metric for regression problems is mean squared
error. I've drawn a little example up here with some true data and then our predictions. And basically what you do here
is you subtract the true value and the predicted value and then you square that number and you sum all of that up and
then divide it by the number of points you're predicting for, and we can say, "We were off by... you know, the metric here is
eighteen point three five," and you try to minimize that error when you're comparing models to each other. &lt;/p&gt;
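&lt;p&gt;The mean squared error recipe just described (subtract, square, sum, divide by the number of points) fits in a few lines of Python. The values below are made up and are not the numbers from the slide.&lt;/p&gt;

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between truth and prediction."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)


# Made-up true values and predictions.
y_true = [10, 20, 30, 40]
y_pred = [12, 18, 33, 41]

# Squared errors are 4, 4, 9 and 1, so the mean is 18 / 4.
print(mean_squared_error(y_true, y_pred))  # 4.5
```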
&lt;p&gt;In other cases like classification you can't use that because it doesn't make as much sense but one common metric that
makes a lot of intuitive sense is classification error and basically you're just trying to see, "Did I do the right
thing?" So if on four of these points I correctly classified that they were a certain class then I would say my error
was twenty five percent on those points. So in this use case we've learned a few things. The first thing I would say is
that this stuff is just pretty neat. There's cool things we can do. It's less intimidating than we thought it was; it's
just drawing lines. And then also that it's important to withhold testing data so that way we can evaluate how our
models are doing. So in our next use case we are going to talk about teaching a computer sign language. &lt;/p&gt;
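&lt;p&gt;Two ideas from this warm-up, holding out test data and measuring classification error, can be sketched as follows. The helper names and the toy labels are assumptions for illustration; the 20 percent hold-out fraction mirrors the proportion mentioned in the talk.&lt;/p&gt;

```python
import random


def split_train_test(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and hold out a fraction for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def classification_error(y_true, y_pred):
    """Fraction of predictions that do not match the true label."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


rows = list(range(10))                      # stand-ins for ten labeled examples
train, test = split_train_test(rows)
print(len(train), len(test))                # 8 2

# Three of four labels right means one in four wrong.
truth = ["approve", "deny", "approve", "deny"]
guess = ["approve", "deny", "approve", "approve"]
print(classification_error(truth, guess))   # 0.25
```

&lt;p&gt;The model is trained only on &lt;code&gt;train&lt;/code&gt;; its predictions on &lt;code&gt;test&lt;/code&gt; are what get compared against the truth.&lt;/p&gt;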
&lt;p&gt;So the problem here is that I don't know sign language, and I want to communicate with deaf people because they have
valuable things to communicate. And I was trying to think about this problem and a friend and I were going to a
hackathon when we were in university and we were thinking of what we could do with this little toy that we had. This is
a thing called the Leap Motion. It's a little sensor that has an IR LED in it, and you can see that in that picture here
because cameras are weird. I guess I don't know all the weird camera stuff that makes infrared visible to this
camera but not to the human eye. When you're actually looking at it in person, you can't see that. But basically what it
does is it flashes infrared light up on the human hand and can tell basically where hands are in space. So for this
example someone has a little sensor sitting on their desk or potentially mounted to like a VR headset and they're
holding their hands kind of like this up to it and you can see what the computer is seeing here. Each of these little
dots that is connected by the little pipes I guess is a place on the human hand. By and large just joints in the hand
and then also fingertips. And we were looking at this and thinking OK what if we could do something with this data to
try to help with this problem with sign language. &lt;/p&gt;
&lt;p&gt;And we only had 24 hours, so we thought, "What if we limit the scope here, and we just try to do something for American
sign language?" (because there are a lot of dialects of sign language). And also we just tried to do the alphabet. Let's
start small and see if we can get anything working. So we gathered some data. We took our little device and plugged it
into the computer and then we made an "A" above it and we said, "This is an A, this is an A, this is an A," a bunch of
times. We gave it a bunch of examples of what an A looks like. And then we made a B above it and we said, "This is a B,
this is a B, this is a B," and we taught it. We gathered data about what these things looked like. You can see here that
we have x, y, and z coordinates for each of 20 positions in the human hand. And so you end up getting 60 features for
one output and the outputs here are the letters of the English Alphabet A through Z. So I think y'all know what time it
is; it's time to shout again. What kind of machine learning problem is this? &lt;/p&gt;
&lt;p&gt;Yes, definitely supervised. I also heard someone say classification over there I forget who it was. But you did a great
job. So yes this is a classification problem because there are a discrete number of things we're trying to predict.
There's just 26 things and we know it's one of those 26 things. &lt;/p&gt;
&lt;p&gt;So let's talk about how we went about solving this problem. The first thing we needed to do was choose a model and
that's something that you'll often need to do when you're doing machine learning stuff. And we were somewhat early in
our time of doing this kind of stuff so we weren't really sure what we were doing. We just took a bunch of models from
scikit-learn and just said, "Try all of them, and then we will evaluate all of them on the test data and pick the one that
did the best." And as I learned a little bit more, I found that's actually not a bad thing to try. Just try a
bunch of different things and see what works best. Then once we did that it's not enough to just have a model; we had to
build some sort of interesting application around it because it's not very cool to walk up to someone to say, "Hey, I
made this thing that can tell you from hand position data if you're making a sign for a certain English letter." That
doesn't mean anything to anybody. And so we thought we would try to make a keyboard, and we were working on it and it
would read out whether we were making an a, a b or a c or whatever and put those on the screen and we could see oh cool
we're doing this. &lt;/p&gt;
&lt;p&gt;It turned out that it wasn't quite accurate enough. And we even tried to do some Markov chain stuff,
which basically takes into account the fact that certain letters are more common after other letters. So for instance
"U" is way more common after "Q" than after a lot of other letters. Anyway, we weren't able to get that to work very well,
and so we wondered: what if the answer here is to just try to make a different application around the same model?
And we found that we could make a little educational game around this and we basically tried to market it as Rosetta
Stone but for learning sign language. And this is a little demo that I will play. So it's a little game and it'll tell
you to make a certain letter with your hand and then you try to make your hand look like that letter and then once you
get it it gives you some points. At the end you can put your name in on a scoreboard because everybody likes to compete.
Anyway, that's sort of what we built. And there were a lot of things we learned in this process. &lt;/p&gt;
&lt;p&gt;The first thing is one that I didn't expect to be useful broadly, because I thought, "Oh, we're just doing this for a hackathon.
It's not a big deal." Limiting scope is huge. There are a lot of really huge problems in the world, and if you try to
tackle one you're going to just get lost. So it's really important to find some chunk of a problem that you can actually
solve. Selecting a model is something that you're going to probably have to do if you're doing something like this. And
this approach isn't the worst one. There's a lot worse you could do than to just try out a bunch of things and see what
works best. The final thing I learned is that it's more than just the model. You can have a really interesting model
that's good enough for a language learning game. And if you were trying to make a keyboard with it it wouldn't work very
well. This is something that if you're working in a company you'll probably want to work with the product people in your
company to decide what your users actually need, what they want, and what would be helpful to them. Try to gear your
model toward that. Because at the end of the day we're all trying to make software for human beings. &lt;/p&gt;
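&lt;p&gt;The "try a bunch of models and keep the one that does best on the test data" approach from this use case might look roughly like the sketch below. It is not the hackathon code: the two candidate models, the two-number "hand positions", and every helper name are assumptions made up for illustration (the real data had 60 features and 26 letters).&lt;/p&gt;

```python
from collections import Counter


def majority_class_model(train):
    """Baseline: always predict the most common training label."""
    most_common = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda features: most_common


def nearest_neighbor_model(train):
    """Predict the label of the closest training example."""
    def predict(features):
        def squared_distance(example):
            return sum((a - b) ** 2 for a, b in zip(example[0], features))
        return min(train, key=squared_distance)[1]
    return predict


def error_rate(predict, test):
    """Fraction of held-out examples the model gets wrong."""
    wrong = sum(1 for features, label in test if predict(features) != label)
    return wrong / len(test)


def pick_best(candidates, train, test):
    """Train every candidate and return the name with the lowest test error."""
    scores = {name: error_rate(build(train), test) for name, build in candidates.items()}
    return min(scores, key=scores.get), scores


# Toy stand-ins for hand data: two coordinates per example, two letters.
train = [((0.0, 1.0), "A"), ((0.1, 0.9), "A"), ((0.2, 1.1), "A"),
         ((1.0, 0.0), "B"), ((0.9, 0.1), "B")]
test = [((0.05, 1.0), "A"), ((1.1, 0.05), "B")]

candidates = {"majority_class": majority_class_model,
              "nearest_neighbor": nearest_neighbor_model}
best, scores = pick_best(candidates, train, test)
print(best, scores)
```

&lt;p&gt;On this toy data the nearest-neighbor candidate gets both held-out examples right, while the baseline gets one of two wrong. Whether the winner is good enough still depends on the application built around it.&lt;/p&gt;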
&lt;p&gt;Alright, let us move on to our second use case of the day. This is about forecasting energy load in the state of Texas.
So the problem here is if we pretend that I am operating a power grid I have to know the demand at various places in
order to be able to schedule the production of energy. We don't have excellent ways of storing energy for long periods
of time so you kind of have to get things scheduled to where they'll be used shortly after they're created. &lt;/p&gt;
&lt;p&gt;This isn't an entirely hypothetical problem. There is an organization that is known as the Electric Reliability Council of
Texas (ERCOT), and it is their job to manage the energy market in this state. For those of you who are here from out of town,
Texas has (in most places) a deregulated energy market where power is generated and then sold on a market. And power
companies buy it up and then they resell it to consumers. I'm not here to debate the advantages or disadvantages of
regulation in the energy market. I think that would be a much longer talk if we were here for that. But the gist is that
in a lot of places (Austin being a notable exception) the energy market is deregulated, and you have to know when the
demand is going to happen. &lt;/p&gt;
&lt;p&gt;So ERCOT publishes a lot of data. They publish at least the last 10 years of data (I think it's 14) on energy load on an
hourly basis in the various weather zones. So if we look at this, these colored regions are different weather zones
because it turns out that weather is the biggest factor affecting energy usage because air conditioners are a wonderful
blessing in my life but are also very expensive to operate. So they break this down on a weather zone by weather zone
basis and on an hourly basis and then they also provide a sum (but I didn't care about that as much). And you could plot
this on a graph and see over the last 14 years how much energy has been used on a daily basis in each of these different
weather zones. It's kind of interesting, honestly, even at this point, just looking at this graph. I mean, like, "Oh, people
are using more energy." That's kind of an interesting thing to see, and you can definitely see the seasonality: when
summer rolls around in Texas, people use a lot more energy. That's kinda interesting. &lt;/p&gt;
&lt;p&gt;I think you are all prepared for what is next. What kind of machine learning problem is this? Regression! Yes,
wonderful! This side of the room is killing it--y'all need to work on your game. &lt;/p&gt;
&lt;p&gt;So a simple approach here to solve this regression problem is to just find the five nearest days to you and say that
we'll take the average of those. And that's basically a k-nearest neighbors model where you find the five data points
closest to a certain data point and average them together and then that is the output. And it turns out that because
this is a time series, we can set the data point value (basically the input) to be the day's number within the year.
So for instance January 1st would be the first day of the year. Today is the hundredth day of the year. Happy 100 days
of 2018, everybody. And you can set that to be the input value, with the output value being whatever the energy load
should be. This is sort of a time series data problem and being able to turn that into a regression problem is
interesting. There's actually been a lot of study around that and this is a simple approach that is reasonable to go
with (I think). When you're evaluating time series data there's a lot of things to consider. We obviously can still look
at error rates. For a simple error rate, just take the difference between our predictions and what actually happened, take the absolute
value of that, and divide it by the actual number. Then we can see we're off by 3
percent on average which is fine. I mean it depends on how accurate you're trying to get. &lt;/p&gt;
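&lt;p&gt;A minimal sketch of that k-nearest-neighbors idea, using only the day number as the input, is below. The load values are invented, the helper names are hypothetical, and the sketch ignores the wrap-around between late December and early January. The error function is the "absolute difference divided by the actual number" rate, often called mean absolute percentage error.&lt;/p&gt;

```python
def knn_predict(history, day_of_year, k=5):
    """Average the load of the k days whose day number is closest."""
    nearest = sorted(history, key=lambda row: abs(row[0] - day_of_year))[:k]
    return sum(load for _, load in nearest) / len(nearest)


def mean_absolute_percentage_error(actual, predicted):
    """Average of |actual - predicted| / actual."""
    ratios = [abs(a - p) / a for a, p in zip(actual, predicted)]
    return sum(ratios) / len(ratios)


# Hypothetical (day number, load) observations; the units are arbitrary.
history = [(98, 40.0), (99, 42.0), (100, 41.0), (101, 43.0),
           (102, 44.0), (200, 70.0), (201, 72.0)]

print(knn_predict(history, 100))  # 42.0, the mean of days 98 through 102
```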
&lt;p&gt;But another specific thing you'll want to do when you're working with time series data is look at what's called the
residuals which is each of those individual data points where you take the predicted and the actual and you subtract
them. And the goal is that there isn't a pattern in there. If there's any sort of seasonality in them as you look at it,
you haven't quite fit the data as well as you could. The other thing you want to see is if your residuals resemble a
normal distribution. If they're skewed one way or the other then you may have made a mistake somewhere. There's a lot of
things I learned about this. The first thing I learned is that I did this in a very wrong way. You should really do a
lot of research about this stuff beforehand. There's actually a lot of research on how to do stuff with time series data
and the approach that I chose is actually not the most unreasonable way to do it but it isn't the best. There's a lot of
really good tools out there. For example, Facebook has a tool called Prophet that is built specifically for predicting time series
data. It's used at Facebook, we use it at my company, and there's a lot of places where it's used and works really well.
These libraries do a really good job of taking into account common things that happen in time series data. For instance
holidays happen and energy usage is going to be way different on a holiday. So that's something to keep in mind.&lt;/p&gt;
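&lt;p&gt;One crude way to look for leftover seasonality in residuals is to average them per month. This is only a sketch with made-up numbers and hypothetical helper names; it is not how a dedicated time series library checks its fit.&lt;/p&gt;

```python
import statistics


def residuals(actual, predicted):
    """Actual minus predicted for each data point."""
    return [a - p for a, p in zip(actual, predicted)]


def mean_residual_by_group(groups, values):
    """Average residual per group (for example per month)."""
    grouped = {}
    for group, value in zip(groups, values):
        grouped.setdefault(group, []).append(value)
    return {group: statistics.mean(vals) for group, vals in grouped.items()}


# Made-up loads and predictions for two winter days and two summer days.
months = ["Jan", "Jan", "Jul", "Jul"]
actual = [40, 42, 70, 74]
predicted = [41, 41, 65, 67]

errors = residuals(actual, predicted)
print(errors)                                  # [-1, 1, 5, 7]
print(mean_residual_by_group(months, errors))
```

&lt;p&gt;Here the January residuals average out to zero while the July residuals sit well above it, which is the kind of pattern that suggests the model is missing something seasonal.&lt;/p&gt;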
&lt;p&gt;The other thing I'll say is scaling the features is important. So for this problem my features were the day number
and the year that it was in. So there were two input features, and then I had my output. But the day number is on
a different scale than the year. The year runs from 2014 to 2018 or whatever. And that's a different range of values
than 1 to 366 potentially. That can cause problems when you're doing k-nearest neighbors stuff. Specifically, this is a
similar example. This is actually the credit card application data we were looking at earlier. But just to demonstrate
the problem when you don't scale your features. If we were trying to predict this yellow point on the left here you can
see it far to the left and then a little bit above the red X. We're going to be looking at the features around us to try
to find what point is closest to me that I can say that I'm going to be like that point. When we look at this with our
human eyes we say obviously the red X underneath it is the closest point. But it turns out that the way that the
features are scaled the net worth is such a larger number like just a bigger number than the age. &lt;/p&gt;
&lt;p&gt;Age only runs from 1 to 100-ish. Net worth can span a much larger range, and so that has a huge impact on the distance.
The number plotted next to each point on the left is the distance from the yellow point to that
point. So you see that actually the closest point is this one that's 50000 units away. And so it gets classified-- you
can see on the right it gets classified as an "approve this application" even though it doesn't quite look like we
should. What you can do, though: there's a thing in scikit-learn called StandardScaler, and it will take these things
and scale them to where the mean is zero and the standard deviation is one (which is really helpful in a lot of
circumstances). So when you look at this visually they look the same because they kind of are; it's just the actual
values have been scaled to a different range. And then when you look at the difference in how that ends up classifying,
it looks more like what we would expect to happen. So scaling your features is an important thing to do (especially when
you're doing something like k-nearest neighbors) but is also helpful when you're using other models. &lt;/p&gt;
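&lt;p&gt;To see why scaling changes the answer, here is a small plain-Python sketch. It is not scikit-learn and not the data from the slides: the applications are invented, and &lt;code&gt;fit_scaler&lt;/code&gt;, &lt;code&gt;scale&lt;/code&gt; and &lt;code&gt;nearest_label&lt;/code&gt; are hypothetical helpers. The scaling is the same idea as standard scaling: subtract each column's mean and divide by its standard deviation.&lt;/p&gt;

```python
import math
import statistics


def fit_scaler(rows):
    """Per-column mean and population standard deviation."""
    return [(statistics.mean(col), statistics.pstdev(col)) for col in zip(*rows)]


def scale(row, params):
    """Shift each value by its column mean and divide by its column deviation."""
    return tuple((value - mean) / std for value, (mean, std) in zip(row, params))


def nearest_label(query, points, labels):
    """Label of the point with the smallest Euclidean distance to the query."""
    distances = [math.dist(query, point) for point in points]
    return labels[distances.index(min(distances))]


# Hypothetical past applications: (age, net worth in dollars) and the decision.
points = [(22, 120_000), (70, 250_000), (65, 900_000), (60, 700_000), (30, 50_000)]
labels = ["deny", "approve", "approve", "approve", "deny"]
query = (20, 200_000)

# Raw features: net worth dominates the distance.
print(nearest_label(query, points, labels))                        # approve

# Scaled features: age and net worth contribute on a comparable footing.
params = fit_scaler(points)
scaled_points = [scale(point, params) for point in points]
print(nearest_label(scale(query, params), scaled_points, labels))  # deny
```

&lt;p&gt;With raw features the nearest neighbor is a 70-year-old whose net worth differs by 50,000; after scaling it is the 22-year-old, which is closer to what the eye picks out on the plot.&lt;/p&gt;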
&lt;p&gt;All right. This is a recent project that I've been working on to use machine learning to find your next job. The problem
that I ran into about a year ago was that I was passively job hunting.&lt;/p&gt;
&lt;p&gt;Basically, I wasn't out there actively knocking on doors and handing out resumes or anything. And I was reasonably
satisfied with where I was. But I was interested in hearing if there was a particularly excellent job out there that I
might want better. I couldn't find something that did exactly what I wanted, and it seemed like I was getting a lot of
noise coming through from just reading job listings. There were so many things that obviously I didn't want to look
into. So I was wondering if I could make this a machine learning problem. What I ended up doing was scraping a bunch of job
listings. I would get the title and the company and then a link and then for a long time whenever I was bored I would
just go to this spreadsheet on my phone, click on the link, read the job description and then come back and say whether
or not I thought it sounded cool. I gathered a bunch of data like this and if I'm being honest with you I probably spent
more time reading job descriptions this way than I would have if I didn't build this. But because all of you are here I
don't think any of you have room to talk about over engineering something. So if I can't talk about my love for this
here I don't think there's any safe place for me. &lt;/p&gt;
&lt;p&gt;Anyway, you should be familiar with this question by now, what kind of machine learning problem is this? &lt;/p&gt;
&lt;p&gt;Classification! Yes, wonderful. Thank y'all. I heard someone say clustering. It's not quite clustering because we do
have a specific output variable that we're trying to predict. We want to know whether a given job sounds cool or does
not. &lt;/p&gt;
&lt;p&gt;The way that I ended up solving this is kind of tricky. So if we look at the other problems we've seen today they're all
numerical data, right? We have a day number and a year. Those are obviously numbers. We have a net worth and a loan
size; those are numbers. Age is a number; net worth is a number. How do we turn a job title into a number? Computers
can't deal with text. You can't just throw text at a computer and have it know what to do. You have to find some way to
turn it into a number. And when I ran into this I was thinking what am I going to do here. I have all this data about
text but I don't know how to fix that. So I turned to our trusty friend Google and I searched "text representations for
machine learning" (which pretty much is the way that you should learn a lot of stuff: just search it up). And I
found this. This is an idea that is a good thing to try first. It's not state of the art by any means, but it's a good
first pass. It's called a word count vector, or people will call it a bag of words.&lt;/p&gt;
&lt;p&gt;And basically what you do here is you take all of your job titles, you find every word that occurs in any of the
job titles, and place those along the columns of a matrix. And then you place each job title along the rows of the
matrix, and then to fill in each slot in the matrix you look at the job title and the column and say, "Does the word in
the column appear in the job title on the row?" So for instance this first one: engineer does not occur in that title
but web does and applications does and senior does. And I'm not going to walk you through filling out this matrix
because even after 4 I'm a little bit bored of it. And one thing to note though is that while in job titles usually they
don't repeat words you theoretically could. So I put that last one I don't know who's going to post a job titled "Data
Data Data Data," but I'd be interested to hear about it. And so you can see where there are multiple occurrences of a word
it isn't just a one; it can be, you know, four or whatever. So that's basically how we can turn the text and the boolean
value into numbers. So that's this highlighted green part it becomes this series of numbers here and the highlighted
blue part becomes a number there. And then it's really surprisingly simple to do this stuff because a lot of this
functionality has already been built for us because there are giants upon whose shoulders we can stand and see much
further. &lt;/p&gt;
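&lt;p&gt;As a rough illustration of the matrix described above, here is a small plain-Python sketch that builds word count vectors by hand. The job titles and the &lt;code&gt;word_count_vectors&lt;/code&gt; helper are invented for this example; in practice scikit-learn does this work for you.&lt;/p&gt;

```python
from collections import Counter


def word_count_vectors(titles):
    """Return (vocabulary, matrix); each matrix row counts words in one title."""
    tokenized = [title.lower().split() for title in titles]
    # Every distinct word in any title becomes a column.
    vocabulary = sorted({word for words in tokenized for word in words})
    matrix = []
    for words in tokenized:
        counts = Counter(words)
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


# Hypothetical job titles, including one with a repeated word.
titles = [
    "Senior Web Applications Developer",
    "Data Engineer",
    "Data Data Data Data",
]
vocabulary, matrix = word_count_vectors(titles)
print(vocabulary)
for row in matrix:
    print(row)
```

&lt;p&gt;The last row has a 4 in the "data" column, which is the repeated-word case mentioned above.&lt;/p&gt;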
&lt;p&gt;So this is a really simple example where we take our rated jobs, pull out the titles, and then pull out whether or not
it sounded cool, and then scikit-learn has this tool called a count vectorizer which will take text data and turn it into
those word count vectors I was talking about. And then we can take that data and put it into a model and fit it with the
preexisting "sounds cool" or "not" data. All we have to do then is just predict on the data, and we get out this array
that I've highlighted at the bottom that says, "OK the first job in the list you gave me doesn't sound interesting but
the fourth one does." So that's the model I ended up building. And originally I was just doing this in a Jupyter
notebook, and I was just running through it and I got a classification error of 19, you know, around 20 percent, and I was like
heck yes, I'm God's gift to data science. This is going to be amazing. And then what I realized was it was just saying
that everything didn't sound cool. And what I realized is most of the job listings I was reading didn't actually sound
that interesting. And so I would just rate them as they didn't sound cool and the model picked up on that and it's like,
"I can do super well if I just say nothing sounds cool." &lt;/p&gt;
&lt;p&gt;So I committed what's called the base rate fallacy. And this is something that's important to understand when you're
approaching a problem like this is to understand what would happen-- what are the underlying rates in these problems.
Because I wasn't actually improving on anything. I was just doing as well as literally just guessing zero every time. So
this is a self-portrait I drew after I discovered that I made this problem. Dealing with imbalanced classes like this is
fairly common. And so I wanted to provide a little bit of insight into good ways that people do this and the way that I
ended up doing this. The first thing that you can do is use better error metrics. The only way I realized that I was
having this problem is because I knew to look for the base rate (and now all of you know to do that). But these are
metrics that will help you understand your data in a little different way. There's no one metric that's going to be
perfect for every situation, but having a family of them can help you understand what's going on much better.&lt;/p&gt;
&lt;p&gt;Precision and recall are related concepts. In our case precision means, of the job titles that I said sounded
cool, how many actually are cool; recall means, of all the job titles that do sound cool, how many am I saying sound cool. These
give you a better understanding of how you're doing in terms of like false negatives and false positives and stuff.&lt;/p&gt;
&lt;p&gt;The other thing that is useful is to use what's called the confusion matrix which you can see at the bottom here and
what you do there is you put the predicted values on one axis and the actual values on the other axis. And if I were to
do something like this I might see well I'm predicting zero-- and, actually, I filled this out wrong-- but I'm
predicting zero for everything, and I would notice that error much more quickly. &lt;/p&gt;
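&lt;p&gt;Here is a small sketch of how these metrics expose the "always predict zero" problem. The labels are made up (two cool jobs out of ten) and the helper names are hypothetical; scikit-learn ships its own versions of these metrics.&lt;/p&gt;

```python
def confusion_counts(actual, predicted):
    """Count the four cells of a 2x2 confusion matrix for 0/1 labels."""
    pairs = list(zip(actual, predicted))
    return {
        "tp": sum(1 for a, p in pairs if a == 1 and p == 1),
        "fp": sum(1 for a, p in pairs if a == 0 and p == 1),
        "fn": sum(1 for a, p in pairs if a == 1 and p == 0),
        "tn": sum(1 for a, p in pairs if a == 0 and p == 0),
    }


def precision(c):
    """Of the jobs predicted cool, what fraction really are cool?"""
    flagged = c["tp"] + c["fp"]
    return c["tp"] / flagged if flagged else 0.0


def recall(c):
    """Of the jobs that really are cool, what fraction did we predict?"""
    cool = c["tp"] + c["fn"]
    return c["tp"] / cool if cool else 0.0


# Made-up labels: 1 means "sounds cool".
actual = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
always_zero = [0] * len(actual)

counts = confusion_counts(actual, always_zero)
accuracy = (counts["tp"] + counts["tn"]) / len(actual)
print(counts)
print(accuracy, precision(counts), recall(counts))
```

&lt;p&gt;Accuracy looks respectable because most jobs are not cool, but recall is zero: the model never finds a single cool job.&lt;/p&gt;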
&lt;p&gt;Other than using better error metrics, one thing you can do is called "under sampling" and this is what I actually ended
up doing. I had 500 job titles (let's say) and only 100 of them sounded cool. I took all of those and then I took 100
randomly selected not cool job postings and I made a new dataset out of just those 200 and I trained a model on that and
that got me a much better accuracy rate for the jobs that did sound cool. Another technique that people do use is called
oversampling which is kind of the opposite. So if I have those 100 cool job postings I would take those and have four
copies of each of them, so that would give me 400 cool postings and then 400 not cool postings and I could just train my
model on all of that. I've never actually used that because I feel weird about doing that but it's something you can do
if you want to. &lt;/p&gt;
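&lt;p&gt;A minimal sketch of the undersampling step, assuming each row is a (title, label) pair. The dataset is synthetic, sized to match the numbers above, and the &lt;code&gt;undersample&lt;/code&gt; helper is a hypothetical name rather than a library function.&lt;/p&gt;

```python
import random


def undersample(rows, label_of, seed=0):
    """Keep every minority-class row plus an equal-sized random sample of the majority."""
    positives = [r for r in rows if label_of(r) == 1]
    negatives = [r for r in rows if label_of(r) == 0]
    minority, majority = sorted([positives, negatives], key=len)
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    sampled = rng.sample(majority, len(minority))
    balanced = minority + sampled
    rng.shuffle(balanced)
    return balanced


# Synthetic dataset: 500 titles, of which 100 are labeled cool (1).
rows = [("job %d" % i, 1 if i % 5 == 0 else 0) for i in range(500)]
balanced = undersample(rows, label_of=lambda row: row[1])
print(len(balanced), sum(label for _, label in balanced))
```

&lt;p&gt;The result has 200 rows split evenly between the two classes, so a model can no longer score well by predicting "not cool" for everything.&lt;/p&gt;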
&lt;p&gt;So in the end what I ended up doing was getting this all running in essentially a cron job on a remote computer and it
will every week email me just a list of the top 10 jobs that sound the most interesting. So this is another thing where
we talk about how do we want to use this model. If I were to just have it spam me all the jobs it thought sounded cool, that
would be more than I want to look at. But I chose a model that can predict the probability of something:
logistic regression is able to tell you how probable it is that a certain job sounds cool, and I could just pick the 10
that had the highest probability of sounding cool. That gives me a much shorter list to look over each
week. &lt;/p&gt;
&lt;p&gt;So some lessons that I learned from this. Obviously the first one is understand the base rate because that can really
make you sad. The second thing is that doing something simple doesn't mean that it's going to be ineffective. Do any of
you watch The Office, or have any of you watched the Office? OK, so there's a scene in there where Dwight is talking
about Michael and Michael is his boss and his boss comes to him and he says, "K I S S keep it simple stupid. It's great
advice and it hurts my feelings every time." &lt;/p&gt;
&lt;p&gt;And that's kind of how I felt about it. I'm like I want to do something cool, I want to do deep learning, man! But it
ended up just being good enough, and using a very simple model worked. &lt;/p&gt;
&lt;p&gt;So the approximation generalization tradeoff is a theoretical concept from machine learning that can help us understand
why this works. And as you might guess from the name it means that if you have more approximation you're going to have
less generalization; if you have more generalization you're going to have less approximation. Those words don't really
mean anything so I drew a graph that will help us understand. Again, here I've made up some data. The blue line is the
truth. And then I sampled some points from it with a little bit of random noise in there to again model the real world.
What I did was I fit two different models to it. One of them is a simple model, linear regression, which you probably
learned in a high school algebra or precalc class. And then one of them being a more complicated model which can
effectively memorize any data set that it wants to. What you see here is that for the points that I'm showing right now,
the red model is killin' it. It knows every single spot; it has zero error on those. It is approximating the data set
extremely well. It knows the training data by heart. &lt;/p&gt;
&lt;p&gt;However, what you may notice is that when we add more data to this it does not do as well on those points. What you see
is the green model doesn't do as well on the input data as the red model does, but it does much better on the out of
sample data (on the testing data). So we have this tradeoff where more simple models are generally better at
generalizing even though they're worse at approximating. So that's sort of why it's a good idea to start out with
something really simple and basic and work up from there. The other good reason to do this is that it's easier.
scikit-learn is a &lt;code&gt;conda install&lt;/code&gt; away versus, I mean in my experience, setting up TensorFlow is hard. And even once you
get it set up training stuff can be sad and hard and long, and it's just a lot easier to start out with something simple
that you can iterate on quickly and learn and learn a lot about your problem space before you go into something more
complicated. &lt;/p&gt;
&lt;p&gt;So we've just talked about a good amount of things. This is sort of a summary that we can view some key concepts from
each of these use cases we've talked about. For the teaching a computer sign language, what we ended up doing was
support vector machines (which is a model that is useful). It's built into scikit-learn. In the forecasting energy load
in Texas data, it was time series data and what we found was using k-nearest neighbors worked really well. 
&lt;/p&gt;
&lt;p&gt;However if you're doing time series data you should probably do some more research and probably use something like
Prophet that's specifically built for time series data. Then, for the last use case we just talked about: if you run into text
data, it's at least worth trying Bag of Words. It has its caveats; it has its downsides, but it's a good first step. And
I ended up using logistic regression and that works really well and I get the email every week and I'm happy with it. So
it works pretty well. &lt;/p&gt;
&lt;p&gt;So basically there are some takeaways I have here. And then some recommended tools that we'll talk through. The big
takeaways being (from the very beginning) in supervised learning, we want to use past examples to predict a continuous
value in the case of regression or a discrete value in the case of classification. And those two correspond with
questions like "how much of this thing?" or "what kind is this thing?". And then another huge takeaway is to try the
simplest thing that could possibly work. This is something that my machine learning professor tried to beat into our
heads and has proven to be very effective in my experience. Once you have that simple thing that is kind of working you
can always test it out and iterate and maybe try a different model maybe try a different set of features and work from
there. &lt;/p&gt;
&lt;p&gt;We've been kind of light on recommendations about specific tooling but just if you want a jumping off point Jupyter
notebook is a great tool that lets you interactively run models and train them on various datasets and see how they look
kind of. There are some plotting tools like matplotlib and Bokeh that will let you see into what the data sort of
looks like and can really help you get a better intuitive understanding for what's happening under the hood. Pandas is a
great library for manipulating tabular data which, actually, all the data we saw today was all tabular data in that it
had a set of rows and a set of columns. Pandas does a really good job of handling that kind of data. It can do things
like read from Excel spreadsheets and read from HTML tables and read from CSVs and whatnot. Obviously I recommend
scikit-learn. I used it for all of this stuff, and it's nice to not have to reimplement this stuff yourself.&lt;/p&gt;
&lt;p&gt;There's a lot more resources available if you're interested in this stuff. If you're interested in more of the
theoretical side I highly recommend a book called Learning from Data. It does a really good job of treating machine
learning theory with respect. A lot of times when we talk about machine learning it feels like we're just pulling stuff
out of a bag or pulling out a bag of tricks, and it's not really fair to think about it that way and there's a lot more
to it than that. This does a good job of helping you understand how that works. &lt;/p&gt;
&lt;p&gt;On the opposite side there is a blog called Practical Business Python that talks a lot about how to use these specific
tools, if you're hungering for more after this talk about how to specifically do stuff. He has a lot of great
resources about "how do I graph something? How do I read an Excel file?". It's really interesting, really good, solid,
extremely practical, detailed read there. Then the biggest thing I would say as far as gaining extra experience from
others is reading the Kaggle blog. They call it No Free Hunch (which is an adorable name) and they have a specific
section on it for winners' interviews which is where-- there's all these people who compete in machine learning
competitions and then whoever wins they'll do an interview of them and say what did you do. Reading through those is a
huge amazing resource that I don't think is being taken advantage of enough because you can learn from some of the best
data scientists in the world about how they do their job and then apply those in your own work. If you're interested to
hear a little bit more detail on the sign language or machine learning to find your next job part, these are links, and
I'll tweet out the slides in just a little bit. &lt;/p&gt;
&lt;p&gt;I have more information on my website, samueltaylor.org, if you're curious about those things. I'm also happy to talk to
you afterward if you have any things you want to talk about. I do work for Indeed, and I would be remiss not to thank
them for their support of me doing this kind of work and talking about this stuff in front of people. If you are looking
for a job please come talk to me. We like data stuff. Beyond that, again I'm Samuel Taylor. I prefer communicating over
email over pretty much anything else so if you have a question you're obviously welcome to come talk to me right now but
if maybe you're a little shy or just don't want to talk feel free to email me. I love reading email. I might be the only
person who loves email like that. And then also I am happy to read people's tweets-- if you have questions I'm happy to
take those via Twitter as well. I'm @SamuelDataT. Would love to hear from you. &lt;/p&gt;
&lt;p&gt;Thank you so much for letting me talk to you and take this time out of your day. I appreciate it so much. I really hope
you're able to get something out of this. We have about 5 minutes that I can take
questions if you have them or I'm also happy to talk about it after this, but thank you.&lt;/p&gt;</description><author>Samuel Taylor</author><guid isPermaLink="true">https://www.samueltaylor.org/articles/machine-learning-crash-course.html?utm_source=rss&amp;utm_campaign=st-blog-feed</guid><pubDate>Tue, 10 Apr 2018 17:54:24 GMT</pubDate></item><item><title>Work Queues in Software and Productivity</title><link>https://www.samueltaylor.org/articles/work-queues-software-productivity.html?utm_source=rss&amp;utm_campaign=st-blog-feed</link><description>&lt;p&gt;My first job at a real software company was after my sophomore year in college. As you might imagine, I learned a ton
that summer. I learned about how companies organize themselves. I learned how software teams organize around different
pieces of the product. I learned specific technical things. I learned how to read code. I learned about navigating large
codebases. But the thing that stuck with me most was a specific architectural decision that they had made: to
use what is called a work queue.&lt;/p&gt;
&lt;p&gt;In this article, we take a look at work queues in software and productivity. First, we'll examine what they are and look
into a real-world use case. After that, we'll talk about how to know when to use them. Finally, we'll think
beyond computers and apply them to improve our own personal productivity.&lt;/p&gt;
&lt;h1&gt;Explanation&lt;/h1&gt;
&lt;p&gt;To explain what a work queue is, let me give you a nontechnical example. Suppose that you are a high powered executive
named Alice, and your company has decided that you need an administrative assistant to handle writing emails, scheduling
calendar events, and other administrative tasks. They assign you an administrative assistant named Bob. You and Bob set up
a system for communicating what Bob needs to work on. As you're going about your day, when you run into something that
Bob can work on, you go to a large whiteboard. At the top of it you write, "Schedule meeting with Carol." And then you
go back to your desk and continue with your work. When something else comes up, you go back to the whiteboard and write,
"Email Dan to follow up on his project".&lt;/p&gt;
&lt;p&gt;As you're working at your desk, all Bob has to do in order to figure out what he should do next is look at the top of
the whiteboard and find the next thing he hasn't completed. While you're writing an important document, Bob will go
ahead and schedule that meeting with Carol. Once he's done with that, he'll send that email to Dan. If at any point the
whiteboard is empty when he tries to find a new task, he'll sit around twiddling his thumbs, not doing anything.&lt;/p&gt;
&lt;p&gt;What we've described here is the core of what a work queue implementation could look like. There would be a few
differences, obviously. Instead of an executive named Alice, we have a process A. This could be an Android application,
a Ruby on Rails web app, or what have you. This application hums along serving content to the user, and then
occasionally it sends out some work to another process. This is just like the way you would write on the whiteboard for
Bob, but instead of a human person named Bob, we have a "worker process" that we'll call Process B. Process B can handle
stuff in the background as process A is interacting with the user.&lt;/p&gt;
&lt;p&gt;The whiteboard in our example essentially &lt;em&gt;is&lt;/em&gt; the work queue itself. The idea is that process A can offload tasks to
process B.&lt;/p&gt;
&lt;h1&gt;Use case&lt;/h1&gt;
&lt;p&gt;Here's a real world example. A lot of times when you sign up for an account on a website, you're interacting with some
web application that they have running. When you register for an account, you enter a username, a password, and your
email address and hit a button that says "create account". One of the things that has to happen when you sign up for an
account is they have to confirm that that's actually your real email address by sending you a confirmation email. One
way they could do this is to have the application that is handling that sign up form just go ahead and send the email.
The problem with doing that is that sending email can take a while. If the web application itself sends the email, it's
not going to be able to send content to your screen while it's doing that. So you'll be sitting there wondering, "Why is
this page taking so long to load?". This is a terrible user experience! If we take a step back and think, there's
nothing essential about sending that email right then as the page loaded. We can really offload that to some other
process.&lt;/p&gt;
&lt;p&gt;To improve the user experience, we can use a work queue.  Rather than sending that confirmation email during the page
rendering process, we could put a piece of work in the queue to send that email. Of course, we'll keep the essential
application logic of creating a user in the database, but that happens pretty quickly, especially compared with the time
it takes to send an email.&lt;/p&gt;
&lt;p&gt;How would we implement such a system? Well, a common way is to have our main process produce messages to some sort of
message queue. There are lots of options here, but two common choices I've personally used are RabbitMQ and Apache Kafka.
The only thing we're missing now is what we've called
"Process B", or our "worker node". For this, we create a process that reads messages off the queue and does work based
on those messages. As an aside, if you're using Python, definitely check out &lt;a href="https://www.celeryproject.org/"&gt;Celery&lt;/a&gt;.&lt;/p&gt;
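&lt;p&gt;To show the shape of this, here is a toy sketch using only Python's standard library. It makes some simplifying assumptions: a thread stands in for the separate worker process, an in-memory &lt;code&gt;queue.Queue&lt;/code&gt; stands in for a broker like RabbitMQ or Kafka, and &lt;code&gt;create_account&lt;/code&gt; and &lt;code&gt;send_confirmation_email&lt;/code&gt; are hypothetical names that only record what they would have done.&lt;/p&gt;

```python
import queue
import threading

work_queue = queue.Queue()  # stands in for RabbitMQ, Kafka, etc.
sent = []                   # records the "emails" the worker has handled


def send_confirmation_email(address):
    # Placeholder for slow work; a real worker would talk to a mail server.
    sent.append(address)


def worker():
    """Process B: pull tasks off the queue until told to stop."""
    while True:
        address = work_queue.get()
        if address is None:  # sentinel meaning "no more work"
            break
        send_confirmation_email(address)


def create_account(address):
    """Process A: do the fast, essential work, then enqueue the slow part."""
    # (Saving the user to the database would happen here.)
    work_queue.put(address)
    return "account created"


thread = threading.Thread(target=worker)
thread.start()
print(create_account("alice@example.com"))
print(create_account("dan@example.com"))
work_queue.put(None)  # tell the worker to shut down
thread.join()
print(sent)
```

&lt;p&gt;Note that &lt;code&gt;create_account&lt;/code&gt; returns as soon as the message is on the queue; it never waits for the email to go out.&lt;/p&gt;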
&lt;h1&gt;Benefits&lt;/h1&gt;
&lt;p&gt;So why do I love work queues so much? The first reason is that they give us a chance to decouple things. Because we're
creating an interface wherein Process A just has to say, "Hey, I would like this thing to happen", that same process
doesn't have to know anything about the other process. Let's say Process A is a Python application. We probably start
off with workers written in Python, too. If we're smart about using language-agnostic serialization, we give ourselves
more flexibility in the future. If we need some library that's only available in Java or have some task that's really
best suited to Haskell, we can create worker processes in those languages. This gives us a lot of flexibility to choose
the right tool for the job.&lt;/p&gt;
&lt;p&gt;The other nice thing about using a work queue is that it can make scaling easy. While we've been calling it process B,
it doesn't have to be a single process, or even a single server. Say one day a ton of people start signing up for our
website. Our work queue starts getting longer and longer as Process B isn't able to keep up with all the work it needs
to do. An easy way to handle this issue is to start up 10 more instances of Process B. One nifty thing we can do is
dynamically scale the number of workers we have based on how many things are still in the queue. If our workers start to
fall behind, spin up a few more instances. If the queue is frequently empty, spin a few down.&lt;/p&gt;
&lt;h1&gt;When to use&lt;/h1&gt;
&lt;p&gt;Let's talk about when to implement a work queue. The key insight to note is that when we find a piece of work that is
easily parallelizable, that's a good candidate for this kind of system. In other words, if we encounter a problem where
we can break apart a large task into a number of similar subtasks, we could likely put those tasks into a queue. For
example, we might want to scrape a bunch of webpages. To do this, we could create a message that includes the URL of the
page we want to scrape and says, "Hey, scrape this thing". Then, we have one process spit URLs into the queue, and a
number of processes reading from that queue, scraping pages, and storing results in the database.&lt;/p&gt;
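&lt;p&gt;A rough sketch of that scraping setup, again with threads and an in-memory queue standing in for separate processes and a real message queue. The &lt;code&gt;scrape&lt;/code&gt; function is a stand-in that makes no network requests, the dictionary plays the role of the database, and the URLs are placeholders.&lt;/p&gt;

```python
import queue
import threading


def scrape(url):
    # Stand-in for a real HTTP fetch and parse; returns a fake result.
    return len(url)


def run_scrapers(urls, worker_count=3):
    tasks = queue.Queue()
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            url = tasks.get()
            if url is None:  # sentinel: this worker can stop
                break
            value = scrape(url)
            with lock:       # stands in for a database write
                results[url] = value

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for url in urls:         # the producer puts URLs on the queue
        tasks.put(url)
    for _ in threads:        # one sentinel per worker
        tasks.put(None)
    for t in threads:
        t.join()
    return results


urls = ["https://example.com/page/%d" % i for i in range(10)]
results = run_scrapers(urls)
print(len(results))
```

&lt;p&gt;Changing &lt;code&gt;worker_count&lt;/code&gt; is the only thing needed to add or remove workers, which is the scaling property described in the Benefits section.&lt;/p&gt;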
&lt;h1&gt;Productivity technique&lt;/h1&gt;
&lt;p&gt;Beyond being a nifty technical tool, I've been able to find applications for this in my working life.  That example at
the beginning (where Alice was farming out work to Bob) is actually pretty similar to how I operate day-in and day-out.
Except instead of farming out work to an administrative assistant, I'm farming out work to future me. Basically, when I
encounter something that is a little chunk of work that I know I can do later and that is going to knock me off task
right now, I write it down in a list.  I set a specific time each day to go look at the list and knock out all the
things I need to do. This technique has helped me be more productive, because batching little tasks like this all
together means that during the course of my day, I make fewer costly context switches between deep, analytical tasks and
more administrative tasks.&lt;/p&gt;
&lt;p&gt;Further, I have found that the amount of context in those analytical tasks is usually much greater than the context for
an administrative task. That means that if we group the administrative tasks together, switching between them may still
result in the same number of context switches, but they are each less costly.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;I will forever be grateful to the people I worked with during that summer four years ago. Getting exposure to common
patterns and concepts has been immensely helpful in my work as a software engineer, and I hope that hearing about this
idea will help you solve a problem some day.&lt;/p&gt;</description><author>Samuel Taylor</author><guid isPermaLink="true">https://www.samueltaylor.org/articles/work-queues-software-productivity.html?utm_source=rss&amp;utm_campaign=st-blog-feed</guid><pubDate>Wed, 28 Feb 2018 00:37:51 GMT</pubDate></item><item><title>Using DISTINCT ON in Postgres</title><link>https://www.samueltaylor.org/articles/postgres-distinct-on.html?utm_source=rss&amp;utm_campaign=st-blog-feed</link><description>&lt;p&gt;Every once in a while, I'll have a need to do a one-to-many join, but keep only a certain row in the "many" table. For
instance, say we have a system to track inventory in retail stores:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;SERIAL&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;name&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;inventory&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;store_id&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt; &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;quantity&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;item_name&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;store_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;item_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;School Supplies R Us&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Grocery Mart&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;inventory&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Backpack&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Pencil&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Pen&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Egg&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Flour (lb.)&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We can get the inventory for all stores easily enough.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;item_name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;inventory&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;inventory&lt;/span&gt;&lt;span class="mf"&gt;.&lt;/span&gt;&lt;span class="n"&gt;store_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="mf"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;pre&gt;&lt;code&gt;         name         | quantity |  item_name
----------------------+----------+-------------
 School Supplies R Us |        1 | Backpack
 School Supplies R Us |       12 | Pencil
 School Supplies R Us |        4 | Pen
 Grocery Mart         |       12 | Egg
 Grocery Mart         |        1 | Flour (lb.)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;But what if we only want the item with the highest quantity from each store? Fortunately, Postgres has a syntax that
makes this easy.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;store_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;item_name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;inventory&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;inventory&lt;/span&gt;&lt;span class="mf"&gt;.&lt;/span&gt;&lt;span class="n"&gt;store_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="mf"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;store_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;quantity&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;pre&gt;&lt;code&gt;         name         | quantity | item_name
----------------------+----------+-----------
 School Supplies R Us |       12 | Pencil
 Grocery Mart         |       12 | Egg
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;What does &lt;code&gt;DISTINCT ON&lt;/code&gt; do? It keeps the first row out of each set of rows whose values match for the given
columns. Which row comes first is unpredictable unless we add an &lt;code&gt;ORDER BY&lt;/code&gt; clause. Note that the &lt;code&gt;ORDER BY&lt;/code&gt; has to
start with the columns from the &lt;code&gt;ON()&lt;/code&gt; clause. If it doesn't, we get a helpful error message:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;store_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;item_name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;inventory&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;inventory&lt;/span&gt;&lt;span class="mf"&gt;.&lt;/span&gt;&lt;span class="n"&gt;store_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="mf"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;quantity&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;pre&gt;&lt;code&gt;ERROR:  SELECT DISTINCT ON expressions must match initial ORDER BY expressions
LINE 1: SELECT DISTINCT ON (store_id) name, quantity, item_name
                            ^
&lt;/code&gt;&lt;/pre&gt;
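&lt;p&gt;To make the mechanics concrete, here is a rough Python sketch of the same idea: sort by the grouping column and then by the tie-breaking rule, and keep the first row of each group. It only imitates what &lt;code&gt;DISTINCT ON&lt;/code&gt; does and is not how Postgres implements it. The store ids and the &lt;code&gt;distinct_on_store&lt;/code&gt; helper are assumptions made up for illustration.&lt;/p&gt;

```python
from itertools import groupby
from operator import itemgetter

# Rows of (store_id, store_name, quantity, item_name). The ids are assumed
# for illustration; the names and quantities mirror the sample output above.
rows = [
    (1, "School Supplies R Us", 1, "Backpack"),
    (1, "School Supplies R Us", 12, "Pencil"),
    (1, "School Supplies R Us", 4, "Pen"),
    (2, "Grocery Mart", 12, "Egg"),
    (2, "Grocery Mart", 1, "Flour (lb.)"),
]


def distinct_on_store(rows):
    """Keep the highest-quantity row per store, imitating DISTINCT ON (store_id)."""
    # Equivalent of ORDER BY store_id, quantity DESC
    ordered = sorted(rows, key=lambda r: (r[0], -r[2]))
    # Equivalent of DISTINCT ON (store_id): first row of each store_id group
    return [next(group) for _, group in groupby(ordered, key=itemgetter(0))]


for _, name, quantity, item_name in distinct_on_store(rows):
    print(name, quantity, item_name)
```

&lt;p&gt;The sort has to lead with the grouping column for the same reason Postgres insists on it: the rows of a group must sit next to each other before "the first row" of that group means anything.&lt;/p&gt;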
&lt;p&gt;If you run into a situation wherein you need to choose a specific row in a group based on some rules, try using
&lt;code&gt;DISTINCT ON&lt;/code&gt;. For more detail, check out the &lt;a href="https://www.postgresql.org/docs/10/static/sql-select.html#SQL-DISTINCT"&gt;Postgres documentation&lt;/a&gt;.&lt;/p&gt;</description><author>Samuel Taylor</author><guid isPermaLink="true">https://www.samueltaylor.org/articles/postgres-distinct-on.html?utm_source=rss&amp;utm_campaign=st-blog-feed</guid><pubDate>Mon, 12 Feb 2018 00:39:08 GMT</pubDate></item><item><title>Work-Self Balance</title><link>https://www.samueltaylor.org/articles/work-self-balance.html?utm_source=rss&amp;utm_campaign=st-blog-feed</link><description>&lt;p&gt;It's a few months ago. I'm enjoying my job and trying to bring value to my team. A high-visibility project comes up.
Despite being unfamiliar with the framework we plan to use, I'm excited to work on this project. I set out to "learn by
doing", implementing a chunk of the project with this framework. After reading some code, writing some new code by trial
and error, and asking a lot of questions, I'm able to get this chunk finished. More exciting than that, my
implementation serves as a model for much of the rest of the project. Because I know the framework well at this
point, I'm teaching the rest of the team how to use it. I love knowing that I'm bringing so much value to the team!&lt;/p&gt;
&lt;p&gt;Then, I submit a piece of code for review. One of my team members notices a flaw in it. Immediately, my thoughts rush to
how I can justify the flaw. Underlying this reaction is the belief that this flaw reveals my own incompetence. I start
to type a response. But as I write, I realize that my coworker is right. I thought I knew everything there was to know
about this framework, but clearly I don't. I delete my response and fix the problem instead.&lt;/p&gt;
&lt;p&gt;Since then, I've found a healthier way to think of my work. Let's examine the origins of and problems with my initial
belief so that we can find a healthier alternative.&lt;/p&gt;
&lt;p&gt;The core belief I identify as an issue in the above story is a lack of separation between my work and my self. Creating
software can be a deeply personal enterprise. When I'm in the zone, it can feel like the code somehow
emanates from my being rather than being something I actively write. Given this understanding, criticism of my work is also
criticism of myself, my character, and my abilities.&lt;/p&gt;
&lt;p&gt;This belief (though not one I consciously chose) is harmful. If criticism is painful, human nature says to avoid it.
Unfortunately, avoiding criticism means avoiding learning because the best learning can come from making mistakes and
fixing them.&lt;/p&gt;
&lt;p&gt;We can choose a healthier relationship to our work. Specifically, I find it helpful to mentally separate my code from my
sense of self. In other words, I avoid tying my ego up in my work outputs.&lt;/p&gt;
&lt;p&gt;This mental model is more true to the realities of software development. On a daily basis, I am faced with countless
constraints. Perhaps I must complete a task within a given timeframe. Maybe I have to use a specific tool. These
constraints mean that the work I produce cannot be considered to be solely a reflection of my character or abilities. In
some way, the work also embodies the constraints I was under while creating it. Given an infinite amount of time and
resources, I'm sure all of us would create impeccable and beautiful software. However, in a world of constraints, our
work is less likely to be perfect.&lt;/p&gt;
&lt;p&gt;How can we rein in our ego? I've noticed that as soon as I am aware that my ego is flaring up, it's easy to see how
silly I'm being. To this end, I've found two activities helpful: journaling and meditation. Journaling consists of
sitting down in the evening a few times per week and writing about what has happened in the preceding few
days&lt;sup&gt;1&lt;/sup&gt;. By intentionally recalling and reviewing my actions, I'm able to be more objective in my view of
myself.&lt;/p&gt;
&lt;p&gt;Meditation also helps foster the lens of an unbiased observer. When I meditate, I get deliberate, intense practice at
noticing my feelings. Bringing this awareness to the rest of my day becomes easier as I continue this practice. And this
awareness allows me to notice when my ego acts up so that I can respond accordingly&lt;sup&gt;2&lt;/sup&gt;.&lt;/p&gt;
&lt;p&gt;When I'm able to separate my work from my self, I can respond more productively to criticism of my work. I can see that
the criticism isn't aimed at me. Instead, something I produced was found to be lacking in some way. Previously, I might
waste energy either beating myself up or trying to justify a mistake. But now I can focus on making the necessary
improvement.&lt;/p&gt;
&lt;p&gt;I still love bringing value to my team. But I've realized how crucial it is to stay humble and how valuable it is to
understand that my work is separate from my self. Making a mistake reveals not that I am incompetent, but that I'm
human. And really, aren't we all?&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;&lt;strong&gt;1&lt;/strong&gt;: If you're interested in journaling, I highly recommend Timothy Wilson's
&lt;a href="https://www.goodreads.com/book/show/11516274-redirect"&gt;&lt;em&gt;Redirect&lt;/em&gt;&lt;/a&gt;, which explores a bunch of interesting research in
self narrative.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;2&lt;/strong&gt;: I'm sure there are other ways of increasing this kind of self-awareness. &lt;a href="mailto:sgt@samueltaylor.org"&gt;Let me
know&lt;/a&gt; if you have any tips on what has worked for you in the past.&lt;/p&gt;</description><author>Samuel Taylor</author><guid isPermaLink="true">https://www.samueltaylor.org/articles/work-self-balance.html?utm_source=rss&amp;utm_campaign=st-blog-feed</guid><pubDate>Wed, 03 Jan 2018 12:58:11 GMT</pubDate></item><item><title>Poetry for Software Engineers</title><link>https://www.samueltaylor.org/articles/poetry-for-software-engineers.html?utm_source=rss&amp;utm_campaign=st-blog-feed</link><description>&lt;p&gt;&lt;img src="/static/img/poetry_fountain_pen.jpg"&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Photo by &lt;a href="https://unsplash.com/photos/hjwKMkehBco"&gt;Álvaro Serrano&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;I am not a poet. But when I was barely three years old, I was sitting at the kitchen table with my mother and older
sister. They were working on rhyming words, because that is apparently the stage of childhood development my sister was
in. My sister was struggling a little. My mom would ask her, "What's a word that rhymes with ball?" My sister replies,
"bend." Mom says, "No, we're trying to make the end of the word sound the same. So a word that rhymes with ball would be
fall. What are some words that rhyme with sky?" and my sister, with a confused look on her face, replies, "Soon?
Sleep?". Mom says, "No, remember, we want the &lt;em&gt;ends&lt;/em&gt; of the words to sound the same, not the beginning. So 'fly' or
'pie'. What are some words that rhyme with moon?" she asks. My sister says, "My?" and to hear my mom tell the story, my
little voice pipes up, "Spoon! Loon! June!" which of course prompts my sister to give me an angry look and shout, "Shut
up, Sam!"&lt;/p&gt;
&lt;p&gt;I learned two things that day. First, it is immensely enjoyable to pick on one's older siblings. Second, I learned that
words are interesting. They're fun. I like them. I like the way they sound when you say them and the power they give to
express ideas.&lt;/p&gt;
&lt;p&gt;In the time between that story and now, I've grown to enjoy poetry. This comes as a surprise to some who know me as a
very analytically-minded person. And I get it–I mean, I'm a software engineer. Still, I've found a lot of joy in
poetry. Beyond that, I think it's made me a better engineer.&lt;/p&gt;
&lt;p&gt;Today we're going to talk about why I think you should read poetry. To do this, we're going to talk about how to read a
poem, the ways that beauty is related to software, and a few things we can learn from poets.&lt;/p&gt;
&lt;p&gt;So how &lt;em&gt;do&lt;/em&gt; we read a poem? First, we need to be sure we're in the right mindset. In our culture, it's common for us to
decide whether we like something immediately. For example, a friend of mine went to see the movie Wonder Woman, and I
asked him whether it was a good movie. He replied, "Yeah, I liked it." But that's not exactly what I was asking; the
quality of a piece of art or media is different from an individual's opinion about that piece. When we read poetry, it's
important to understand what it's doing before we decide whether we like it or not. If you decide to read some poetry
after this, I encourage you to read charitably. Try to understand what the author is saying and how she or he is saying
it before you do anything else.&lt;/p&gt;
&lt;p&gt;Once we're in the right mindset, we can dive into reading. For me, it's helpful to have a bit of a process to go
through, and I imagine it can make this a little less weird if you are a so-called "left brain" person. I typically read
through the poem three times:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;On the first reading, I'll pick up a pencil and read through the poem in my head. When I get to any words that I don't
  know the exact dictionary definition of, I circle them. Then I look up these words in a dictionary. Poets spend a lot
  of time choosing the words they use, so it's important to understand what those very specific word choices mean.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Step two is to read the poem out loud. As I'm reading through, when I notice any words that stick out to me, I draw a
  little dot next to them. At this point, I'm not worrying about why they stick out, just keeping note that they do.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;On my third reading, I look for allusion to other work, metaphor, imagery, and other literary devices.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Once I've done these three readings, I'm in a better spot to understand what the poet is saying. Let's do a practice
run. This is a poem by Langston Hughes called "The Negro Speaks of Rivers".&lt;/p&gt;
&lt;p&gt;I’ve known rivers:&lt;br /&gt;
I’ve known rivers ancient as the world and older than the&lt;br /&gt;
    flow of human blood in human veins.  &lt;/p&gt;
&lt;p&gt;My soul has grown deep like the rivers.  &lt;/p&gt;
&lt;p&gt;I bathed in the Euphrates when dawns were young.&lt;br /&gt;
I built my hut near the Congo and it lulled me to sleep.&lt;br /&gt;
I looked upon the Nile and raised the pyramids above it.&lt;br /&gt;
I heard the singing of the Mississippi when Abe Lincoln &lt;br /&gt;
    went down to New Orleans, and I’ve seen its muddy &lt;br /&gt;
    bosom turn all golden in the sunset.&lt;/p&gt;
&lt;p&gt;I’ve known rivers:&lt;br /&gt;
Ancient, dusky rivers.&lt;/p&gt;
&lt;p&gt;My soul has grown deep like the rivers.&lt;/p&gt;
&lt;p&gt;As I read through this, I circled the words "Euphrates" and "Congo"; I know they're rivers, but I'm not 100% sure where
they actually are or what they mean. I also circled "dusky", because I don't know the exact definition. Let me look
these up.&lt;/p&gt;
&lt;p&gt;OK, "dusky" means "somewhat dark in color; &lt;em&gt;specifically&lt;/em&gt; : having dark skin". The Euphrates is a river in modern-day
Iraq, and the Congo is in Africa.&lt;/p&gt;
&lt;p&gt;Second reading. In the line "older than the flow of human blood in human veins", I put dots next to both occurrences of
the word human. The word "bathed" in "bathed in the Euphrates" also sounded interesting, so I dotted that. I also dotted
the word "golden" in "I've seen its muddy bosom turn all golden in the sunset".&lt;/p&gt;
&lt;p&gt;Time for the third reading. On this pass, I underline "the Euphrates" as an allusion to the Biblical creation narrative.
"The singing of the Mississippi" is an interesting image, as is "muddy bosom turn all golden in the sunset."&lt;/p&gt;
&lt;p&gt;Now we're in a better place to understand what this all means. Langston Hughes lived in a deeply segregated America, and
I find him making a powerful argument for a sense of common heritage and equality among all people. When he uses the
word "I", he's referring to himself–a black man living in the Jim Crow society of the time. To say "&lt;em&gt;I&lt;/em&gt; bathed in the
Euphrates" is to paint a picture of the first man in the Bible as black. I imagine this was a shocking image to the
society of the time, which had a "One Drop Rule" that meant if a person had even one drop of black blood in them, they
were subject to the brutal segregation of Jim Crow. In this poem, Hughes shows that we are all human, thereby
undermining the subjugating logic of Jim Crow.&lt;/p&gt;
&lt;p&gt;I find this poem to be beautiful. Its construction is clearly well thought out, and the story it tells is a persuasive
argument against segregation. But it may seem like we're on a bit of a rabbit trail here–how does any of this relate to
software? Allow me to answer your question with a story.&lt;/p&gt;
&lt;p&gt;Have you ever heard of "fizz buzz"? It's a well-known interview question in which you're supposed to print out the
numbers 1 through 100, except that when a number is divisible by 3 you print "fizz", when it's divisible by 5 you print
"buzz", and when it's divisible by both you print "fizz buzz". So the sequence goes 1, 2, fizz, 4, buzz, fizz, 7, 8, fizz,
buzz, 11, fizz, 13, 14, fizz buzz, and so on.&lt;/p&gt;
&lt;p&gt;The specific code to solve this problem could take a variety of shapes, but a straightforward Python implementation took
around 9 lines. By contrast, there's a satirical GitHub repository called &lt;a href="http://www.fizzbuzz.enterprises"&gt;Fizz Buzz Enterprise
Edition&lt;/a&gt; that consists of 1,387 lines of Java code spread across 89 files. Fizz Buzz
Enterprise Edition is an exercise in using every design pattern you possibly can regardless of whether it actually
improves the code.&lt;/p&gt;
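&lt;p&gt;For reference, one straightforward way to write it in Python might look like the sketch below. It is not necessarily the exact script mentioned above, and the &lt;code&gt;fizz_buzz&lt;/code&gt; helper name is just an illustrative choice.&lt;/p&gt;

```python
def fizz_buzz(n):
    """Return the fizz buzz word for n, or n itself as a string."""
    if n % 15 == 0:  # divisible by both 3 and 5
        return "fizz buzz"
    if n % 3 == 0:
        return "fizz"
    if n % 5 == 0:
        return "buzz"
    return str(n)


for n in range(1, 101):
    print(fizz_buzz(n))
```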
&lt;p&gt;I gave a presentation on the subject of poetry to some coworkers at one point, and I showed them the 9-line Python
script followed by a single one of those 89 files in the Enterprise Edition. I asked them, "Which of these codebases is
better?". The response was unanimous, of course: the Python script was better. When I asked them why they thought so, it
took only a few seconds for someone to say the word "elegant." &lt;/p&gt;
&lt;p&gt;And therein lies the answer to your question. When we start talking about what makes good code, the discussion quickly
gets to the idea of "elegance", which I would say is a specific way of talking about beauty. In the software industry,
we like to think of ourselves as innovators creating novel inventions, but when we start talking about beauty, we're
incredibly late to the party. While some in this industry lambaste the humanities as useless, human beings have been
trying to understand beauty for thousands of years. We would be foolish to throw away all we've learned.&lt;/p&gt;
&lt;p&gt;Being a great engineer involves writing great software. And ideally, our software is elegant and beautiful. To get a
sense for what those terms mean, we can turn to poetry. I've found that as I read more poetry and understand how it's
constructed, I'm able to apply that knowledge to structuring my team's software. Understanding the way that a poet
specifically chooses her or his words to fit the intention is fascinating and informs the way I choose to name
functions, variables, and classes.&lt;/p&gt;
&lt;p&gt;This takes us into our third point: we can learn a lot from poets. One day, the poet Ezra Pound was sitting in a subway
station in Paris. As he looked around at the people walking through the station, he was overcome by a unique feeling.
Like any good poet, he strived to capture that emotion in a poem. His first version was 30 lines. After six months, he'd
carefully crafted and whittled it down to 15 lines. A year later, he published the poem &lt;em&gt;In a Station of the Metro&lt;/em&gt;,
which reads:&lt;/p&gt;
&lt;p&gt;The apparition of these faces in the crowd;&lt;br /&gt;
Petals on a wet, black bough&lt;/p&gt;
&lt;p&gt;Whether you like this poem or not, it's impressive in its ability to convey an image and an emotion by connecting two
seemingly unrelated images. In the process of slowly and carefully carving away the excess and cruft from the poem,
Pound is able to compress a lot of information into fourteen words.&lt;/p&gt;
&lt;p&gt;We should strive to be more like poets. Good poets are able to express their ideas in precise language that communicates
clearly. A huge part of the way they do this is by choosing their words carefully. This process is directly applicable
to our work creating software. We're trying to encode some real-world process or use case into code that communicates
our intentions to future maintainers. If we're sloppy with the way that we name things or structure our programs into
functions, classes, or packages, the things we create will do a much worse job communicating the idea we have in our
head to future maintainers.&lt;/p&gt;
&lt;p&gt;Poetry is a useful hobby for software engineers. By helping us to better understand beauty and communication, poetry
helps us develop skills that are directly applicable to creating good software.&lt;/p&gt;
&lt;p&gt;I want to leave you with one last poem. Before I read this, I'm going to ask you to do something. I'm not trying to be
manipulative, but I'm hoping to help you understand this author better. Take a few moments and think of someone who
means a lot to you; it could be a family member or a close friend, but think about all the things that they mean to you
and ways they improve your life.&lt;/p&gt;
&lt;p&gt;The Gate, by Marie Howe&lt;/p&gt;
&lt;p&gt;I had no idea that the gate I would step through&lt;br /&gt;
to finally enter this world&lt;/p&gt;
&lt;p&gt;would be the space my brother's body made. He was&lt;br /&gt;
a little taller than me: a young man&lt;/p&gt;
&lt;p&gt;but grown, himself by then,&lt;br /&gt;
done at twenty-eight, having folded every sheet,&lt;/p&gt;
&lt;p&gt;rinsed every glass he would ever rinse under the cold&lt;br /&gt;
and running water.&lt;/p&gt;
&lt;p&gt;This is what you have been waiting for, he used to say to me.&lt;br /&gt;
And I'd say, What?&lt;/p&gt;
&lt;p&gt;And he'd say, This—holding up my cheese and mustard sandwich.&lt;br /&gt;
And I'd say, What?&lt;/p&gt;
&lt;p&gt;And he'd say, This, sort of looking around.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;Thanks for reading. I would love for you to tweet your favorite poem to me; I'm
&lt;a href="https://twitter.com/SamuelDataT"&gt;@SamuelDataT&lt;/a&gt;. Now get out there and read some poetry!&lt;/p&gt;</description><author>Samuel Taylor</author><guid isPermaLink="true">https://www.samueltaylor.org/articles/poetry-for-software-engineers.html?utm_source=rss&amp;utm_campaign=st-blog-feed</guid><pubDate>Sun, 10 Dec 2017 22:33:25 GMT</pubDate></item><item><title>Monte Carlo Simulation with Categorical Values</title><link>https://www.samueltaylor.org/articles/monte-carlo-categorical.html?utm_source=rss&amp;utm_campaign=st-blog-feed</link><description>&lt;p&gt;We live in a world of imperfect information. When faced with a lack of data, we can make a guess. This guess could be
far from the truth; it could be spot on. In Monte Carlo simulation, we repeatedly draw guesses for an unknown value
from some distribution, then summarize the results of that simulation to understand a little bit more
about the unknown. While any one guess may be far from the truth, in aggregate those outliers don't have as much of an
effect.&lt;/p&gt;
&lt;p&gt;I ran into a situation where I was gathering some data with some level of imperfection. My stakeholder wanted to know
what the impact of that imperfection on the important metrics would be. I could have made a guess, but instead I turned
to the data. Initially, I thought to calculate the best case and worst case scenarios. This idea is useful in that it
gives you a range on what you don't know, but it's also beneficial to know how likely each of those scenarios (and
the ones in between) is. That's where Monte Carlo simulation comes in handy.&lt;/p&gt;
&lt;p&gt;For context, let's use a contrived example. Suppose I run a car dealership, and a major hailstorm
rolled through last weekend. Some of my cars suffered major damage and will incur a 10% loss in their value, some suffered
minor damage incurring a 5% loss, and some suffered no damage at all. I don't have enough labor to survey every single
one of my cars, but I do want to know how much money I can expect to lose. I randomly select 500 of my 1000 cars and see
how bad the damage was.&lt;/p&gt;
&lt;p&gt;Hypothetically, it's possible that the 500 cars I inspected were the only ones damaged by the
hail (maybe the rest were safe in my 500-car warehouse). That would be the best case scenario, in which I suffer
only the loss on the cars I inspected. In the worst case scenario, every car I didn't inspect suffered major damage. For
the random data I generate below, the damage done to the uninspected cars is therefore somewhere between $0
and around $1.75 million. This is a huge range of possibilities!&lt;/p&gt;
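&lt;p&gt;The arithmetic behind those bounds can be sketched as follows. The figures are assumptions for illustration: 500 uninspected cars, an average value of 35,000 dollars (the mean the generated data further down is drawn around), and the 10% loss for major damage.&lt;/p&gt;

```python
# Assumed figures for illustration: 500 uninspected cars worth about
# 35,000 dollars each on average, and a 10% loss for major damage.
N_UNINSPECTED = 500
MEAN_VALUE = 35_000
MAJOR_DAMAGE_PCT = 0.10

# Best case: none of the uninspected cars was damaged.
best_case_loss = 0
# Worst case: every uninspected car suffered major damage.
worst_case_loss = N_UNINSPECTED * MEAN_VALUE * MAJOR_DAMAGE_PCT

print(f"Additional loss is between {best_case_loss:,} and {worst_case_loss:,.0f} dollars")
```

&lt;p&gt;The actual worst case depends on the values of the specific cars that went uninspected, so it will land near this number rather than exactly on it.&lt;/p&gt;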
&lt;p&gt;Looking at the best case and worst case gets us some bounds on how bad the damage could be, but we have no
idea how probable each of these outcomes is (let alone the probability of the values in the middle). To find out how bad
the damage is likely to be, we can turn to Monte Carlo simulation.&lt;/p&gt;
&lt;p&gt;First, let's randomly generate our inventory, then inspect half of our cars. While we know the &lt;code&gt;value&lt;/code&gt; of each car, we
think of the &lt;code&gt;damage_pct&lt;/code&gt; as an unknown value for the cars we do not inspect.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="kn"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="kn"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="kn"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;plt&lt;/span&gt;

&lt;span class="n"&gt;N_CARS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;
&lt;span class="n"&gt;cars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;value&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;35000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;N_CARS&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;damage_pct&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;N_CARS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;# Inspect a random half of the inventory; the remaining cars stay uninspected&lt;/span&gt;
&lt;span class="n"&gt;inspected_cars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cars&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cars&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;uninspected_cars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cars&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;cars&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inspected_cars&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;

&lt;span class="c1"&gt;# Share of inspected cars at each damage level, used as that level&amp;#39;s probability&lt;/span&gt;
&lt;span class="n"&gt;damage_dist&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inspected_cars&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;damage_pct&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
               &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;inspected_cars&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;value&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;prob&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We can see the distribution of &lt;code&gt;damage_pct&lt;/code&gt; among the sampled cars is&lt;sup&gt;1&lt;/sup&gt;:&lt;/p&gt;
&lt;table&gt;
&lt;tr&gt;
  &lt;th&gt;damage_pct&lt;/th&gt;
  &lt;th&gt;prob&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;-0.10&lt;/td&gt;
  &lt;td&gt;0.096&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;-0.05&lt;/td&gt;
  &lt;td&gt;0.490&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;0.00&lt;/td&gt;
  &lt;td&gt;0.414&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
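&lt;p&gt;If it helps to see what that table represents without pandas, the same kind of empirical distribution is just each category's count divided by the sample size. The ten inspection results below are made up for illustration and are not the data from this post.&lt;/p&gt;

```python
from collections import Counter

# Hypothetical inspection results for ten cars (not the data from this post).
inspected_damage = [-0.1, -0.05, -0.05, 0, 0, -0.05, 0, -0.05, -0.05, 0]

counts = Counter(inspected_damage)
total = len(inspected_damage)

# Empirical probability of each category: its count over the sample size
damage_dist = {damage: count / total for damage, count in counts.items()}

for damage in sorted(damage_dist):
    print(damage, damage_dist[damage])
```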

&lt;p&gt;Because we inspected a random subset of the cars, a reasonable simplifying assumption is that the damage to the
uninspected cars has the same distribution as that of the inspected cars. With that assumption, we can simulate the
damage done to the uninspected cars like so:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;simulate_damage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;damage_dist&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;uninspected_cars&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;damage_pct_guess&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;damage_dist&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                        &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;uninspected_cars&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                                        &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;damage_dist&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prob&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;damage_pct_guess&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;uninspected_cars&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;N_SIMULATIONS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;
&lt;span class="n"&gt;simulated_damages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Series&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;simulate_damage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;damage_dist&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;uninspected_cars&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                               &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;N_SIMULATIONS&lt;/span&gt;&lt;span class="p"&gt;)])&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;To make sense of the output of these 1000 simulations, we can calculate some descriptive statistics. It's also helpful
to look at the CDF of the simulated damage.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;simulated_damages&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="n"&gt;simulated_damages&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;hist&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cumulative&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                       &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;CDF of estimated damage&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;pre&gt;&lt;code&gt;count      1000.000000
mean    -595662.074433
std       25097.417355
min     -671499.963571
25%     -613690.059583
50%     -595382.298010
75%     -576089.086686
max     -524043.037134
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src='/static/img/mcs_cdf.png'&gt;&lt;/p&gt;
&lt;p&gt;We can say that in half of our simulations, the damage was somewhere between $671,499.96 and $595,382.30 (the minimum
and the median of the simulated values). This range is about 4.3% the size of the range between the best and worst case
scenarios.&lt;/p&gt;
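&lt;p&gt;As a rough sketch of how such a window can be read off programmatically, the snippet below uses the pandas
&lt;code&gt;quantile&lt;/code&gt; method. The series here is a synthetic stand-in (an assumption, not the simulation output
above), so its numbers will differ from the ones quoted in this article.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np
import pandas as pd

# Synthetic stand-in for the simulation output (not the article's data).
rng = np.random.default_rng(0)
simulated_damages = pd.Series(rng.normal(loc=-600_000, scale=25_000, size=1000))

# Lowest half of the simulations: from the minimum (0th percentile) to the median.
low, high = simulated_damages.quantile([0.0, 0.5])
share_inside = simulated_damages.between(low, high).mean()
print(f'window: {low:,.2f} to {high:,.2f} ({share_inside:.1%} of runs)')
&lt;/code&gt;&lt;/pre&gt;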
&lt;p&gt;How'd we do? Because we made up the dataset, we can calculate the true value and put that number on the graph above:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;true_damage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uninspected_cars&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;uninspected_cars&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;damage_pct&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;bottom&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylim&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;annotate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;xy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;true_damage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;xycoords&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;data&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
             &lt;span class="n"&gt;xytext&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;true_damage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bottom&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;textcoords&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;data&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
             &lt;span class="n"&gt;arrowprops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;facecolor&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;red&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headlength&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img src='/static/img/mcs_cdf_wtruth.png'&gt;&lt;/p&gt;
&lt;p&gt;In "reality", the true amount of damage done was $607,830.94, which just so happens to be in that window of 50% of our
simulations.&lt;/p&gt;
&lt;p&gt;We can run this experiment a few more times to see how this method fares:&lt;/p&gt;
&lt;p&gt;&lt;img src='/static/img/mcs_many_cdf.png'&gt;&lt;/p&gt;
&lt;p&gt;The next time you're trying to reason about some unknown value, consider using Monte Carlo simulation to inform your
decision-making process.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;1: While the values for &lt;code&gt;damage_pct&lt;/code&gt; look like numerical data, remember that they are representative of the three
categories of damage sustained (none, minor, or major).&lt;/p&gt;</description><author>Samuel Taylor</author><guid isPermaLink="true">https://www.samueltaylor.org/articles/monte-carlo-categorical.html?utm_source=rss&amp;utm_campaign=st-blog-feed</guid><pubDate>Sun, 03 Dec 2017 15:55:29 GMT</pubDate></item><item><title>Use Machine Learning to Find your Next Job</title><link>https://www.samueltaylor.org/articles/use-machine-learning-to-find-your-next-job.html?utm_source=rss&amp;utm_campaign=st-blog-feed</link><description>&lt;div class="embed-responsive"&gt;&lt;iframe width="560" height="315" src="https://www.youtube.com/embed/HR1ptrLMxA0" frameborder="0" allowfullscreen&gt;&lt;/iframe&gt;&lt;/div&gt;

&lt;p&gt;Delivered at:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.devspaceconf.com/"&gt;DevSpace&lt;/a&gt; on 14 Oct 2017. Slides available &lt;a href="/static/pdf/find_job_ml.pdf"&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href="http://datascicon.tech/"&gt;DataSciCon&lt;/a&gt; on 30 Nov 2017. Slides available &lt;a href="/static/pdf/datascicon_find_job_ml.pdf"&gt;here&lt;/a&gt;.
  Video available &lt;a href="https://www.recallact.com/presentation/use-machine-learning-find-your-next-job"&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Transcript&lt;/h2&gt;
&lt;p&gt;A wise man once said, "I got ninety-nine problems," and I can relate to that in some sense. Because on a day to day
basis I run into problems; I run into things that aren't as easy as they should be or things that I want to be better.
And I suspect that because you are all in this room on a Saturday you also have problems, you also run into things and
want to use software to solve them. Today I want to talk about ways that we can use software to solve our problems and
specifically to give those software solutions some intelligence using data.&lt;/p&gt;
&lt;p&gt;Now the motivating example and the one for which this talk is titled is a job search helper thing that I made. Basically
what happened was a few months ago I was passively job searching, which is to say that I wasn't actively out there
knocking on people's doors and handing out résumés, but I was curious to see if there were any particularly excellent
jobs in my area. I went out and tried to sign up for different job alert things that would give me the coolest jobs and
I couldn't find anything that did exactly what I wanted it to do. Like any good Engineer I decided to build it myself. I
built this little email newsletter that I would send to myself every week that essentially had the coolest sounding jobs
in my area. I could go through and just review those jobs. It was basically a way to filter out a lot of the noise. And
so we're going to use this as sort of a case study today, to talk about a process that I've gotten to use a few times
that has worked for me. I wanted to share it with you all to hopefully provide some value in your own lives.&lt;/p&gt;
&lt;p&gt;So this is how we're going to be doing that. The astute among you have probably noticed we are currently in the
introduction. After that we're going to talk about asking the right question; basically phrasing questions in ways
computers can help us answer them. Once we do that we'll talk about ways to gather the data, and then we'll analyze the
data. Finally, we'll deploy the insights that we gather, and here I don't mean deploy in the sense of we're going to put
the code on a server somewhere. That part is interesting, but more relevant to this discussion is how do we get a number
to be something interesting to a person and express in a way that people can understand it.&lt;/p&gt;
&lt;p&gt;So as for me I'm originally from this part of California (Bakersfield, CA). It's the really boring part, but it was a
great place to grow up, and then I left and went to Baylor University where I studied Computer Science. I really enjoyed
my time there, and while I was there I sort of got bit by the data bug and got to do some research in an autonomous
drone lab that was really exciting, do some research with collaborative filtering (which is a recommender systems kind
of thing) that got me started down the path of the thing that I ended up building for this.  Another relevant thing I
got to do while I was there was I taught a computer sign language which was really fun. And then over the summers I
would go do internships in various places and learn a lot about software and good engineering practices. I tried to
unify this all together, and then after I graduated went and started doing some data engineering work, and now I
actually work at Indeed (which is the world's largest job site).  Interestingly enough literally everything that I'm
going to talk about today is completely unrelated to my job there (other than the fact that I do data stuff there too)
but all the code that I wrote for this I actually wrote while I was at my prior company. But the most important thing on
this map is the fact that we're all here in this room today to talk about this stuff, and I'm really glad that you are
all able to come and honored by your presence here today.&lt;/p&gt;
&lt;p&gt;So that's all the boring stuff–let's get in to the cool part! The first step in this process is to have a problem. This
is the easiest step because it's the one that just comes naturally. It's the one that you bump into on a day by day
basis. For me, I bumped into the problem that job alerts are too noisy. There's too many jobs for me to reasonably look
over in a short amount of time. I've also run into problems where I was trying to figure out how to get my energy bill
lower or trying to figure out how to get home from work faster. Once you have a problem the next thing you're going to
want to do is start to think about solutions. In order to do that, you need to understand the ways we can ask computers
questions and get useful answers back from them, which leads us to the fun buzzword of the day: Machine Learning.&lt;/p&gt;
&lt;p&gt;If you have heard the phrase "Machine Learning" please raise your hand. Alright, yeah it's a buzzword we've all heard.
If you feel like you have used machine learning in a substantial or interesting way, raise your hand. Awesome, some more
great hands. Alright, one last question: if the phrase approximation-generalization tradeoff means anything to you,
please raise your hand. That's fine we'll talk about it later, I just wanted to know what to go into. Thanks for your
participation there.&lt;/p&gt;
&lt;p&gt;So what is machine learning? There are a few different things that comprise it, and I'm going to talk about a subset of
it today. There are a few kinds of algorithms broken into a lot of different categories, but this is good enough for
today. There's a type of algorithm called a supervised algorithm in which you are basically feeding training data into a
computer. That training data has a number of features that are like input values and then a number of output variables.
Here on this graph what you see is that the X axis is age and the Y axis is net worth. The example problem here is
basically: I'm a bank and people are coming to me asking for a line of credit. I'm trying to decide whether to extend
them a line of credit or not. One way you could theoretically do this is to just look at your past history and say,
"Okay when people have come to us in the past and asked for credit, how old were they? What was their net worth? And did
we extend credit?" That's what this graph is displaying: the age and then the net worth. The plus or minus you can think
of as a one or zero of whether or not we extended them a loan. And then the machine learning part of this is basically
draw a line through this data. It doesn't have to be a line but a line works for this so we're just drawing a line. And
you'll see here that because in the past we had someone who is ninety and did not have very much money who came to us
and asked for a loan and we rejected them, if someone who is similarly aged and similarly wealthy came to us, they will
be below this line, so we would not extend them a line of credit. That is an example of a classification problem,
because there are classes involved. There is a "positive" (we did extend them a loan) or a "negative" (we didn't extend
them a loan) kind of question. Classification is great for when you're trying to find out what kind of thing something
is.&lt;/p&gt;
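&lt;p&gt;As a rough sketch of the classification idea (not code from the talk): the applicant history below is made up, and
using scikit-learn's &lt;code&gt;LogisticRegression&lt;/code&gt; to "draw the line" is an assumption made for illustration.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up history: [age, net worth in thousands of dollars] per applicant.
X = np.array([[25, 40], [32, 150], [41, 300], [55, 800],
              [63, 20], [70, 500], [90, 10], [38, 15]])
# 1 = we extended credit ("plus"), 0 = we did not ("minus").
y = np.array([0, 1, 1, 1, 0, 1, 0, 0])

model = LogisticRegression(max_iter=1000)
model.fit(X, y)  # learn a line separating the two classes

# A new applicant similar to the ninety-year-old without much money.
new_applicant = np.array([[88, 12]])
decision = model.predict(new_applicant)[0]
print('extend credit' if decision == 1 else 'do not extend credit')
&lt;/code&gt;&lt;/pre&gt;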
&lt;p&gt;Now, if I were a bank trying to decide how much credit I should extend to a person, I would have to use a regression
algorithm. Regression is very similar to classification in that you still have a number of input features and an output
of some sort. Here it's a little bit confusing because here only the X axis is an input. On this last slide both X and Y
were inputs and then the plus and minus was the output, but here we're just saying that the X axis is our input of net
worth. Someone comes up to us and they say, "I have five hundred thousand dollars, how much of a loan can I get?" The
"X" symbols aren't significant (they're just to mark position), but you can see (for instance) someone with a low
net worth, down near that bottom left hand corner, only got a loan of a thousand dollars because that was a more
risky person (for instance). As a bank, I have all this historical data, and I can train some sort of algorithm that
would again draw a line, and we could then say if someone comes up to us and has a net worth of seven hundred fifty
million dollars, we can look at where they would land on the X position of the line. The Y value would then be the size
of loan we extend them. Here it looks like seventy five thousand dollars or something. So that is another kind of
supervised machine learning algorithm.&lt;/p&gt;
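&lt;p&gt;A minimal sketch of the regression version, again with made-up numbers (not the data on the slide) and with
scikit-learn's &lt;code&gt;LinearRegression&lt;/code&gt; assumed as the line-drawing algorithm:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up history: net worth and the loan size granted, both in thousands of dollars.
net_worth = np.array([[50], [120], [300], [500], [750], [1000]])
loan_size = np.array([1, 10, 30, 52, 75, 98])

model = LinearRegression()
model.fit(net_worth, loan_size)  # fit a line through the historical points

# Read the loan size off the line at a new applicant's net worth.
applicant = np.array([[500]])
suggested_loan = model.predict(applicant)[0]
print(f'suggested loan: about {suggested_loan:.0f} thousand dollars')
&lt;/code&gt;&lt;/pre&gt;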
&lt;p&gt;There are also unsupervised algorithms. One such algorithm is called clustering and in clustering you have a bunch of
data points, and I apologize that this is not the same example, but you can imagine that each of these dots is a
customer. The X axis could again be their age and Y axis could again be the net worth. Maybe it's computationally
prohibitive to do the calculation on the full data set, but you could theoretically cluster people and say there are
(for instance) eleven kinds of people. Then, depending on the kind of person you are, we could make a decision based off
of that. In clustering, you're not trying to get a specific output, you're just trying to understand the data better. It's
often useful as a preprocessing step. You might again have someone come in and they have a certain net worth and certain
age. You could say this person is really similar to this other kind of person that we usually extend credit to, so
let's extend credit to this person.&lt;/p&gt;
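&lt;p&gt;A small sketch of clustering, assuming scikit-learn's &lt;code&gt;KMeans&lt;/code&gt; (the talk does not name a specific
algorithm) and a handful of made-up customers. Three groups are used here instead of eleven to keep it short.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np
from sklearn.cluster import KMeans

# Made-up customers: [age, net worth in thousands of dollars].
customers = np.array([[22, 15], [25, 30], [27, 20],
                      [45, 400], [50, 450], [48, 420],
                      [70, 900], [72, 950], [68, 880]])

# Ask for three "kinds of people"; no output labels are provided.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kinds = kmeans.fit_predict(customers)

# A new customer is assigned to whichever existing group they most resemble.
new_customer = np.array([[47, 430]])
new_kind = kmeans.predict(new_customer)[0]
print('cluster labels:', kinds)
print('new customer looks like group', new_kind)
&lt;/code&gt;&lt;/pre&gt;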
&lt;p&gt;There's a lot of other stuff in the field of machine learning that is time prohibitive to talk about today. So this
third of this slide is here to prevent angry tweets because there's a lot of stuff that is really interesting that just
doesn't quite fit in today.&lt;/p&gt;
&lt;p&gt;So once we know the kinds of questions we can ask a computer, we can figure out a way to phrase our question. In my
example, I'm thinking, "OK, job alerts are too noisy for me. What do I want? I want to know what are the coolest jobs.
OK, well maybe I can ask a computer. Given my input of a job title, give me the output: does it sound cool or not (just
as a one or a zero)." And that's a way we can phrase our question in a way a computer can actually help us with. So now
that we have this formulation of our problem, we can jump into gathering our data.&lt;/p&gt;
&lt;p&gt;There's a lot of data out there, and the best thing to do is just search for it. Go out and Google it. For instance, one
time I was trying to determine my energy usage, and I thought it was probably going to be correlated with weather. I was
looking for weather data, and there is this government agency called the NOAA that has a big weather data set that you
can just download and use. And so it's very likely that you'll get out there and search for something and there's
already a government agency whose job it is to collect this data, which is really exciting because it means you then
have to do less work. In the case where you don't find something that already exists through searching, you can also try
various websites. data.world is one, Kaggle datasets also has a similar feel where they have a bunch of existing
datasets about usually more broad things. They don't tend to be as specifically relevant if that makes sense, although
they'll have things like crime data on their website. That may or may not be useful to you, but if you're trying to
figure where to buy an apartment and you want to look at crime statistics, that dataset might already exist.&lt;/p&gt;
&lt;p&gt;So you may or you may not find the data you need, and if you don't you're going to have to create it at some point. I
like using spreadsheets for this because I can have them on my computer and I can have them on my phone, and anywhere I
am I can collect more data. Other than that there's a tool called If This Then That that can be useful, especially when
you're collecting data on your own personal habits. For instance, when I was trying to find out when the best time to
leave my office was to minimize my commute time, you can get a little button that IFTTT will make for you where when you
click it, it'll log your location and the current time to a Google Sheet. So what I would do was when I left the office,
I would press the button then when I got home I would press the button again. In that way I could calculate how long it
took me to get from my office to my home and at what time I left. And then I could have all this data about, okay you
can leave at this time (that's your input value) and it took you this long (that's the output value). Now I know that
Google Maps can also do this for me, but I'm a nerd and we are at a developer conference so I think it's fair to
over-engineer something.&lt;/p&gt;
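&lt;p&gt;To make that concrete, here is a sketch of turning such a log into input and output values with pandas. The
column names and timestamps are invented for illustration; the actual sheet layout may differ.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import pandas as pd

# Made-up button presses: one row per press, alternating office and home.
presses = pd.DataFrame({
    'pressed_at': pd.to_datetime([
        '2017-10-02 16:45', '2017-10-02 17:20',
        '2017-10-03 17:30', '2017-10-03 18:25',
        '2017-10-04 18:15', '2017-10-04 18:45',
    ]),
    'place': ['office', 'home', 'office', 'home', 'office', 'home'],
})

# Pair each departure with the arrival that follows it.
departures = presses[presses['place'] == 'office'].reset_index(drop=True)
arrivals = presses[presses['place'] == 'home'].reset_index(drop=True)

commutes = pd.DataFrame({
    # Input: time of day we left, as a decimal hour (16.75 is 4:45 pm).
    'left_at_hour': departures['pressed_at'].dt.hour + departures['pressed_at'].dt.minute / 60,
    # Output: how long the trip took, in minutes.
    'minutes': (arrivals['pressed_at'] - departures['pressed_at']).dt.total_seconds() / 60,
})
print(commutes)
&lt;/code&gt;&lt;/pre&gt;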
&lt;p&gt;Beyond that, web scraping is another great tool. This is basically downloading a website and picking out the
important bits. There are some legal things here, and I am not a lawyer, so do your own lawyer stuff but an important
tip is that when you're trying to scrape a website, look at their robots.txt. Whatever website you're on, take the
domain name and append &lt;code&gt;/robots.txt&lt;/code&gt; and it'll have a listing of basically the places you're not supposed to go if
you are a computer. Please obey that and you're probably fine, but again I'm not a lawyer and this is not legal advice.&lt;/p&gt;
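&lt;p&gt;Python's standard library can check those rules for you. The sketch below parses an example rule set from text so
that it runs offline; the rules, the bot name, and the &lt;code&gt;example.com&lt;/code&gt; URLs are all placeholders.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from urllib.robotparser import RobotFileParser

# Example rules, parsed from text so the sketch runs offline.
# A real scraper would call set_url(...) and read() on the site's own file.
rules = """
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# The bot name and URLs are placeholders.
jobs_ok = parser.can_fetch('my-job-bot', 'https://example.com/jobs/123')
private_ok = parser.can_fetch('my-job-bot', 'https://example.com/private/data')
print('jobs page allowed:', jobs_ok)
print('private page allowed:', private_ok)
&lt;/code&gt;&lt;/pre&gt;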
&lt;p&gt;And maybe the case is that you combine these two methods. That's exactly what I am doing in this project. I web scraped
a bunch of job titles, and then when I had spare time on the bus or something I could click through the links on my
phone, read the description and come back to say whether or not the jobs sounded cool. Columns A through D here are
existing data and then column E is the augmented data that I'm creating myself.&lt;/p&gt;
&lt;p&gt;You're going to need to clean this data.  I heard someone speaking at a conference and they said, "Fifty percent of data
science is cleaning data." And when he got done he had all these people coming up to him that said, "That's ridiculous!
At my job it's eighty percent!" There's two tools that I highly recommend if you're in the Python ecosystem: Pandas
(which does a great job of loading data into a tabular format in memory). I've heard it described as "in memory SQL".
And then scikit-learn has some stuff built in to massage data into a format that computers can more easily understand
that we'll get to in a moment.&lt;/p&gt;
&lt;p&gt;Now you may remember this graph from before that had numeric data. Computers are good at numbers; computers aren't as
good at words. You may think, "Well, if I had someone's age and net worth, I can easily see how those are just numbers. But
for something like a job title, that is different. That doesn't feel like I can just type that into a computer and have
it fit that into a graph because I don't even know how that mapping would work." And so we want to introduce something
where we take as input the job title and as output whether or not it sounds cool, then turn it into some set of numbers.
The great thing is that when you run into a problem like this there are a wealth of giants whose shoulders you can stand
upon. You can just Google "text representation for machine learning" and out will pop this probably. This is the idea of
word count vectors or "bag of words." Essentially what's happening here is you'd take all of your job titles and you
keep track of either all of the words that were used in every single one or maybe the three hundred that are used most
frequently. Then you stack them all up, go through each job title, and count how many times each word occurred in
the given job title. So we can see for this first job title "Senior Web Applications Developer" that the word "Engineer"
occurs zero times in this job title and the word "Web" occurs one time, etc. I'm not going to bore you by enumerating
this matrix but you see how this process works. "Word count vectors" is a fancy way of saying strings of numbers that
count up how many times a given word is in a given job title, and that gives us exactly what we're looking for. We can
now go from the job title and the output variable (of whether or not it sounds cool) to this set of numbers where these
first ten numbers are the number of times a given word occurs (so maybe that first number is "senior") and that last
number there is a one because that job title sounded cool to me.&lt;/p&gt;
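&lt;p&gt;Here is a small sketch of word count vectors using scikit-learn's &lt;code&gt;CountVectorizer&lt;/code&gt;. Only the first
job title comes from the slide; the other two are made up so the matrix has more than one row.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from sklearn.feature_extraction.text import CountVectorizer

# The first title is the one from the slide; the rest are illustrative.
titles = [
    'Senior Web Applications Developer',
    'Data Engineer',
    'Senior Data Engineer',
]

# Passing max_features=300 would keep only the 300 most frequent words.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(titles)

# Columns are the vocabulary (lowercased, alphabetical); rows are job titles.
vocabulary = list(vectorizer.get_feature_names_out())
matrix = counts.toarray()
print(vocabulary)
print(matrix)
&lt;/code&gt;&lt;/pre&gt;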
&lt;p&gt;And so now we can start actually analyzing this data which is great. There's a few tools that I recommend for doing this
kind of work. Jupyter is really great; it's an interactive programming thing. Essentially you run it on your computer,
and you can load a browser up and do stuff, and it'll show you the output of it immediately (which is super helpful).
It's nice being able to see what the data looks like and it's nice to be able to understand what your next step should
be. You can also do neat things like drawing graphs, such as the one shown in this screenshot. The maintainers of this
project have put in a ton of work to make it basically the de facto, interactive, iterative programming tool for data
science and data analysis people who are using Python. I spend a lot of my day in Jupyter Notebooks at work.&lt;/p&gt;
&lt;p&gt;I also definitely recommend Pandas and scikit-learn (which we talked about earlier). It's nice to not have to
re-implement all these algorithms from scratch because other people have already done it for you.&lt;/p&gt;
&lt;p&gt;So this is just a little code example to show you how easy this kind of stuff can be. Often times we talk about machine
learning and it sounds really scary and foreign. But when you actually look at the code you'll realize this is something
anyone can do. This is not complicated, it's just a little bit of understanding how these algorithms work and then
reading some documentation. I often call things X and Y because I'm just used to that nomenclature so I take our job
titles out of the dataset that I have and I put them into this X matrix. I take whether or not they sound cool and put
that in this Y vector. The next line is a CountVectorizer (it just does that word counting thing that we were talking
about earlier) and then you can just say, "OK, take this matrix and turn it into the word counts." Then you create a
model, you fit the data to it and then you can just call &lt;code&gt;.predict&lt;/code&gt; on it and it will give you this beautiful array
(that I have here bolded) that says, "Job zero did not sound interesting, and then job four does sound interesting." You
can get this all out very easily; it's not a lot of code to get a lot of value.&lt;/p&gt;
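&lt;p&gt;The slide itself is not reproduced in this transcript, so the following is only a sketch of the steps just
described. The job titles and labels are made up, and &lt;code&gt;LogisticRegression&lt;/code&gt; is an assumed choice of model.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up labeled data standing in for the scraped job titles.
jobs = pd.DataFrame({
    'title': ['Data Scientist', 'Machine Learning Engineer', 'Sales Associate',
              'Senior Data Engineer', 'Retail Sales Manager', 'Account Manager'],
    'sounds_cool': [1, 1, 0, 1, 0, 0],
})

X = jobs['title']        # inputs: job titles
y = jobs['sounds_cool']  # outputs: 1 if the job sounded cool, else 0

vectorizer = CountVectorizer()
X_counts = vectorizer.fit_transform(X)  # turn titles into word counts

model = LogisticRegression()  # assumed model; the talk does not name one here
model.fit(X_counts, y)

new_titles = ['Data Engineer', 'Sales Manager']
predictions = model.predict(vectorizer.transform(new_titles))
print(predictions)
&lt;/code&gt;&lt;/pre&gt;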
&lt;p&gt;The thing you want to do after you've been able to gather your data is just do the simplest thing that could possibly
work.  There's good reasons for doing this. Earlier, no one knew what the Approximation-Generalization Tradeoff was. My
hope is that you are about to learn. The idea here is that the better your algorithm approximates the input dataset, the
worse it is going to do at generalizing to data that is outside of your input data.  That's a little hard to just say
and have it be understood, so I made a little graph that I think will help. In the process of making this graph, I first
generated a true dataset. You can see here the blue line and this basically says that when we enter zero the value that
comes out is zero, and then at the far right end of the scale when we entered ten we expected negative twenty two out.
It is a very simple function that we're trying to estimate with our machine learning stuff (it's purely for
illustration). And you can see then I generated ten data points from this blue line by just adding a little bit of noise
to each point because the real world is fairly noisy for a number of reasons.  And then I fit two different models. The
red line is a nearest neighbor model (which is a more complicated model than the linear regression model), and you can
see that it does an excellent job of representing the dataset that we gave to it.  It is matching perfectly at every
single data point there which is great, but you can also see that it is not very close to the truth. If you were to draw
more data points from the same truth we would basically find that the red line isn't doing a good job of generalizing to
data that it has not seen yet. This would be like if you were taking a practice test for a math class and all you did
was memorize the right answers to each question. You would do great at the practice test, but once you got to the real
test you would do horribly (because you don't actually know how to do any of it, you just know the right answers).&lt;/p&gt;
&lt;p&gt;By contrast linear regression is doing a much worse job of approximating the input data set.  If you see for X value
equal zero it's roughly five units underneath that data point and similarly from the range like two, three, four it's
not doing a great job either of approximating the input data.  However one thing that you'll notice is that line on
average looks a lot closer to the truth than the red line does and the reason is that it is better able to generalize to
out of sample data. What we can see here is that the red line is doing what is called overfitting, which is when you
learn too much of the noise in your data.  And that's a real problem that is easy to run into, especially when
you're using complicated methods. There's a lot of really interesting stuff on Hacker News that will have you believe
that you should use TensorFlow and PyTorch and all these really interesting and exciting deep learning frameworks. And
they are all very interesting and very exciting and have great applications, however they are often more prone to
overfitting than some simpler models. So it's great to just start with something simple and you can move on from that if
you need to.&lt;/p&gt;
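&lt;p&gt;A sketch of that comparison is below. The true line (zero at zero, negative twenty two at ten) matches the graph
described above, while the noise level, the random seed, and the use of a one-nearest-neighbor regressor are assumptions.
The nearest neighbor model reproduces its training points exactly, so its training error is zero; the more telling
number is the error on fresh points, where the simpler model typically does better.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

def truth(x):
    # The "blue line": 0 at x = 0, falling to -22 at x = 10.
    return -2.2 * x

# Ten noisy training points drawn from the truth (noise level is an assumption).
x_train = np.linspace(0, 10, 10).reshape(-1, 1)
y_train = truth(x_train).ravel() + rng.normal(scale=3.0, size=10)

# Fresh points from the same truth that neither model has seen.
x_test = rng.uniform(0, 10, size=200).reshape(-1, 1)
y_test = truth(x_test).ravel() + rng.normal(scale=3.0, size=200)

models = {
    'nearest neighbor': KNeighborsRegressor(n_neighbors=1),
    'linear regression': LinearRegression(),
}
errors = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    train_error = np.mean((model.predict(x_train) - y_train) ** 2)
    test_error = np.mean((model.predict(x_test) - y_test) ** 2)
    errors[name] = (train_error, test_error)
    print(f'{name}: train error {train_error:.2f}, test error {test_error:.2f}')
&lt;/code&gt;&lt;/pre&gt;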
&lt;p&gt;Another benefit is it's often just easier. It's a lot easier to get scikit-learn running on a Mac or a Windows computer
than it is to get PyTorch or TensorFlow running. Aside from that, when you're iterating through this development process
it's a lot easier to have something that trains really fast and you can try out a bunch of different ways of
representing your data or different ways of sampling it (and have that be fast) rather than something that takes a long
time to train. What this means in practice is: just start with something simple. Linear regression and logistic
regression are both good places to start; you can use linear regression for regression problems and logistic regression
for classification.&lt;/p&gt;
&lt;p&gt;So we've gotten to the point here with this that we're in the deployment process. By this I mean that we have these
numbers right? We got those numbers out of our model (the zero and the one), but if I were to just look at zero, one,
zero, one, zero, zero, one, that doesn't do anything for me. And I also don't want to have to run that code myself every
time so one thing that I was thinking (because I am my own user, and I can kinda read my own mind) was I wanted to build
this email which you already saw earlier (spoiler alert, sorry) but basically the thought is I don't want to run this
thing on my own and I want to get just the relevant and interesting jobs. So what I did was just put it onto a server
that I rent and have it run every week, and then send me just the jobs that are interesting. It formats them a little
nicely and puts them into an email and sends them off to me. And this is a good thing to do: just build something
simple and ship it. Get it working, because the next step here is to test out our product. (If you've gone to any of the
talks that people have been doing about agile practices, this should sound familiar, because using data in the software
does not make it any less software.) We still need to test out our product, we still need to try it on
actual users, we still need to figure out what doesn't work and what does work, we still need to iterate and we still
need to iterate again and try something new. And we need to iterate even another time, and we just keep moving and keep
trying new things until we get to something that works really well.&lt;/p&gt;
&lt;p&gt;So to summarize because that was a good amount of things:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Step one is to have a problem. Find something that you want to solve that sucks in your life.&lt;/li&gt;
&lt;li&gt;Once you have that thing, phrase it in a way that a computer can help you answer it. There's a lot of things
   computers are good at. If you have a problem that you can rephrase as a "how much?" or "what kind of?" problem, those
   are really great candidates for a machine learning application.&lt;/li&gt;
&lt;li&gt;Once you do that, you're ready to get some data.&lt;/li&gt;
&lt;li&gt;Try the simplest thing you possibly can and see how well that works. You can test out and iterate from there. But
   it's really important to get this in front of users and to just try new stuff.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;So that is the gist of what I have for you today. If you want to learn more, there's a textbook called Learning from
Data that is really excellent. It does a good job teaching machine learning in a good way. A lot of times when people
talk about machine learning, it comes off as a bag of tricks, but this book does a good job of helping you understand
some of the theoretical underpinnings that help make these algorithms work. And then from a less academic but more
practical side there's a blog called Practical Business Python where the author talks a lot about data visualization and
how to do useful stuff with Pandas and it's extremely useful when you're trying to learn this stuff.&lt;/p&gt;
&lt;p&gt;Also sponsors are good. We like them. If you are looking for a job I'm sure some of these people are hiring, and we're
very grateful to have them here. I would be remiss not to thank my employer (Indeed) because they paid for me to be here
so thanks for that. Other than that I hope that you've gotten something out of today and would love to meet you all
after. If you have any feedback: if it's negative please email me; I would love to hear what I can do to make this
better. If you have any positive feedback please tweet it. &lt;/p&gt;
&lt;p&gt;I hope you've gotten something out of today and are better able to go solve your problems.&lt;/p&gt;</description><author>Samuel Taylor</author><guid isPermaLink="true">https://www.samueltaylor.org/articles/use-machine-learning-to-find-your-next-job.html?utm_source=rss&amp;utm_campaign=st-blog-feed</guid><pubDate>Sun, 12 Nov 2017 01:42:33 GMT</pubDate></item><item><title>To become a great Python developer, quit reading Python books</title><link>https://www.samueltaylor.org/articles/quit-reading-python-books.html?utm_source=rss&amp;utm_campaign=st-blog-feed</link><description>&lt;p&gt;Learning is a crucial part of being a software developer. Increases in knowledge and skill can make a significant impact
in our work. But in an age where everyone seems to be in a time crunch, we want our learning to be as effective as
possible. We want to get as much learning out of our books, courses, and videos as we can. Counterintuitively, I believe
one of the best ways to maximize knowledge gain is by reading books that do not apply easily to technologies or concepts
we already understand.&lt;/p&gt;
&lt;p&gt;Over the last year, I've read two books that aren't written in Python (the primary language I use at work): &lt;em&gt;Working
Effectively with Legacy Code&lt;/em&gt;, by Michael Feathers, and &lt;em&gt;Practical Object-Oriented Design in Ruby&lt;/em&gt; by Sandi Metz.
Initially, I was skeptical as to whether I would gain much from them, as specific tactics seemed like they wouldn't be
applicable outside the language the author chose to use for each book. To my surprise, I learned &lt;strong&gt;more&lt;/strong&gt; from these
books precisely because they were not written in the primary language I use.&lt;/p&gt;
&lt;p&gt;I find three reasons for this. The first is that reading unfamiliar languages causes my brain to comprehend the material
better. Because I don't know Ruby at all, I have to put in more work to understand the text. When I read a snippet of
Python code in a book, I'm more likely to skim. But when I come across a bit of C++, my brain has to focus to really
understand how the code works.&lt;/p&gt;
&lt;p&gt;The second reason I think these books had an outsized contribution to my development skills is that they caused me to
think through the way to apply the ideas in a new context. While reading about how to make a well-designed Ruby class
that was amenable to testing, I was thinking about how to apply those lessons in Python. To be clear, I had to choose to
do this. And if you want to maximize your learning, you'll have to do the same. By forcing my brain not merely to
understand the information, but to also apply it in a new context, I learned the material more thoroughly.&lt;/p&gt;
&lt;p&gt;The final reason I think these books were so helpful is that they are great books. Kind of obvious, right? A
well-written book is more helpful than a poorly-written book. When we limit ourselves to just the materials relevant to
our day-to-day work, we miss out on gems written in other languages. Plenty of excellent books that can help us become
better software engineers exist; we shouldn't exclude some simply because of the language their author chose to use for
them.&lt;/p&gt;
&lt;p&gt;Putting ourselves in unfamiliar situations is hard. Because of that difficulty, it also has a huge potential to bring
about learning and growth. Pick up a book in a language you don't know and thoroughly study it–you may be surprised by
how much you learn.&lt;/p&gt;</description><author>Samuel Taylor</author><guid isPermaLink="true">https://www.samueltaylor.org/articles/quit-reading-python-books.html?utm_source=rss&amp;utm_campaign=st-blog-feed</guid><pubDate>Tue, 31 Oct 2017 11:47:23 GMT</pubDate></item><item><title>Build a "function with a memory" in Python</title><link>https://www.samueltaylor.org/articles/function-with-memory.html?utm_source=rss&amp;utm_campaign=st-blog-feed</link><description>&lt;p&gt;Are you familiar with the &lt;code&gt;__call__&lt;/code&gt; method in Python? By defining this method, an instance of your class can be called
as though it were a function. Here's a contrived example solely to demonstrate how it works:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Foo&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="fm"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;call_ct&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
        &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;initialized&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="fm"&gt;__call__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;call_ct&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;call_ct&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;bar&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Foo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;bar&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;baz&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Foo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;baz&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;baz&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Output:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;bar initialized
bar 1
baz initialized
baz 1
bar 2
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A more interesting use case is &lt;a href="https://twitter.com/brandon_rhodes/status/923393090920026114"&gt;given by Brandon Rhodes&lt;/a&gt;,
that of swapping out an &lt;code&gt;http_get(url)&lt;/code&gt; method for an object that caches pages. Say for instance that we are maintaining
a project that includes the following web crawling code:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;urllib.request&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;urllib.error&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HTTPError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;URLError&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;http_get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;urllib&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;urlopen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;getcode&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;utf-8&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_links&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page_content&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;loc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;page_content&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;href&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;loc&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;loc&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;href=&amp;quot;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;quote_char&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;page_content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;page_content&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;quote_char&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;page_content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="n"&gt;loc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;page_content&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;href&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loc&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;crawl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;start_page&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_depth&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;on_page&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;stack&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;start_page&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;stack&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;depth&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;stack&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;http_get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ne"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;HTTPError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;URLError&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;

        &lt;span class="n"&gt;on_page&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;depth&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;max_depth&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;stack&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;depth&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;get_links&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="vm"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;__main__&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;crawl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;https://www.python.org&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_depth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;on_page&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Now say we become unsatisfied with the performance of this code and want to stop getting the same page multiple times.
The standard library provides a &lt;a href="https://docs.python.org/3/library/functools.html?highlight=lru_cache#functools.lru_cache"&gt;caching
mechanism&lt;/a&gt; that we could
decorate our &lt;code&gt;http_get&lt;/code&gt; function with.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;functools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;lru_cache&lt;/span&gt;

&lt;span class="c1"&gt;# ...&lt;/span&gt;

&lt;span class="nd"&gt;@lru_cache&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;http_get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;urllib&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;urlopen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;getcode&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;utf-8&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;But another option is an object that implements &lt;code&gt;__call__(self)&lt;/code&gt;. What might that look like?&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# ...&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CachedHttpGet&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="fm"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="fm"&gt;__call__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;urllib&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;urlopen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;getcode&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;utf-8&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;http_get&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CachedHttpGet&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="c1"&gt;# ...&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
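
&lt;p&gt;One thing the class-based approach makes easy is adding extra methods that share the cached state. As a minimal
sketch (the names &lt;code&gt;CachedCall&lt;/code&gt;, &lt;code&gt;forget&lt;/code&gt;, and &lt;code&gt;fake_fetch&lt;/code&gt; are made up for illustration, and a stand-in function
replaces the real network call so the example runs offline), here is a callable object that can also drop a single
stale entry, whereas &lt;code&gt;lru_cache&lt;/code&gt; only offers &lt;code&gt;cache_clear()&lt;/code&gt; for the whole cache:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;class CachedCall:
    """Wrap a one-argument function and remember its results."""

    def __init__(self, func):
        self.func = func
        self.cache = {}
        self.fetch_ct = 0  # how many times the wrapped function really ran

    def __call__(self, url):
        if url not in self.cache:
            self.fetch_ct += 1
            self.cache[url] = self.func(url)
        return self.cache[url]

    def forget(self, url):
        # Drop one entry so the next call fetches it again.
        self.cache.pop(url, None)


def fake_fetch(url):
    # Hypothetical stand-in for the real http_get; no network involved.
    return 200, 'content of ' + url


http_get = CachedCall(fake_fetch)
http_get('https://example.com/a')
http_get('https://example.com/a')  # served from the cache
http_get.forget('https://example.com/a')
http_get('https://example.com/a')  # fetched again
print(http_get.fetch_ct)
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Because the instance is still called like a function, the rest of the crawler would not need to change.&lt;/p&gt;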

&lt;p&gt;While &lt;code&gt;lru_cache&lt;/code&gt; is probably better in this contrived example, I hope this article gives you another tool for your
toolbox. The &lt;a href="https://docs.python.org/3/reference/datamodel.html#emulating-callable-objects"&gt;official docs are here&lt;/a&gt;.
Keep this in mind the next time you're refactoring something; it may be the right choice.&lt;/p&gt;</description><author>Samuel Taylor</author><guid isPermaLink="true">https://www.samueltaylor.org/articles/function-with-memory.html?utm_source=rss&amp;utm_campaign=st-blog-feed</guid><pubDate>Sat, 28 Oct 2017 16:32:48 GMT</pubDate></item><item><title>How to Attend a Tech Conference</title><link>https://www.samueltaylor.org/articles/how-to-attend-tech-conference.html?utm_source=rss&amp;utm_campaign=st-blog-feed</link><description>&lt;p&gt;Conferences can be both fun and valuable, but they can also be a huge waste of time. Here's some things I've learned
that have helped me get the most out of conferences as an attendee.&lt;/p&gt;
&lt;h2&gt;Consider skipping talks&lt;/h2&gt;
&lt;p&gt;This advice sounds crazy, I know; the whole point of conferences is to hear the speakers, right? I'm not convinced.  The
most valuable experiences I've had at conferences are centered around the people, not the talks.&lt;/p&gt;
&lt;p&gt;If sessions are recorded, you should go to very few sessions at all. Instead, ask people about what sessions they've
found particularly enlightening, then organize "lunch and learns" with your coworkers to watch those presentations.
While watching them, feel free to pause them and discuss. This discussion helps everyone involved more thoroughly
understand the material and how it may be applicable. Ironically, by skipping the talks, you often get more out of them.&lt;/p&gt;
&lt;h2&gt;Write in a pocket notebook&lt;/h2&gt;
&lt;p&gt;While I use my phone to keep track of lots of information, I find that at conferences nothing beats the simplicity of a
physical, paper notebook. The benefits are numerous. A notebook isn't going to flash a distracting notification on your
screen while you try to write down someone's email address or take notes on a talk. You'll look more put together and
less rude when writing something down on paper vs. on your phone. And I find that writing by hand forces my brain to be
more concise, a huge plus when you review these notes later.&lt;/p&gt;
&lt;p&gt;Aside from notes on any sessions you do end up attending, a notebook is also useful for keeping track of memorable
quotes, people's contact information, restaurant/activity recommendations, or (really) anything else.&lt;/p&gt;
&lt;p&gt;If you want to be trendy, lots of people like &lt;a href="https://fieldnotesbrand.com/products/original-kraft"&gt;Field Notes&lt;/a&gt;, but a
trip to your dollar store may yield something similarly valuable at a lower cost.&lt;/p&gt;
&lt;h2&gt;Create a followup plan&lt;/h2&gt;
&lt;p&gt;While pen and paper are excellent tools for capturing information, I have a different system for keeping track of my
todo list. Mine happens to be electronic and centered around Trello, but the ideas here can be applied to your personal
system.&lt;/p&gt;
&lt;p&gt;During the day, I gather a stack of business cards and contact information from interesting people. Then, when I have
spare time during/after the event, I take pictures of this information and add them all to a single "followup" card in
Trello. I make note of a relevant detail or two from each person as well. A few weeks later, I like to email these
people to follow up.&lt;/p&gt;
&lt;p&gt;If you don't have a plan, you're planning to fail.&lt;/p&gt;
&lt;h2&gt;Bring business cards&lt;/h2&gt;
&lt;p&gt;Handing someone a business card is often faster than writing down your email or having them type in your Twitter handle.
I prefer to use personal business cards (rather than those from my employer) because I want people to connect with &lt;em&gt;me&lt;/em&gt;,
not my company.&lt;/p&gt;
&lt;p&gt;I used &lt;a href="https://www.canva.com/"&gt;Canva&lt;/a&gt; to design and print my cards. Whoever you use, buy the smallest amount you can.
The per-unit price difference can make it tempting to buy 1000 cards, but you realistically aren't going to give out
1000 business cards before you become unsatisfied with some aspect of them. When you buy a smaller batch, you have the
freedom to change them more regularly. The relatively small per-unit cost increase is worth the added flexibility.&lt;/p&gt;
&lt;h2&gt;Track expenses on your phone, as they're happening&lt;/h2&gt;
&lt;p&gt;Keeping track of paper receipts is tedious compared to snapping a quick photo of your receipt. If your company
reimburses for travel/conference expenses, see if there's a mobile app for expense reports. By using it, you'll be less
likely to forget to expense something and you'll have less work to do when you get back from the conference.&lt;/p&gt;
&lt;h2&gt;Consider using Twitter&lt;/h2&gt;
&lt;p&gt;I usually avoid social media, but I created a Twitter account solely to connect with the people I was meeting at
conferences. It seems to be the preferred platform for many tech people, and it feels less stuffy than LinkedIn.&lt;/p&gt;
&lt;h2&gt;Use your calendar&lt;/h2&gt;
&lt;p&gt;Before the conference, look through the schedule and figure out which sessions sound the most interesting. Put each
session's name, speaker's name, and location into your calendar. Even though you shouldn't be going to all of them, you
can still enter
the most interesting talk in each time slot. By doing this, you should be able to avoid wasting time with schedules when
you could be talking to awesome people.&lt;/p&gt;
&lt;p&gt;As an aside, this has become one of my favorite use cases for my smart watch (a &lt;a href="https://www.amazon.com/Garmin-Forerunner-230-Black-White/dp/B016PAPI3W"&gt;Garmin
FR230&lt;/a&gt;). Knowing where you're headed without
having to pull out your phone is super convenient.&lt;/p&gt;</description><author>Samuel Taylor</author><guid isPermaLink="true">https://www.samueltaylor.org/articles/how-to-attend-tech-conference.html?utm_source=rss&amp;utm_campaign=st-blog-feed</guid><pubDate>Mon, 23 Oct 2017 23:33:01 GMT</pubDate></item><item><title>How I Hacked My University's Registration System with Python and Twilio</title><link>https://www.twilio.com/blog/2017/06/hacked-my-universitys-registration-system-python-twilio.html</link><description></description><author>Samuel Taylor</author><guid isPermaLink="true">https://www.twilio.com/blog/2017/06/hacked-my-universitys-registration-system-python-twilio.html</guid><pubDate>Wed, 21 Jun 2017 05:00:00 GMT</pubDate></item><item><title>Speed up your Python-based web scraping</title><link>https://www.samueltaylor.org/articles/speed-up-web-scraping.html?utm_source=rss&amp;utm_campaign=st-blog-feed</link><description>&lt;p&gt;Sometimes when I'm working on a project that involves web scraping, the actual scraping starts to slow me down. If
you've ever re-run a script and then sat for a few minutes while your computer re-scraped the data, you know what I'm
talking about. I've found two simple and practical ways to make this process significantly faster.&lt;/p&gt;
&lt;p&gt;For the sake of example, say we're crawling two links deep on the front page of the New York Times. A straightforward
way of doing this is:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import requests
from bs4 import BeautifulSoup

def get_links(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    return {e.get('href') for e in soup.find_all('a')
            if e.get('href') and e.get('href').startswith('https')}

links = get_links('https://www.nytimes.com')

all_links = set()
for link in links:
    all_links |= get_links(link)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;On my machine/internet, this took about 103 seconds. We can do better than that!&lt;/p&gt;
&lt;h2&gt;Use &lt;code&gt;multiprocessing&lt;/code&gt;&lt;/h2&gt;
&lt;p&gt;Python's &lt;code&gt;multiprocessing&lt;/code&gt; module can help speed up I/O-bound tasks like web scraping. Our case here is a good example
because we don't need to scrape the links one at a time; we can fetch them in parallel. The first step here is to convert our
code to use the built-in &lt;code&gt;map&lt;/code&gt; function:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import itertools as it
# import requests
# ...
# links = get_links('https://www.nytimes.com')

links_on_pages = map(get_links, links)
all_links = set(it.chain.from_iterable(links_on_pages))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;On my machine, this ran in a similar amount of time to the original example. From there, using &lt;code&gt;multiprocessing&lt;/code&gt; is a
quick change:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import multiprocessing
# import itertools as it
# ...
# links = get_links('https://www.nytimes.com')

with multiprocessing.Pool() as p:
    links_on_pages = p.map(get_links, links)
# all_links = ...
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This example ran in about 25 seconds (~24% of the original time). The speed-up happens because Python spins up four
worker processes[0] that go through &lt;code&gt;links&lt;/code&gt; and run &lt;code&gt;get_links&lt;/code&gt; on each element. You can tweak the number of processes
that are spawned to get even faster wall-clock times. For example, by using 8 worker processes, the script took 16
seconds instead of 25.  This won't scale infinitely, but it can be a simple and effective way to speed things up in
cases where your code doesn't have to be entirely serial.&lt;/p&gt;
&lt;h2&gt;Cache to disk&lt;/h2&gt;
&lt;p&gt;One common use case I have for scraped data is to analyze it in a Jupyter notebook. I have a habit of using the "Restart
kernel and run all" option to re-run my whole notebook, but that means the scraping has to run again. I often don't want
to wait a few minutes for my computer to do something it already did 10 minutes ago. In cases like this, I've found
caching the results of my scraping to disk to be a useful way to avoid re-doing work.&lt;/p&gt;
&lt;p&gt;As a first step, let's move our existing code into a function:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def get_links_2_deep(url):
    links = get_links(url)
    with multiprocessing.Pool(8) as p:
        links_on_pages = p.map(get_links, links)
    return set(it.chain.from_iterable(links_on_pages))

print(len(get_links_2_deep('https://www.nytimes.com')))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can extend our code to cache the result of this function to disk by writing a decorator.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import pickle

def cache_to_disk(func):
    def wrapper(*args):
        # One cache file per function name and argument tuple
        cache = '.{}{}.pkl'.format(func.__name__, args).replace('/', '_')
        try:
            with open(cache, 'rb') as f:
                return pickle.load(f)
        except IOError:
            # No cache file yet: compute the result and save it for next time
            result = func(*args)
            with open(cache, 'wb') as f:
                pickle.dump(result, f)
            return result

    return wrapper
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now let's use the decorator:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@cache_to_disk
def get_links_2_deep(url):
#    links = ...
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After the first time we run this script, it's able to load the cached result, which takes around a quarter of a second.
I find this useful while I'm writing and developing some analysis code, but I have to be mindful that to get the most
up-to-date results, I need to delete the &lt;code&gt;.pkl&lt;/code&gt; file that this is using as its cache. I happily take this tradeoff, and
if this technique fits your use case, you should too!&lt;/p&gt;
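&lt;p&gt;To make clearing the cache less error-prone, here's a minimal sketch that rebuilds the file name the decorator uses and deletes that file. The helpers &lt;code&gt;cache_path&lt;/code&gt; and &lt;code&gt;clear_cache&lt;/code&gt; are hypothetical names introduced for illustration, and the sketch assumes you run it from the same working directory the cache was written to.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import os

def cache_path(func_name, args):
    # Mirrors the file name built inside cache_to_disk's wrapper
    return '.{}{}.pkl'.format(func_name, args).replace('/', '_')

def clear_cache(func_name, args):
    # Delete the cached pickle if present; returns True when a file was removed
    path = cache_path(func_name, args)
    if os.path.exists(path):
        os.remove(path)
        return True
    return False

print(cache_path('get_links_2_deep', ('https://www.nytimes.com',)))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Calling &lt;code&gt;clear_cache('get_links_2_deep', ('https://www.nytimes.com',))&lt;/code&gt; before a re-run should force a fresh scrape the next time the decorated function is called.&lt;/p&gt;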
&lt;hr /&gt;
&lt;p&gt;0: I say four here because my computer has four cores. When no arguments are passed to the &lt;code&gt;Pool()&lt;/code&gt; constructor, Python
sets the number of processes in the pool to the result of &lt;code&gt;os.cpu_count()&lt;/code&gt;
(&lt;a href="https://docs.python.org/3/library/multiprocessing.html#multiprocessing.pool.Pool"&gt;docs&lt;/a&gt;).&lt;/p&gt;</description><author>Samuel Taylor</author><guid isPermaLink="true">https://www.samueltaylor.org/articles/speed-up-web-scraping.html?utm_source=rss&amp;utm_campaign=st-blog-feed</guid><pubDate>Sun, 07 May 2017 22:32:17 GMT</pubDate></item><item><title>Income inequality in professional sports</title><link>https://www.samueltaylor.org/articles/sport-salary-inequality.html?utm_source=rss&amp;utm_campaign=st-blog-feed</link><description>&lt;p&gt;On his podcast &lt;a href="http://revisionisthistory.com/"&gt;Revisionist History&lt;/a&gt;, Malcolm Gladwell talks about the difference
between "weak link" and "strong link" sports. A "weak link" sport is one in which the worst player's skill level has a
large impact on the team's success. The example Gladwell gives is soccer, in which a long chain of events must go
perfectly to score a point. To get the ball from one side of the field to a position in which a team can make a
successful shot on goal requires a lot of dribbling and passing. Every time the ball is passed is an opportunity for the
opposing team to break the chain, requiring the attacking team to start the chain from the beginning.&lt;/p&gt;
&lt;p&gt;By contrast, basketball is a "strong link" sport [0]. In such a sport, the best player's skill level (rather than the
worst) has a large impact on the team's success. A superstar in basketball can take a team to the playoffs almost
entirely on his own.&lt;/p&gt;
&lt;p&gt;If this "strong link"/"weak link" hypothesis is true and players are compensated proportionally to their contribution to
the team's overall success[1], I would expect income inequality to be greater in basketball than in soccer. At this
point, I went looking for data.&lt;/p&gt;
&lt;p&gt;After some searching, I found &lt;a href="http://www.spotrac.com/"&gt;Spotrac&lt;/a&gt;, which has salary data for the NFL, NBA, MLB, NHL, and
MLS. After scraping the site, I had a decent dataset of salaries. First, I looked at a histogram of the salaries:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(16, 10))
axes[1, 2].set_visible(False)

for ndx, league in enumerate(df['League'].unique()):
    league_df = df[df.League == league]
    league_df.plot(kind='hist', ax=axes[ndx % 2, ndx % 3],
                   title='{} salary distribution ({} players)'.format(league, league_df.shape[0]))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src="/static/img/salary_dist.png" alt="Professional sport league salary distribution"&gt;&lt;/img&gt;&lt;/p&gt;
&lt;p&gt;Standard deviation quantifies the variation in the distribution, but comparing the standard deviations across leagues
doesn't make sense because the mean salary in each league is so different. By dividing the standard deviation by the
mean, we get the &lt;a href="https://en.wikipedia.org/wiki/Coefficient_of_variation"&gt;coefficient of variation&lt;/a&gt;.&lt;/p&gt;
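&lt;p&gt;To see why dividing by the mean helps, here's a toy sketch with made-up salaries (not the Spotrac data). The function name &lt;code&gt;coefficient_of_variation&lt;/code&gt; and both lists are assumptions for illustration. The second "league" pays 100 times more but has the same relative spread.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import statistics

def coefficient_of_variation(values):
    # Sample standard deviation divided by the mean
    return statistics.stdev(values) / statistics.mean(values)

# Made-up salaries: league_b is league_a scaled up by a factor of 100
league_a = [40000, 50000, 60000]
league_b = [4000000, 5000000, 6000000]

# The raw standard deviations differ by a factor of 100
print(statistics.stdev(league_a), statistics.stdev(league_b))
# The coefficients of variation are the same for both lists
print(coefficient_of_variation(league_a), coefficient_of_variation(league_b))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The standard deviations are incomparable, but the ratio is identical for both lists. On the real data, the same ratio per league looks like this:&lt;/p&gt;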
&lt;pre&gt;&lt;code&gt;aggregates = df.groupby('League').agg([len, np.mean, np.std, np.median])['Base Salary']
cv = (aggregates['std'] / aggregates['mean'])
cv.sort_values().plot(kind='bar', title='std as percent of mean')
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src="/static/img/salary_std_over_mean.png" alt="Professional sport league salary standard deviation over mean"&gt;&lt;/img&gt;&lt;/p&gt;
&lt;p&gt;This tells us that MLS salaries vary most widely and NHL salaries vary the least. Digging deeper into the MLS, Bastian
Schweinsteiger is making $5,400,000 in base salary, with the next highest salary being Tim Howard's $2,000,000. Removing
just Schweinsteiger would leave the MLS with a CV of around 1.26, which is higher than the NBA's, but lower than the
MLB's.&lt;/p&gt;
&lt;p&gt;What have we learned? In terms of income inequality in the American professional sports leagues, soccer actually has the
most income inequality, and the NBA has the second-to-least. I think the reason that my initial hypothesis is incorrect
is twofold: (1) player contribution to team success is not the only factor in compensation and (2) teams don't
universally believe that basketball and soccer are strong and weak link sports (respectively).&lt;/p&gt;
&lt;p&gt;I would be curious to see how these results change if we're looking solely at starters (rather than entire rosters), but
that'll have to be another question for another day.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;0: Daniel Forsyth provides an interesting &lt;a href="http://www.danielforsyth.me/is-basketball-a-weakest-link-sport/"&gt;analysis of this
claim&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;1: They aren't, at least, they aren't solely compensated according to this factor. The team owner also gets value out of
selling jerseys and other merchandise, which is easier to do for more famous players.&lt;/p&gt;</description><author>Samuel Taylor</author><guid isPermaLink="true">https://www.samueltaylor.org/articles/sport-salary-inequality.html?utm_source=rss&amp;utm_campaign=st-blog-feed</guid><pubDate>Sun, 30 Apr 2017 22:03:47 GMT</pubDate></item><item><title>Similarities between cooking and coding</title><link>https://web.archive.org/web/20200128213245/http://square-root.com/2017/02/cooking-and-coding/</link><description></description><author>Samuel Taylor</author><guid isPermaLink="true">https://web.archive.org/web/20200128213245/http://square-root.com/2017/02/cooking-and-coding/</guid><pubDate>Fri, 24 Feb 2017 06:00:00 GMT</pubDate></item><item><title>Host all your projects on one machine with Docker</title><link>https://www.samueltaylor.org/articles/docker-for-projects.html?utm_source=rss&amp;utm_campaign=st-blog-feed</link><description>&lt;p&gt;Like many digital builders/hackers/makers, I have &lt;a href="../projects/index.html"&gt;several
projects&lt;/a&gt; that I want to put on the internet. I don't,
however, want to have loads of servers to manage. Given that the highest-traffic
project I maintain only gets around 100 unique visitors a day, it's feasible for me
to host them all on the same node.&lt;/p&gt;
&lt;p&gt;I already deploy these things in Docker containers, so I imagined I could use
nginx to solve this problem. It would work, I imagined, by having a single
Docker host running an nginx container that would reverse proxy requests to
other containers (which would be running the aforementioned projects). When I
started researching how to do this, I found a great project that made it super
easy.&lt;/p&gt;
&lt;p&gt;Jason Wilder has a &lt;a href="http://jasonwilder.com/blog/2014/03/25/automated-nginx-reverse-proxy-for-docker/"&gt;great
post&lt;/a&gt;
that you should read if you want more detail, but the gist is that
&lt;a href="https://hub.docker.com/r/jwilder/nginx-proxy/"&gt;jwilder/nginx-proxy&lt;/a&gt; is an nginx
container that proxies to other containers. It automatically configures nginx
based on the &lt;code&gt;EXPOSE&lt;/code&gt;d ports of the containers you're running, which is almost
magical.&lt;/p&gt;
&lt;p&gt;Start it up like this:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;docker run -d -p 80:80 -v /var/run/docker.sock:/tmp/docker.sock -t jwilder/nginx-proxy&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;You can make this even cooler with a bit of fiddling with your DNS records. Add
an A record that points to the Docker host, then add a wildcard CNAME that
points to the hostname you set up in the A record (see screenshot below for how I
have it set up).&lt;/p&gt;
&lt;p&gt;&lt;img src="/static/img/dns_config.png" alt="DNS configuration"&gt;&lt;/img&gt;&lt;/p&gt;
&lt;p&gt;(where "1.2.3.4" is the IP of your Docker host)&lt;/p&gt;
&lt;p&gt;Now, you can start up a container with a &lt;code&gt;VIRTUAL_HOST&lt;/code&gt; environment variable
that is in that subdomain:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;docker run -e 'VIRTUAL_HOST=rss.project.samueltaylor.org' -tid ssaamm/rss_filter&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;When you navigate to this URL, your browser will ask your nginx container for
the site at that URL, and that container will know to reverse proxy the request
to the container you just started. Nifty!&lt;/p&gt;
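&lt;p&gt;If the routing step feels opaque, here's a toy Python model of the idea: look at the request's &lt;code&gt;Host&lt;/code&gt; header and pick the matching container. This is only a sketch of the concept, not how nginx-proxy is implemented (it generates nginx configuration for you). The &lt;code&gt;upstreams&lt;/code&gt; table, the container address, and &lt;code&gt;pick_upstream&lt;/code&gt; are all made up for illustration.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Hypothetical routing table: maps a VIRTUAL_HOST value to a container address
upstreams = {
    'rss.project.samueltaylor.org': '172.17.0.2:80',
}

def pick_upstream(host_header, table):
    # Drop an optional port and normalise case before the lookup
    hostname = host_header.split(':')[0].lower()
    # None stands in for the error response a real proxy sends for unknown hosts
    return table.get(hostname)

print(pick_upstream('rss.project.samueltaylor.org', upstreams))
print(pick_upstream('unknown.project.samueltaylor.org', upstreams))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The first lookup finds the container address and the second finds nothing, which is roughly the decision the proxy makes for every incoming request.&lt;/p&gt;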
&lt;hr /&gt;
&lt;p&gt;Please reach out if you have any questions or want to get in touch! My email
address is sgt at this domain.&lt;/p&gt;</description><author>Samuel Taylor</author><guid isPermaLink="true">https://www.samueltaylor.org/articles/docker-for-projects.html?utm_source=rss&amp;utm_campaign=st-blog-feed</guid><pubDate>Sat, 18 Feb 2017 14:07:29 GMT</pubDate></item><item><title>The Last 5 Books I Read (as of March 2017)</title><link>https://www.samueltaylor.org/articles/books-mar17.html?utm_source=rss&amp;utm_campaign=st-blog-feed</link><description>&lt;h2&gt;The Braindead Megaphone, George Saunders&lt;/h2&gt;
&lt;p&gt;This collection of nonfiction essays was enjoyable and thought-provoking.
Saunders writes about the Iraq war, illegal immigration, and Dubai. He presents
his experiences in a very relatable way.&lt;/p&gt;
&lt;h2&gt;The Crying of Lot 49, Thomas Pynchon&lt;/h2&gt;
&lt;p&gt;I didn't like this one. I put it down about halfway through; I just couldn't get
into it. I didn't find anything about it intriguing in the slightest.&lt;/p&gt;</description><author>Samuel Taylor</author><guid isPermaLink="true">https://www.samueltaylor.org/articles/books-mar17.html?utm_source=rss&amp;utm_campaign=st-blog-feed</guid><pubDate>Tue, 10 Jan 2017 23:31:12 GMT</pubDate></item><item><title>Wherever you are, be all there</title><link>https://www.samueltaylor.org/articles/be-all-there.html?utm_source=rss&amp;utm_campaign=st-blog-feed</link><description>&lt;p&gt;Being present (paying attention to the here and now) and purposeful (taking
actions to achieve some end) helps us to live more fulfilling lives. Jim Elliot
once said, "wherever you are, be all there." While it is often painted on
flowery backdrops and put on the walls of suburban homes, this quote holds
useful advice for productivity and professional growth.&lt;/p&gt;
&lt;p&gt;If you're doing something, you should apply all your efforts to that thing, and
avoid distractions or other tasks. Turn off your phone, close your chat program,
and just work. When your brain encounters a difficult task like learning a new
skill or solving a challenging problem, it tries to avoid spending the energy on
that hard thing. Instead, you'll suddenly feel the urge to check your text
messages or open up Twitter.  Your brain might start reminding you about the
fact that you need to schedule a doctor's appointment or get your car washed.
You've got to fight these urges, or you'll get derailed from the task at hand
and become less productive. You will be more productive if you focus on and
complete your current task and then apply all your focus to (for instance)
scheduling a doctor's appointment than you would if you try to do both at once.
To help avoid your brain's weakling pleas for relief from the mental workout,
many people find it useful to set a timer for a set period of working on a
specific task (see for example &lt;a href="https://en.wikipedia.org/wiki/Pomodoro_Technique"&gt;the Pomodoro
technique&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;This quote is also great career development advice. Wherever you find yourself
professionally, you should use the resources at your disposal to your maximum
advantage. Find the truly great people at your company and try to learn from
them. Take them to coffee and ask them how they are so effective. Ask them to
critique your work. Over the course of your daily interactions, observe how
they handle situations you would be uncomfortable in. Beyond the people at your
company, try to get yourself assigned to projects that stretch your abilities.&lt;/p&gt;
&lt;p&gt;Life is more rewarding when we live purposefully in the present. By focusing on
one task at a time and making the most of our resources, we can become better
people and find more fulfillment.&lt;/p&gt;</description><author>Samuel Taylor</author><guid isPermaLink="true">https://www.samueltaylor.org/articles/be-all-there.html?utm_source=rss&amp;utm_campaign=st-blog-feed</guid><pubDate>Sun, 13 Nov 2016 21:45:23 GMT</pubDate></item><item><title>Experiments with Self-Tracking/Quantified Self</title><link>https://www.samueltaylor.org/articles/quantified-self.html?utm_source=rss&amp;utm_campaign=st-blog-feed</link><description>&lt;h2&gt;How many times in a row do I typically sneeze?&lt;/h2&gt;
&lt;p&gt;Ever since I can remember, I've sneezed atypically. While most people sneeze
once and are done, I often sneeze 3, 4, or even 8 times in a row. I was curious
to see how often I sneeze a certain number of times in a row, so I kept track of
every time I sneezed over the course of a week. Here are the results:&lt;/p&gt;
&lt;p&gt;&lt;img src="/static/img/quantified_self_sneeze_histogram.png" alt="Sneeze histogram"&gt;&lt;/img&gt;&lt;/p&gt;
&lt;p&gt;All this time, I thought it was fairly rare for me to sneeze just once. In reality, I
sneeze a single time with decent frequency, and I sneeze twice very
infrequently.&lt;/p&gt;
&lt;h2&gt;How long does it take to get to work?&lt;/h2&gt;
&lt;p&gt;For a little while, I tracked my commute time with a DO Button that logged the
time and my location to a spreadsheet. After &lt;a href="https://github.com/ssaamm/personal-etl/blob/d96b546f62cb36f865b938acef5de71e069499dc/commute_time.py"&gt;a little cleaning&lt;/a&gt;,
the data looks like this:&lt;/p&gt;
&lt;div id="commute-time"&gt;&lt;/div&gt;

&lt;script&gt;
  d3.csv('../static/data/commute_time.csv', function(data) {
    MG.data_graphic({
      data: data.map(function transform(point) {
        return {
          'start_time': new Date(point.start_time),
          'duration': point.duration / 60,
          'destination': point.destination == 'WORK' ? 'Work' :
            point.destination == 'HOME' ? 'Home' : 'Unknown',
        };
      }),
      //full_width: true,
      //height: 400,
      //right: 40,
      chart_type: 'point',
      target: '#commute-time',
      x_accessor: 'start_time',
      y_accessor: 'duration',
      y_label: 'Duration (min)',
      color_accessor: 'destination',
      color_type: 'category',
      area: false,
      interpolate: 'linear',
    });
  });
&lt;/script&gt;</description><author>Samuel Taylor</author><guid isPermaLink="true">https://www.samueltaylor.org/articles/quantified-self.html?utm_source=rss&amp;utm_campaign=st-blog-feed</guid><pubDate>Tue, 08 Nov 2016 17:35:48 GMT</pubDate></item><item><title>Writing Better Code: Code as Communication</title><link>https://www.samueltaylor.org/articles/better-code-code-as-communication.html?utm_source=rss&amp;utm_campaign=st-blog-feed</link><description>&lt;p&gt;After we compile our code, why do we keep it? The machine is able to interpret
the compiled code and perform the function we wanted it to, so why keep the
original code around? Often, the answer to this question is that we may need to
modify the program in the future. Modifying a bunch of 1's and 0's in a compiled
file would be very costly in terms of developer time. We keep the original code
because we can maintain it more easily than the compiled code.&lt;/p&gt;
&lt;p&gt;Code is read more often than it is written. As such, when we are writing code,
we should keep future developers in mind. Our code is a tool for communicating
with future developers who are attempting to maintain it. I find two worthwhile
ways to make code more communicative.&lt;/p&gt;
&lt;h2&gt;Avoid unnecessary comments&lt;/h2&gt;
&lt;p&gt;Only write comments that explain something which isn't apparent after reading
the code.&lt;/p&gt;
&lt;p&gt;As an example, suppose we're writing an application that interacts with a web
API to get the current weather. We might write something like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;r = requests.get('http://weather.example.com/currentweather?zip=76706')
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A bad comment for this code might look like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# make a GET request to the weather API for a given ZIP code
r = requests.get('http://weather.example.com/currentweather?zip=76706')
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This comment is unnecessary because it's more or less restating what the code
says. Any developer reading this code sans the comment will be able to surmise
that it makes a GET request to some weather API for some ZIP code. Because
developers can figure this out without the comment, the comment is unnecessary.&lt;/p&gt;
&lt;p&gt;Unnecessary comments are unnecessary; who would've guessed? More interestingly,
I would argue that they can be harmful. In the future, the API and our use of it
may change in a number of ways. Suppose we want to look up weather with a
latitude/longitude coordinate or that the developers of the API require us to
make a POST request. We open our code back up, find the line where we look up
the weather, and modify it:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# make a GET request to the weather API for a given ZIP code
r = requests.post('http://weather.example.com/currentweather?lat=31.5491667&amp;amp;lon=-97.1463889')
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Oops! In our haste to update our code, we forgot to update the comment above it,
which is now an incorrect description of the code below it. Our program
continues to work just fine (the compiler or interpreter doesn't care about the
comments), but we've left a confusing artifact for the next developer to find.
Unnecessary comments are harmful because they can become out of sync with the
code they are describing, creating confusion and slowing down developers.&lt;/p&gt;
&lt;p&gt;A better comment might look like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# get current weather in Waco
r = requests.get('http://weather.example.com/currentweather?zip=76706')
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This comment is better because it's speaking to the reasoning for the code below
it. As a rule of thumb, comments should describe the &lt;em&gt;why&lt;/em&gt; rather than the
&lt;em&gt;what&lt;/em&gt;. Because this comment describes the &lt;em&gt;why&lt;/em&gt; rather than the &lt;em&gt;what&lt;/em&gt;, it is
still true when we modify our code as we did above--we're still getting the
current weather in Waco.&lt;/p&gt;
&lt;p&gt;Still, I don't love this comment. It's probably unnecessary, as a developer can
probably figure out that we're getting the weather (though they may not know
Waco's ZIP code). At the same time, we may not want future developers to
have to take the time to read this code thoroughly; that's why we wrote a comment
in the first place! If our intent is to create a program that is easily read and
understood, I think there are often better tools than comments.&lt;/p&gt;
&lt;h2&gt;Write self-documenting code&lt;/h2&gt;
&lt;p&gt;Name and create the constructs of your programs in such a way as to be easily
read and understood by future maintainers.&lt;/p&gt;
&lt;p&gt;In our previous example, we were getting the current weather in a given place.
Because we didn't want a future developer to have to read our code in order to
understand it, we wrote a comment explaining what it did. Alternatively, we
could have created a well-named function:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def get_current_weather(zip_code):
    return requests.get(f'http://weather.example.com/currentweather?zip={zip_code}')
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Using this function might look like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;get_current_weather(zip_code=76706)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Because the function's behavior is easy to infer from its name, maintainers
don't have to read the definition of &lt;code&gt;get_current_weather&lt;/code&gt; to understand what it
does (though they can easily choose to). Further, changes to the function can be
enforced by the interpreter. Suppose we modify this function to take a
latitude/longitude coordinate:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def get_current_weather(lat, lon):
    return requests.get(f'http://weather.example.com/currentweather?lat={lat}&amp;amp;lon={lon}')
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now, if we try to run our program without updating our calls to that function,
the interpreter will tell us:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;TypeError: get_current_weather() got an unexpected keyword argument 'zip_code'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;By creating a well-named function, we not only improved our program's
readability, we also made it harder for maintainers to break our program
unintentionally.&lt;/p&gt;
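&lt;p&gt;If you want to try this without a real weather API, here's a minimal sketch. The function body is a stand-in I made up (it returns the query parameters instead of calling &lt;code&gt;requests.get&lt;/code&gt;), so only the signature matches the example above.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def get_current_weather(lat, lon):
    # Stand-in body: returns the query parameters instead of making a request
    return {'lat': lat, 'lon': lon}

try:
    # An old call site that still passes the removed zip_code argument
    get_current_weather(zip_code=76706)
except TypeError as err:
    message = str(err)

print(message)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The stale call fails as soon as it runs, rather than quietly asking for the wrong thing the way an out-of-date comment would.&lt;/p&gt;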
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;You should be thoughtful about the code you write because the marginal cost of
being a bit more thoughtful while writing the code is less than the cost of the
additional time future developers will have to spend in order to read your code.&lt;/p&gt;</description><author>Samuel Taylor</author><guid isPermaLink="true">https://www.samueltaylor.org/articles/better-code-code-as-communication.html?utm_source=rss&amp;utm_campaign=st-blog-feed</guid><pubDate>Sat, 29 Oct 2016 15:17:05 GMT</pubDate></item><item><title>k-Nearest Neighbors</title><link>https://www.samueltaylor.org/articles/2016-10-18_knn.html?utm_source=rss&amp;utm_campaign=st-blog-feed</link><description>&lt;p&gt;On 18 Oct 2016, I gave a talk at Austin ACM SIGKDD on the &lt;em&gt;k&lt;/em&gt;-nearest neighbors
algorithm. Topics included some machine learning theory (approximation vs.
generalization, VC dimension), the algorithm itself, proving the algorithm's
performance, and some practical concerns around choosing &lt;em&gt;k&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;Some other topics that I would probably include next time are similarity
functions, high-dimensional spaces, and categorical data.&lt;/p&gt;
&lt;p&gt;You can find the slides &lt;a href="/static/pdf/2016-10-18_knn.pdf"&gt;here&lt;/a&gt;. Note that my
presentation probably won't make a ton of sense from these slides, as they were
mostly aids to the words I was saying out loud. If you've got any questions,
feel free to email me; I'd love to chat!&lt;/p&gt;
&lt;p&gt;Thanks to everyone who came to watch; I appreciated hearing your feedback!&lt;/p&gt;</description><author>Samuel Taylor</author><guid isPermaLink="true">https://www.samueltaylor.org/articles/2016-10-18_knn.html?utm_source=rss&amp;utm_campaign=st-blog-feed</guid><pubDate>Thu, 27 Oct 2016 13:29:58 GMT</pubDate></item><item><title>Python Puzzlers</title><link>https://www.samueltaylor.org/articles/python-puzzlers.html?utm_source=rss&amp;utm_campaign=st-blog-feed</link><description>&lt;h2&gt;Default arguments&lt;/h2&gt;
&lt;p&gt;What is the output of this code?&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def foo(arg=[]):
    return arg

my_list = foo()
my_list.extend('abc')

print(foo())
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;My first intuition was that this would output the empty list (&lt;code&gt;[]&lt;/code&gt;). However, the
output is:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;['a', 'b', 'c']
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Why does this happen? Well, Python evaluates default arguments when the function
is defined, rather than when it's run. As a consequence, if a default argument
is mutable and is mutated in one function call, future function calls will be
working with the mutated argument.&lt;/p&gt;
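&lt;p&gt;A common way to avoid this surprise is to default to &lt;code&gt;None&lt;/code&gt; and build the list inside the function. Here's a minimal sketch; &lt;code&gt;foo_fixed&lt;/code&gt; is a name I'm introducing for illustration.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def foo_fixed(arg=None):
    # A fresh list is created on every call that omits arg
    if arg is None:
        arg = []
    return arg

my_list = foo_fixed()
my_list.extend('abc')

print(foo_fixed())
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This version prints &lt;code&gt;[]&lt;/code&gt;, because mutating &lt;code&gt;my_list&lt;/code&gt; no longer touches anything shared between calls.&lt;/p&gt;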
&lt;h3&gt;Check your understanding&lt;/h3&gt;
&lt;p&gt;Now that you know a bit more about default arguments, what is the output of this
code?&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def bar(arg=[]):
    arg.append('a')
    return arg

bar()
print(bar())
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you answered:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;['a', 'a']
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;then you're right!&lt;/p&gt;</description><author>Samuel Taylor</author><guid isPermaLink="true">https://www.samueltaylor.org/articles/python-puzzlers.html?utm_source=rss&amp;utm_campaign=st-blog-feed</guid><pubDate>Thu, 29 Sep 2016 00:32:46 GMT</pubDate></item><item><title>The Last 5 Books I Read (October 2016)</title><link>https://www.samueltaylor.org/articles/books-october16.html?utm_source=rss&amp;utm_campaign=st-blog-feed</link><description>&lt;h2&gt;Station Eleven, Emily St. John Mandel&lt;/h2&gt;
&lt;p&gt;I picked up this book purely for entertainment, and it served that purpose well.
The pacing is quick (at times excessively so), and the events are interesting. I
found nothing about it revolutionary, but it was well-written. In particular, I
liked the dialogue because it felt real (perhaps accentuated by the fact that I
read the book aloud).&lt;/p&gt;
&lt;p&gt;As events unfold in a few different timeframes, we see some characters and
events in a new light. While many of these unveilings worked well, they
sometimes felt like they were explaining too much, taking away some of the fun
of piecing together the story. Following these threads of story through various
timeframes is fun, but feels pointless sometimes. For instance, the author
traces a paperweight's journey which I didn't find engaging.&lt;/p&gt;
&lt;p&gt;The characterization left something to be desired. Only once did I feel like I
understood a character at a non-surface level (the description of Miranda's
thoughts on clothes being "armor" gives insight into her post-divorce
life/feelings). Still, while I didn't end up caring very much about any of the
characters in particular, I kept reading because the story was interesting.&lt;/p&gt;
&lt;p&gt;This book seems like one that many will like, but few will choose as their
favorite. That's completely fine; not every book can be a total masterpiece. Go
into it looking to be entertained, and you probably will be.&lt;/p&gt;
&lt;h2&gt;So Good They Can't Ignore You, Cal Newport&lt;/h2&gt;
&lt;p&gt;I loved &lt;em&gt;Deep Work&lt;/em&gt;, so I decided to read this one, too. The book is about
creating a fulfilling career, which seems appropriate for a new college grad.
His premise seems solid to me--"follow your passion" is terrible advice. I
appreciate his cynicism toward lifestyle bloggers (it seems the only ones making a
living are the ones selling lifestyle blogging to people who hate their jobs).&lt;/p&gt;
&lt;p&gt;Newport tells the stories of several individuals well. As in all self-help books,
these stories keep the book moving while demonstrating a point relevant to the
larger topic.&lt;/p&gt;
&lt;p&gt;My least favorite part of the book was that the author sometimes comes off as
overly cocky, which is annoying at best.&lt;/p&gt;
&lt;p&gt;Like lots of books in this genre, it got a little repetitive; he summarizes
himself over and over again.&lt;/p&gt;
&lt;p&gt;Aside from these two criticisms, though, I really liked the book. I seem to have
lucked out in that I'm fascinated with computer science/software development,
and the market seems to also like it. Still, the career advice seems well
thought out and will be useful in the coming years.&lt;/p&gt;
&lt;h2&gt;You Can't Win, Jack Black&lt;/h2&gt;
&lt;p&gt;Originally published as a series of newspaper articles, this book is the
autobiography of Jack Black, a rail-riding, jewel-thieving hobo of the late 19th
and early 20th centuries. He recounts many tales including prison
sentences, hobo rituals, and his most interesting crimes. The glimpse Black
offers into a very specific subculture is fascinating. If you're interested in
reading a collection of true, interesting tales about a life on the road,
consider picking this book up.&lt;/p&gt;
&lt;h2&gt;Pastoralia, George Saunders&lt;/h2&gt;
&lt;p&gt;This collection of short stories enjoys giving its readers a look into the
internal monologues of its characters. The titular story was gripping, forcing
its reader to piece together the world from hints in the text.&lt;/p&gt;
&lt;p&gt;While the book is a collection of short stories, they are all drawn from the
same universe--an exaggerated version of America. I recently finished
watching the series &lt;em&gt;Black Mirror&lt;/em&gt;, and reading this book reminded me a bit of
that series. Though there's less technology in this book, it still feels like
the author is using a somewhat imagined world to critique our real one.&lt;/p&gt;
&lt;h2&gt;Slaughterhouse-Five, Kurt Vonnegut&lt;/h2&gt;
&lt;p&gt;This book is a unique blend of science fiction and World War 2 tale. Its central
character, Billy Pilgrim, is cast about in many ways by the war. Perhaps
uncoincidentally, he sort of trips through time, which makes for an interesting
literary device.&lt;/p&gt;
&lt;p&gt;This book leaves me feeling very bleak. Pilgrim adopts the belief that
everything that will happen will happen, is happening, and has always been
happening; we are like bugs trapped in the amber of this moment. This belief
takes away hope and meaning. Without these two things, I'm not sure life is
worth living.&lt;/p&gt;</description><author>Samuel Taylor</author><guid isPermaLink="true">https://www.samueltaylor.org/articles/books-october16.html?utm_source=rss&amp;utm_campaign=st-blog-feed</guid><pubDate>Tue, 06 Sep 2016 00:07:40 GMT</pubDate></item><item><title>Quotes/thoughts that I like</title><link>https://www.samueltaylor.org/articles/important-quotes.html?utm_source=rss&amp;utm_campaign=st-blog-feed</link><description>&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;There are four hard things in life. If you can do these four things, the rest
  comes easy:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Working hard&lt;/li&gt;
&lt;li&gt;Doing your best&lt;/li&gt;
&lt;li&gt;Telling the truth&lt;/li&gt;
&lt;li&gt;Taking responsibility for your actions&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;My brothers and sisters, whenever you face trials of any kind, consider it
  nothing but joy, because you know that the testing of your faith produces
  endurance; and let endurance have its full effect, so that you may be mature
  and complete, lacking in nothing. (James 1:2-4, NRSV)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The poison from which the weaker nature perishes strengthens the strong
  man–and he does not call it poison (Friedrich Nietzsche, &lt;em&gt;The Gay Science&lt;/em&gt;)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Perfection is achieved not when there is nothing more to add, but when there
  is nothing left to take away. (Antoine de Saint-Exupery)&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;em&gt;Note: the attributions on these are only a best guess&lt;/em&gt;&lt;/p&gt;</description><author>Samuel Taylor</author><guid isPermaLink="true">https://www.samueltaylor.org/articles/important-quotes.html?utm_source=rss&amp;utm_campaign=st-blog-feed</guid><pubDate>Mon, 29 Aug 2016 23:52:25 GMT</pubDate></item><item><title>Planning a trip to Europe like an engineer</title><link>https://www.samueltaylor.org/articles/europe_trip.html?utm_source=rss&amp;utm_campaign=st-blog-feed</link><description></description><author>Samuel Taylor</author><guid isPermaLink="true">https://www.samueltaylor.org/articles/europe_trip.html?utm_source=rss&amp;utm_campaign=st-blog-feed</guid><pubDate>Sat, 09 Jul 2016 05:00:00 GMT</pubDate></item><item><title>Books I read in June 2015</title><link>https://www.samueltaylor.org/articles/books-june15.html?utm_source=rss&amp;utm_campaign=st-blog-feed</link><description>&lt;p&gt;This month, I read four books:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Till We Have Faces, C.S. Lewis&lt;/li&gt;
&lt;li&gt;A Walk in the Woods, Bill Bryson&lt;/li&gt;
&lt;li&gt;Of Mice and Men, John Steinbeck&lt;/li&gt;
&lt;li&gt;Into the Wild, Jon Krakauer&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Till We Have Faces, C.S. Lewis&lt;/h2&gt;
&lt;p&gt;I really enjoyed this one. It is a re-telling by Lewis of the myth of Psyche and
Cupid. I was unfamiliar with the original myth, but appreciated this book very
much.&lt;/p&gt;
&lt;p&gt;I particularly enjoyed the clash in worldviews between a physically-oriented
worldview and one which stressed the importance of supernatural beings. The
differing advice offered to Orual by the Fox and the captain of the king's guard
is intriguing. The relationship between reason and faith is interesting, so I
enjoyed watching Orual process her thoughts on the world through the book.&lt;/p&gt;
&lt;p&gt;The book also stands out as a well-written story. I enjoyed the pace of it, and
I would highly recommend it to anyone who enjoys mythology.&lt;/p&gt;
&lt;h2&gt;A Walk in the Woods, Bill Bryson&lt;/h2&gt;
&lt;p&gt;This was a good read. Every once in a while, I would start to get a little bored
by the writing about the history or biology of the Appalachian Trail, but then
Bryson would say something to the effect of, "Enough science for now, back to
the interesting stuff."&lt;/p&gt;
&lt;p&gt;I enjoyed the interactions between Bryson, Katz, and others they ran into on the
trail.&lt;/p&gt;
&lt;p&gt;On an unrelated note, as of writing, Scott Jurek was about to break the FKT (fastest known time) for
the AT. Jurek is an impressive athlete, and it's awesome to see such a huge
accomplishment.&lt;/p&gt;
&lt;h2&gt;Of Mice and Men, John Steinbeck&lt;/h2&gt;
&lt;p&gt;Quick read -- I think it took me an afternoon. I hadn't been spoiled for it, so
I didn't know what was coming. This book was very sad, but in an enjoyable way.
In a way, the ending sort of snuck up on me. I was observing the world Steinbeck
set before me, and then suddenly the ending came. It felt abrupt, and it felt
sad.&lt;/p&gt;
&lt;h2&gt;Into the Wild, Jon Krakauer&lt;/h2&gt;
&lt;p&gt;While I wasn't a fan of how much Krakauer seemed to revere McCandless, I enjoyed
hearing about McCandless's journey.&lt;/p&gt;</description><author>Samuel Taylor</author><guid isPermaLink="true">https://www.samueltaylor.org/articles/books-june15.html?utm_source=rss&amp;utm_campaign=st-blog-feed</guid><pubDate>Sun, 26 Jun 2016 16:04:55 GMT</pubDate></item><item><title>The Last 5 Books I Read (June 2016)</title><link>https://www.samueltaylor.org/articles/books-june16.html?utm_source=rss&amp;utm_campaign=st-blog-feed</link><description>&lt;h2&gt;A View from the Bridge, Arthur Miller&lt;/h2&gt;
&lt;p&gt;This is a script I read for my American Literature class. It was alright. Even
though it has a distant setting, the characters still feel real. They each have
their own needs/desires which conflict in certain ways, which causes the
drama/action. My favorite scene is one in which they're all in the apartment,
and Catherine and Rodolfo start dancing. The subtext in that scene is very
interesting.&lt;/p&gt;
&lt;p&gt;People talk about the ending being shocking, and I don't know that I agree.
There's enough foreshadowing and character development that it isn't a surprise.&lt;/p&gt;
&lt;p&gt;Anyway, it's a fine script. I imagine it's better as an actual staged
performance, and I would probably go see it if I knew it was being performed.&lt;/p&gt;
&lt;h2&gt;Questions for All your Answers, Roger E. Olson&lt;/h2&gt;
&lt;p&gt;This would have been a good book to give a 13 year old version of myself. At
that time, I was struggling with how anti-intellectual the church seemed.
Fortunately, for a few reasons I came to find and appreciate the intellectual
tradition of Christianity. As such, I don't think I'm necessarily the target
audience for the book. Reading it is still enjoyable, but it's not
earth-shattering for me.&lt;/p&gt;
&lt;p&gt;More generally, Olson seems to be caught in the awkward task of speaking
intellectually and theologically to an audience hostile (or apathetic, at best)
toward theological thinking. His writing feels at times pulled in different
directions. In one corner, he's trying to explain complicated theological lines
of thought. In another corner, he's trying to keep it simple enough that people
without much theological background can understand it.&lt;/p&gt;
&lt;p&gt;That said, there are still parts of the book that I liked. The third chapter was
particularly good; it talks about cultural sensitivity and the Trinity in a very
practical way that still respects the ideas.&lt;/p&gt;
&lt;h2&gt;Room, Emma Donoghue&lt;/h2&gt;
&lt;p&gt;I watched the movie adaptation on a plane and enjoyed it thoroughly. The book is
written from the perspective of a child, which seems like an annoying premise.
At times the perspective is a bit annoying, but overall I was surprised by how
well it worked.&lt;/p&gt;
&lt;p&gt;All in all, I enjoyed it a good deal. Because of the narrator's limited
knowledge, I was constantly intrigued and kept reading to find out more. While I
didn't, you could probably read it in one sitting as it's relatively short.&lt;/p&gt;
&lt;h2&gt;Ready Player One, Ernest Cline&lt;/h2&gt;
&lt;p&gt;I didn't like the main character at the beginning of the book, which was not
enjoyable. I suppose you could argue that this is Cline setting up for some
character development, but that isn't a satisfying answer. Ideally, there would
be a way to establish that the character is haughty without making readers hate
him. Luckily, Parzival/Wade became a more interesting, less prideful character
as I got further in the book.&lt;/p&gt;
&lt;p&gt;At times, something would happen in the book that felt irrelevant enough to the
plot that I would be drawn out of the action and start wondering how he was
going to use that later in the book (he always did end up using or referencing
it). I understand that foreshadowing is neat, but it felt a little obvious.&lt;/p&gt;
&lt;p&gt;Some of the hacking felt unrealistic. Even within the context of a SciFi
universe, it doesn't seem like these sensitive systems should be so easily
hackable.&lt;/p&gt;
&lt;p&gt;Despite these criticisms, this was a very enjoyable read. Admittedly, I'm a
nerd, so I don't know how the reference-laden prose would feel to someone who's
not a fan of nerdy books and games. The book never felt like it was dragging,
which made reading it a good time.&lt;/p&gt;
&lt;h2&gt;Oblivion: Stories, David Foster Wallace&lt;/h2&gt;
&lt;p&gt;This book is a collection of eight short stories. Of these, I found three to be
wonderful, four or five to be good, and the other one or two weren't my cup of
tea.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Incarnations of Burned Children&lt;/em&gt; is short, gripping, and emotional. Wallace
does an excellent job of making the story seem real and important.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Another Pioneer&lt;/em&gt; is a story about an ancient society told through two or three
retellings. The levels of redirection allow Wallace to explore some narrative
branching patterns I found fascinating.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Good Old Neon&lt;/em&gt; is a story about a man who looks introspectively and finds
within himself nothingness. His attempts at dealing with this discovery are
interesting.&lt;/p&gt;
&lt;p&gt;I noticed two common patterns among all the stories. First, I often started
reading, got a few pages in, started to think to myself "This one's kinda
boring...", and then suddenly Wallace would introduce something that hooked me.&lt;/p&gt;
&lt;p&gt;Second, most (maybe all) of the stories end with a level of ambiguity. Because
of this, the stories left me thinking about them on and off in the days after
reading them.&lt;/p&gt;
&lt;p&gt;Reading this book was well worth my time.&lt;/p&gt;</description><author>Samuel Taylor</author><guid isPermaLink="true">https://www.samueltaylor.org/articles/books-june16.html?utm_source=rss&amp;utm_campaign=st-blog-feed</guid><pubDate>Sun, 26 Jun 2016 16:04:55 GMT</pubDate></item><item><title>The Last 5 Books I Read (February 2016)</title><link>https://www.samueltaylor.org/articles/books-feb.html?utm_source=rss&amp;utm_campaign=st-blog-feed</link><description>&lt;h2&gt;Redirect, Timothy Wilson&lt;/h2&gt;
&lt;p&gt;A book from a college professor about his (and others') research in "story
editing" techniques that target the narratives we tell ourselves. I particularly
appreciated the thoughts on happiness, as those are the most widely applicable.
Because I'm not a parent, I couldn't quite get into the stuff about parenting as
much. The discussions of various societal issues and how they might be addressed
were also interesting, but not particularly applicable to my life.&lt;/p&gt;
&lt;h2&gt;The Martian, Andy Weir&lt;/h2&gt;
&lt;p&gt;I liked this book a lot. The attention to detail made it feel very real and
interesting. The point of view from which it is written adds a lot of tension
and excitement. As an aside, I enjoyed the book much more than the movie
because the book is more detailed and the writing style is enjoyable.&lt;/p&gt;
&lt;h2&gt;Deep Work, Cal Newport&lt;/h2&gt;
&lt;p&gt;The first 30% of the book is an argument for why deep work is necessary in
today's economy. That 30% is good, but I was already convinced of the value of
what he calls "deep work," so it wasn't the most interesting thing I've read.&lt;/p&gt;
&lt;p&gt;He has many good suggestions. One I've liked so far is to keep a sort of
deep work scoreboard, where you log how many hours of deep work you get in each
day. Then, in your weekly review, you can use that data to keep yourself
accountable. This suggestion comes out of the idea of focusing on &lt;em&gt;lead
measurements&lt;/em&gt;, which is itself a good idea. If you have a goal (e.g. get more
personal projects done), the obvious way to measure your progress on that goal
(e.g. how many projects you've finished) is often a &lt;em&gt;lag measure&lt;/em&gt;. That is, by
the time you see the improvement on the lag measure, you've already made the
improvements in your personal processes which allowed for it. By contrast, lead
measurements help you quantify the change in your personal processes that will
ultimately enable you to achieve the goal you're working toward.&lt;/p&gt;
&lt;h2&gt;Coders at Work, Peter Seibel&lt;/h2&gt;
&lt;p&gt;This is a book of interviews with notable programmers. While some of the stories
seem dated (most of the people in the book got their start decades ago), it's
interesting to read others' reflections on programming.&lt;/p&gt;
&lt;p&gt;A detail I like is that several interviewees mention how important reading other
people's code is. I didn't really read much code until I did an internship at a
software company, and then I almost felt like I was drowning in it. One company
I worked for practiced "self-documenting code," so the way I learned what
certain pieces of code did was simply to read them. Once I got used to it, I liked
this system. That internship was the first time I really understood the value of
readable code.&lt;/p&gt;
&lt;p&gt;Douglas Crockford talks about code quality in an interesting way. I like the
idea of taking every seventh sprint to focus on improving the codebase. One of
my employers did not focus on code quality, and their company suffered for it.
At a certain point it becomes difficult to retain developers when the codebase
makes it harder to develop and easier to introduce bugs. I'm sure they didn't
intend to get to that point, which demonstrates the importance of understanding
quality as an ongoing process.&lt;/p&gt;
&lt;p&gt;I enjoyed Joshua Bloch's thought that some coding is more similar to writing
prose than it is to mathematics. The way he talks about creating good APIs and
readable code inspires me to be a better programmer and designer.&lt;/p&gt;
&lt;h2&gt;The Two Towers, J.R.R. Tolkien&lt;/h2&gt;
&lt;p&gt;Another interesting book in the series. The sense of adventure and the grandness
of the world make reading this book enjoyable.&lt;/p&gt;
&lt;p&gt;I want to be Treebeard when I grow up. He's such an interesting character, and I
think his orientation to time is very interesting. A detail I like in particular
is that his name in the ent language is very long, as ents believe that names
should tell something of the thing's story.&lt;/p&gt;
&lt;p&gt;At times, I get lost in all the detail and have difficulty keeping my mental
picture straight. Perhaps I would gain more out of the reading if I were to hold
on to more details, but that feels like more work than I want to put into
reading this book. I don't mean to be lazy; I would just rather spend my mental
energy elsewhere. Though I lack a lot of the detail (the geography of the area,
for instance, is completely lost on me), I feel like I understand the story well
still, and I still enjoy reading it.&lt;/p&gt;</description><author>Samuel Taylor</author><guid isPermaLink="true">https://www.samueltaylor.org/articles/books-feb.html?utm_source=rss&amp;utm_campaign=st-blog-feed</guid><pubDate>Sat, 20 Feb 2016 20:13:01 GMT</pubDate></item><item><title>Teaching Sign Language with Leap Motion + Machine Learning</title><link>http://blog.leapmotion.com/asl-tutor-teaching-sign-language-leap-motion-machine-learning/</link><description></description><author>Samuel Taylor</author><guid isPermaLink="true">http://blog.leapmotion.com/asl-tutor-teaching-sign-language-leap-motion-machine-learning/</guid><pubDate>Sun, 14 Feb 2016 06:00:00 GMT</pubDate></item><item><title>Useful Python language features for interviews</title><link>https://www.samueltaylor.org/articles/python-tools-for-interviews.html?utm_source=rss&amp;utm_campaign=st-blog-feed</link><description>&lt;h2&gt;&lt;code&gt;collections.namedtuple&lt;/code&gt;&lt;/h2&gt;
&lt;p&gt;&lt;code&gt;namedtuple&lt;/code&gt; can make your code a lot more readable. In an interview, that's
helpful for a few reasons. First, it can help you demonstrate a good
understanding of some of Python's standard libraries. Second, it helps you show
off that you place importance on writing readable code. Third, it makes writing
your code easier. If you're passing around tuples, it can be easy to forget
what the object at each index into the tuple is. Using a &lt;code&gt;namedtuple&lt;/code&gt; can help
you avoid that.&lt;/p&gt;
&lt;p&gt;Consider the case where you need to represent colors. You could choose to do so
with a 3-tuple of the form &lt;code&gt;(i, j, k)&lt;/code&gt; (where &lt;code&gt;i&lt;/code&gt;, &lt;code&gt;j&lt;/code&gt;, and &lt;code&gt;k&lt;/code&gt; are integers on
the range 0-255). This representation seems intuitive and natural enough. &lt;code&gt;i&lt;/code&gt;
could be the value for red, &lt;code&gt;j&lt;/code&gt; for green, and &lt;code&gt;k&lt;/code&gt; for blue. A problem with this
approach is that you may forget which of the three numbers represents which
primary color of light. Using a &lt;code&gt;namedtuple&lt;/code&gt; could help with this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Color = namedtuple('Color', ['red', 'green', 'blue'])
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;What does this change? Well, building a &lt;code&gt;Color&lt;/code&gt; is almost the same as building
that tuple you were previously building. Instead of doing &lt;code&gt;(i, j, k)&lt;/code&gt;, you'll
now write &lt;code&gt;Color(i, j, k)&lt;/code&gt;. This is perhaps a little more readable, and it adds
some more semantic meaning to your code. We're no longer just building a tuple;
we're building a &lt;code&gt;Color&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The real win for &lt;code&gt;namedtuple&lt;/code&gt; is in access to its elements.  Before, to get the
red value for a color &lt;code&gt;c&lt;/code&gt;, we would use brackets: &lt;code&gt;c[0]&lt;/code&gt;. By comparison, if we
have a &lt;code&gt;Color&lt;/code&gt; called &lt;code&gt;c&lt;/code&gt;, we could use a more friendly dot syntax: &lt;code&gt;c.red&lt;/code&gt;. In
my experience, while not having to remember the index of the red element is
nice, the real win is in how much more readable &lt;code&gt;c.red&lt;/code&gt; is in contrast to
&lt;code&gt;c[0]&lt;/code&gt;.&lt;/p&gt;
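&lt;p&gt;To make the comparison concrete, here is a minimal sketch. The &lt;code&gt;teal&lt;/code&gt; and &lt;code&gt;darker&lt;/code&gt; values are made up for illustration; only &lt;code&gt;Color&lt;/code&gt; comes from the definition above. It shows that a &lt;code&gt;namedtuple&lt;/code&gt; is still a regular tuple, so indexing and unpacking keep working alongside the dot syntax.&lt;/p&gt;

```python
from collections import namedtuple

# Same definition as above.
Color = namedtuple('Color', ['red', 'green', 'blue'])

# A hypothetical color chosen for illustration.
teal = Color(red=0, green=128, blue=128)

# Fields are readable by name, but the object is still a tuple,
# so indexing and unpacking keep working.
assert teal.green == teal[1] == 128
red, green, blue = teal

# namedtuples are immutable; _replace returns a modified copy.
darker = teal._replace(green=64, blue=64)
```

&lt;p&gt;Because the fields are immutable, &lt;code&gt;_replace&lt;/code&gt; leaves &lt;code&gt;teal&lt;/code&gt; untouched and hands back a new &lt;code&gt;Color&lt;/code&gt;.&lt;/p&gt;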
&lt;h2&gt;&lt;code&gt;collections.defaultdict&lt;/code&gt; and &lt;code&gt;collections.Counter&lt;/code&gt;&lt;/h2&gt;
&lt;p&gt;Suppose your interviewer asks you to find the most common string in a list of
strings. We can solve this problem using a &lt;code&gt;defaultdict&lt;/code&gt; (let's call it &lt;code&gt;d&lt;/code&gt;). We
could loop through the list, incrementing &lt;code&gt;d[elem]&lt;/code&gt; for each element. Then, we
just find the one we saw most. The implementation would look like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def most_common_dd(lst):
    d = defaultdict(int)
    for e in lst:
        d[e] += 1

    # items() works on Python 3 (iteritems() only exists on Python 2)
    return max(d.items(), key=lambda t: t[1])
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Apparently, users and maintainers of Python saw this pattern enough that they
decided to create &lt;code&gt;Counter&lt;/code&gt;. &lt;code&gt;Counter&lt;/code&gt; lets us write a much more succinct
version of this function, because &lt;code&gt;Counter&lt;/code&gt; encapsulates the process of counting
the number of occurrences of elements in an iterable. Implementing this
functionality with a &lt;code&gt;Counter&lt;/code&gt; object would look like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def most_common_ctr(lst):
    return Counter(lst).most_common(1)[0]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These both have the same result:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from collections import Counter, defaultdict

strings = ['bear', 'down', 'you', 'bears', 'of', 'old', 'baylor', 'u', "we're",
        'all', 'for', 'you', "we're", 'gonna', 'show', 'dear', 'old', 'baylor',
        'spirit', 'through', 'and', 'through', 'come', 'on', 'and', 'fight',
        'them', 'with', 'all', 'your', 'might', 'you', 'bruins', 'bold', 'and',
        'win', 'all', 'our', 'victories', 'for', 'the', 'green', 'and', 'gold']

'''
definitions for most_common_ctr and most_common_dd
'''

assert most_common_dd(strings) == most_common_ctr(strings)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;But the version using &lt;code&gt;Counter&lt;/code&gt; is more concise.&lt;/p&gt;
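&lt;p&gt;A common follow-up question is to return the top few elements rather than just one. Here is a small sketch of how &lt;code&gt;Counter&lt;/code&gt; handles that; the &lt;code&gt;words&lt;/code&gt; list is a made-up input with no tied counts, so the ordering is unambiguous.&lt;/p&gt;

```python
from collections import Counter

# Hypothetical input, chosen so there are no ties between counts.
words = ['a', 'b', 'a', 'c', 'b', 'a']

counts = Counter(words)

# most_common(k) returns the k most frequent (element, count) pairs.
top_two = counts.most_common(2)

# Missing elements count as zero instead of raising KeyError.
missing = counts['z']
```

&lt;p&gt;Getting the same behavior from a &lt;code&gt;defaultdict&lt;/code&gt; would mean sorting the items by count yourself.&lt;/p&gt;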
&lt;h2&gt;Comprehensions&lt;/h2&gt;
&lt;p&gt;I love list comprehensions. They can make code much more concise and readable.
Consider a problem where we have a start point and an end point on a grid:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;|S|_|_|_|
|_|_|_|_|
|_|_|_|_|
| | | |E|
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let's further say that from a given cell, you can travel up, down, left or right
into another cell (but not diagonally). We may want to do a breadth-first search
to find the minimum cost to get from the start to the end. At some point, we'll
need to push the neighbors of the current cell onto the queue we're using for
the BFS. This could look something like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;for neigh in neighbors(cell):
    # validate neigh
    queue.append(neigh)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;How should &lt;code&gt;neighbors(cell)&lt;/code&gt; work? Well, we could use a double for loop to
generate the neighbors:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def neighbors(cell):
    for i in range(-1, 2):
        for j in range(-1, 2):
            if i == 0 and j == 0 or abs(i) + abs(j) &amp;gt; 1:
                continue
            yield (cell[0] + i, cell[1] + j)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This works, but it's ugly. Instead, we could use a list comprehension:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;DIRS = [(0, 1), (0, -1), (1, 0), (-1, 0)]
def neighbors(cell):
    return [(cell[0] + d[0], cell[1] + d[1]) for d in DIRS]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We're also probably going to want to keep track of which cells we've already
visited (so we don't try to go back through them). We could create a matrix of
&lt;code&gt;bool&lt;/code&gt;s the same size as our original grid (let's call it &lt;code&gt;visited&lt;/code&gt;) and set
&lt;code&gt;visited[r][c]&lt;/code&gt; when we visit the cell located at row &lt;code&gt;r&lt;/code&gt; and column &lt;code&gt;c&lt;/code&gt;. But
how should we initialize this matrix? We could do something like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;visited = []
for i in range(n):
    visited.append([])
    for j in range(n):
        visited[i].append(False)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;But list comprehensions can make this much more concise:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;visited = [[False for _ in range(n)] for _ in range(n)]
&lt;/code&gt;&lt;/pre&gt;
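&lt;p&gt;Putting the pieces together, a full search might look like the sketch below. It assumes an open &lt;code&gt;n&lt;/code&gt;-by-&lt;code&gt;n&lt;/code&gt; grid with no blocked cells and a cost of one per move; the function name &lt;code&gt;shortest_path_length&lt;/code&gt; is my own and purely illustrative.&lt;/p&gt;

```python
from collections import deque

DIRS = [(0, 1), (0, -1), (1, 0), (-1, 0)]


def neighbors(cell):
    return [(cell[0] + d[0], cell[1] + d[1]) for d in DIRS]


def shortest_path_length(n, start, end):
    """Return the number of moves from start to end on an open n-by-n grid."""
    visited = [[False for _ in range(n)] for _ in range(n)]
    visited[start[0]][start[1]] = True
    queue = deque([(start, 0)])  # (cell, moves taken so far)

    while queue:
        cell, dist = queue.popleft()
        if cell == end:
            return dist
        for r, c in neighbors(cell):
            # Validate the neighbor: on the grid and not yet visited.
            if r in range(n) and c in range(n) and not visited[r][c]:
                visited[r][c] = True
                queue.append(((r, c), dist + 1))

    return None  # end is unreachable
```

&lt;p&gt;Marking a cell as visited when it is pushed, rather than when it is popped, keeps the same cell from being queued twice.&lt;/p&gt;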
&lt;p&gt;The possibilities with list comprehensions are just about endless, so I'll leave
it at that!&lt;/p&gt;</description><author>Samuel Taylor</author><guid isPermaLink="true">https://www.samueltaylor.org/articles/python-tools-for-interviews.html?utm_source=rss&amp;utm_campaign=st-blog-feed</guid><pubDate>Sat, 17 Oct 2015 18:03:09 GMT</pubDate></item><item><title>Hackathon report: TAMUHack 2015</title><link>https://www.samueltaylor.org/articles/tamuhack-2015.html?utm_source=rss&amp;utm_campaign=st-blog-feed</link><description>&lt;h2&gt;Idea&lt;/h2&gt;
&lt;p&gt;Two weeks ago, &lt;a href="/articles/hacktx-2015.html"&gt;I went to HackTX&lt;/a&gt; and won a Leap
Motion. While thinking about what things we could build with skeletal tracking
and gesture recognition, I thought it would be cool to build a language learning
tool (like Rosetta Stone) for American Sign Language. My friend
&lt;a href="http://matttinsley.org/"&gt;Matt&lt;/a&gt; also thought it sounded cool, so we decided to
build something like that at TAMUHack.&lt;/p&gt;
&lt;h2&gt;Environment&lt;/h2&gt;
&lt;p&gt;TAMUHack was fun. The venue was called "The Zone", which is a big room in A&amp;amp;M's
stadium. All 300 of us were in this massive room, along eight big tables. Being
in the same room as everyone else was really cool; you felt like you were all a
part of something. I've been to other hackathons where I've not been able to
find a seat in the main areas; being separated from the rest of the teams is not
fun. The organizers of TAMUHack found a great solution to that problem--put
everyone together!&lt;/p&gt;
&lt;h2&gt;Project evolution&lt;/h2&gt;
&lt;p&gt;We started to build something that would simply transcribe signs of the ASL
alphabet as a user signed them above the Leap Motion. By around
&lt;a href="https://github.com/ssaamm/sign-language-tutor/commit/35aac3168fd37a984087dc269607aa815e382fc8"&gt;3:00am&lt;/a&gt;,
we had that more or less working. Playing around with it, we knew it definitely
wasn't perfect, but it showed promise.&lt;/p&gt;
&lt;p&gt;The Leap Motion is not particularly well-suited to sign language recognition. In
our research during the hackathon, we found a research paper that said the Leap
Motion in its current state &lt;a href="http://www98.griffith.edu.au/dspace/bitstream/handle/10072/59247/89839_1.pdf"&gt;isn't a good choice for recognizing Australian Sign
Language&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;As an initial run at solving this, we decided to implement some simple Markov
chain analysis. The idea was that if certain letters commonly precede others, we
should be able to figure out that a person signing "q" will probably sign "u"
next. That idea didn't end up helping us out all that much; we tore it out
later. After we input some more training data, the recognition was good enough
that we could work with it.&lt;/p&gt;
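&lt;p&gt;For the curious, the Markov chain idea can be sketched roughly as follows. This is not the code we wrote at the hackathon; the function names, the add-one smoothing, and the shape of the classifier scores (a dict mapping each candidate letter to a probability) are all assumptions made for illustration.&lt;/p&gt;

```python
from collections import Counter, defaultdict


def build_transitions(words):
    """Count how often each letter follows another across a list of words."""
    transitions = defaultdict(Counter)
    for word in words:
        for prev, nxt in zip(word, word[1:]):
            transitions[prev][nxt] += 1
    return transitions


def rescore(prev_letter, scores, transitions, alphabet_size=26):
    """Pick the candidate letter with the best score-times-prior product.

    scores is a hypothetical dict of letter -> classifier probability.
    """
    counts = transitions[prev_letter]
    total = sum(counts.values())

    def weighted(letter):
        # Add-one smoothing so unseen letter pairs keep a small, nonzero prior.
        prior = (counts[letter] + 1) / (total + alphabet_size)
        return scores[letter] * prior

    return max(scores, key=weighted)


# Toy corpus: "u" always follows "q" here.
transitions = build_transitions(['queen', 'quick', 'quiz'])
best = rescore('q', {'u': 0.4, 'v': 0.5}, transitions)
```

&lt;p&gt;In this toy example the classifier slightly prefers "v", but the letter-pair prior tips the decision to "u".&lt;/p&gt;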
&lt;p&gt;At that point, we had some time left and felt like we could keep going to make
something cooler than what we had. We decided to make the language learning tool
we'd originally planned on making. By
&lt;a href="https://github.com/ssaamm/sign-language-tutor/commit/90bed1fbe605bec0932b3ccb6e41e685dc70ea3b"&gt;8:00am&lt;/a&gt;,
we had a basic version working. The app would show you a picture of a sign and
ask you to replicate it. Once you had, it would give you 100 points and pick
another sign for you to make. After 30 seconds, you could enter your score on
our leaderboard. We decided to make our project into a game because it seemed
like a fun way to demo the tech we had built.&lt;/p&gt;
&lt;p&gt;We ended up
&lt;a href="https://github.com/ssaamm/sign-language-tutor/commit/c0fb39b539a579419516ae6e6bcfc2b59452caf2"&gt;finishing&lt;/a&gt;
about an hour before projects were due. We were so happy to have built something
so cool and fun to make in such a short amount of time.&lt;/p&gt;
&lt;h2&gt;Presentations&lt;/h2&gt;
&lt;p&gt;We set up our area with our laptops and the two external monitors we brought. We
each ran a copy of our app on an external monitor and had the Leap Motion
visualizer on our laptop screens. This ended up being super useful; we could
show people what the Leap Motion was seeing in real time.&lt;/p&gt;
&lt;p&gt;Getting to show off our project to judges and other hackers was super fun.
People thought it was super cool and were excited to play around with it.&lt;/p&gt;
&lt;p&gt;We got into the top six and were asked to present at closing ceremonies. Awesome!
It was a little rushed because things were running late, but I still enjoyed
getting to talk about our hack in front of everyone.&lt;/p&gt;
&lt;p&gt;Apparently the judges also thought it was cool, because Matt and I won second
place overall!&lt;/p&gt;
&lt;h2&gt;Thanks&lt;/h2&gt;
&lt;p&gt;TAMUHack was super fun. Huge thanks to the organizers, volunteers, and judges.&lt;/p&gt;</description><author>Samuel Taylor</author><guid isPermaLink="true">https://www.samueltaylor.org/articles/tamuhack-2015.html?utm_source=rss&amp;utm_campaign=st-blog-feed</guid><pubDate>Sun, 11 Oct 2015 15:39:45 GMT</pubDate></item><item><title>Political implications of BitTorrent</title><link>https://www.samueltaylor.org/articles/bittorrent.html?utm_source=rss&amp;utm_campaign=st-blog-feed</link><description>&lt;p&gt;BitTorrent is an inherently political technology which embodies decentralized
political order. Additionally, it broadens the definition of art. Despite the
negative side effects of the technology, BitTorrent is worth pursuing.&lt;/p&gt;
&lt;p&gt;What does it mean to say that a technology is political? Winner outlines two
possibilities:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I…offer…two ways in which artifacts can contain political properties. First
are instances in which the invention, design, or arrangement of a specific
technical device or system becomes a way of settling an issue in a particular
community.  Second are cases of what can be called inherently political
technologies, man-made systems that appear to require, or to be strongly
compatible with, particular kinds of political relationships.…By “politics,” I
mean arrangements of power and authority in human associations as well as the
activities that take place within those arrangements.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;What does it mean to say that a technology embodies decentralized political
order? Applying Winner’s thought, such a technology would either appear to
require or be strongly compatible with a kind of arrangement of power and
authority with regard to human associations and activities within such
associations. Inasmuch as the information age has made information power, to say
that a technology is strongly compatible with decentralized political
relationships means that the technology decentralizes control of information. We
would expect such technologies to be dangerous to centralized power structures.&lt;/p&gt;
&lt;p&gt;BitTorrent decentralizes control of information, and thereby embodies
decentralized political order. Each person in the swarm has the power to
download the information, and each person in the swarm also has the
responsibility to upload the information to their peers. In fact, with the usage
of distributed hash table technology, even a centralized tracker is unnecessary;
peers can coordinate file transfer themselves without the need for a tracker
(BitTorrent.org).&lt;/p&gt;
&lt;p&gt;Decentralization of information control and access is the natural end of
BitTorrent. Imagine that BitTorrent exists in some centralized power structure
where only one torrent tracker exists. In order for some authoritarian,
centralized power to keep said power, it would have to be able to keep control
over the way that BitTorrent is used. And for a time, that central power could
control BitTorrent. But starting a new tracker is so easy that controlling such
action over a long period of time would be almost impossible. To prevent people
from starting new trackers and attracting users away from the Official Tracker™
would require coercion on a scale that is hard to imagine, let alone implement.&lt;/p&gt;
&lt;p&gt;Furthermore, BitTorrent is dangerous to centralized power structures. Look at
the example of music. Record labels are a powerful, centralized entity in the
realm of music. If BitTorrent is a threat to centralized power, then we should
expect to see record labels seeking to control BitTorrent. The Recording
Industry Association of America (an organization made of record
labels/distributors) seeks to exercise control over the ways in which people use
BitTorrent. The RIAA targets both users and trackers, seeking penalties severe
enough to scare people away from using BitTorrent to share
music.&lt;/p&gt;
&lt;p&gt;Consider another example of a centralized power structure: the People’s Republic
of China. China censors much of the internet and has started to censor
BitTorrent websites (Van der Sar). These two examples of centralized powers
fighting to control BitTorrent provide a compelling argument that there is
something about BitTorrent which encourages decentralization of power.&lt;/p&gt;
&lt;p&gt;Another way that BitTorrent decentralizes power is in the way that people
discover content. Again, consider the example of music. In the past, certain
people have had much more power over the sharing of music than others; radio
DJs, journalists, and the like were able to exert power over the music people
listen to. Before the advent of peer-to-peer technology, a non-DJ’s ability to
share music with others was limited to those physically/geographically nearby.
Peer-to-peer technology allows for music sharing to occur through the internet;
a user can now share her favorite music with anyone (Franchini). BitTorrent
enables users to share not only the knowledge of some song or artist, but the
very music itself.&lt;/p&gt;
&lt;p&gt;Perhaps the most beneficial societal contribution offered by BitTorrent is the
decentralization of content distribution. It allows creation and distribution of
art to happen without the support of powerful backers. Before BitTorrent,
distributing a television show required some power over the broadcasters. Even
in the internet age, the bandwidth costs of distributing a “television” show can
be very expensive. The unique opportunity afforded by BitTorrent is to share the
load of distributing the content among a large “swarm” of peers. Because it
drives distribution costs down, BitTorrent liberates content creators from
distributors; they can distribute their own content.&lt;/p&gt;
&lt;p&gt;It also changes the ways in which users support their favorite artists. In a
world overwhelmed with file sharing, supporting an artist has become less about
buying the artist’s physical records/CDs and more about buying band merchandise
and tickets to concerts (Franchini). This change in artists’ revenue models from
being primarily based on selling albums to being based on selling concert
tickets and merchandise is also recognized by the artists themselves. Winston
Marshall, the guitarist for Mumford &amp;amp; Sons, says that “Music is changing.…We
look at our albums as…adverts for our live shows” (Stern).&lt;/p&gt;
&lt;p&gt;Combining these effects, BitTorrent decreases the distance between content
creators and content consumers, thereby encouraging more people to become
content creators. Consumers no longer must go through a middle man to access
their favorite creators’ work. They also take an active role in the re-creation
of said work. As a result, consumers develop more direct relationships with the
creators of content they like. Finally, because distribution costs are lower,
consumers are more likely to become creators, and they will not have to seek the
help of powerful distribution/broadcasting middle men. BitTorrent removes the
necessity for a powerful middle man.&lt;/p&gt;
&lt;p&gt;A counter-argument to the claim that BitTorrent embodies decentralized power is
that certain players in the BitTorrent ecosystem possess more power than others.
The Pirate Bay, for instance, is a huge tracker which has lots of power.
However, the existence of powerful players within a system does not imply a lack
of decentralization of power. Users can still choose whether to use the
mega-websites or the smaller ones. An abundance of torrent websites still exist
and have power. This means that even though some are more powerful than others,
power is largely decentralized in the BitTorrent ecosystem.&lt;/p&gt;
&lt;p&gt;This decentralization of power is a good thing. Distributed power is inherently
good in a society which values not being dominated by another person. If power
is centralized, then the entity with the power is able to dominate whomever they
so desire. American society values not being dominated, so this decentralization
of power brought about by BitTorrent is good for society.&lt;/p&gt;
&lt;p&gt;Another effect of BitTorrent (distinct from the decentralization of power) is
that it changes what the word “art” means. Rodriguez-Ferrandiz discusses the
effect that digital copies have on art as a whole in an abstract sense. In
essence, the “original” work becomes less important.
Possessing the original version of a song does not matter all that much when
every copy of a song is perfect. In that BitTorrent makes the recreation of art
extremely inexpensive and completely accurate, the quality and accessibility of
copies of individual works of art mean that having the original is not
significantly better than having a copy for most individuals.&lt;/p&gt;
&lt;p&gt;Rodriguez-Ferrandiz specifically writes of photography, noting that it has
caused “the focus of interest” to switch “from the work as a singularity that
physically retains the creator’s touch to a vision of the work as a multipliable
and liberated piece which removes distinctions between original and copy”. An
earlier author, Benjamin, who is cited by Rodriguez-Ferrandiz, refers to the
distinction between original and copy as the “aura.” What of digital art, then,
for which there is no difference between originals and copies?
Rodriguez-Ferrandiz argues that a “paradoxical aura” exists for such art.
Because the art is not defined by the way that it is represented in binary on a
hard disk, it “transcends physical form” and becomes “immortal.” Though
BitTorrent does not qualitatively change this trend or contribute to it in a
novel way, it does offer a quantitatively larger realization of this immortality
by making the reproduction of digital art far easier.&lt;/p&gt;
&lt;p&gt;This change in the meaning of art is a good thing. It broadens art to include
digital arts, giving artists a new medium for creativity. In American society,
creativity is valued, so this change is good for society.&lt;/p&gt;
&lt;p&gt;Of course, the technology is not without its drawbacks. Nothing inherent to the
protocol stops its use from including mass distribution of child pornography or
other unquestionably bad things. The entire idea of decentralized control is
antithetical to the censorship of BitTorrent as a medium (whether or not the
censorship is of things society generally agrees are bad). The government will
not be able to stop the spread of child pornography through BitTorrent.&lt;/p&gt;
&lt;p&gt;This inability to censor terrible things is not a reason to stop usage of
BitTorrent technologies. First, this problem is not unique to BitTorrent. Many
technologies make spreading morally repulsive content much easier (e.g. the
internet, books, compact discs, pencils). Second, BitTorrent requires a large
swarm of users for effectiveness. To say that certain things are generally
agreed upon to be unacceptable in a society implies that the number of people
who will participate in such behavior is low. Thus, BitTorrent is a bad fit for
child pornographers and sharers of other repulsive content.&lt;/p&gt;
&lt;p&gt;BitTorrent embodies decentralized political order. It broadens the definition of
art. Because these are both good things, BitTorrent is a technology that is
worth pursuing despite its drawbacks.&lt;/p&gt;
&lt;h2&gt;Works cited&lt;/h2&gt;
&lt;p&gt;BitTorrent.org. “BEP 5: DHT Protocol.” &lt;a href="http://www.bittorrent.org/beps/bep_0005.html"&gt;Link&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Van der Sar, Ernesto. “China Hijacks Popular BitTorrent Sites.” TorrentFreak. 8
Nov 2008.&lt;/p&gt;
&lt;p&gt;Rodriguez-Ferrandiz, Raul. “Benjamin, BitTorrent, bootlegs: auratic piracy
cultures?.” International journal of communication.&lt;/p&gt;
&lt;p&gt;Stern, Marlow. “Mumford &amp;amp; Sons Diss Jay Z’s Tidal.” The Daily Beast. 12 April
2015.&lt;/p&gt;
&lt;p&gt;Winner, Langdon. “Do Artifacts Have Politics?” Daedalus, Vol. 109, No. 1, Modern
Technology: Problem or Opportunity? (Winter, 1980), pp. 121-136. &lt;/p&gt;</description><author>Samuel Taylor</author><guid isPermaLink="true">https://www.samueltaylor.org/articles/bittorrent.html?utm_source=rss&amp;utm_campaign=st-blog-feed</guid><pubDate>Thu, 08 Oct 2015 01:42:24 GMT</pubDate></item><item><title>Hackathon report: HackTX 2015</title><link>https://www.samueltaylor.org/articles/hacktx-2015.html?utm_source=rss&amp;utm_campaign=st-blog-feed</link><description>&lt;h2&gt;Idea&lt;/h2&gt;
&lt;p&gt;My team and I started thinking of ideas on the drive down to Austin. We ended up
deciding to do something that would automate the scheduling of meetings.&lt;/p&gt;
&lt;h2&gt;Project&lt;/h2&gt;
&lt;p&gt;We started writing a Python/Flask app hosted on Azure. Getting the project
deployed initially was easy, as was setting up continuous integration. Last
year, we wrote a PHP web app, and none of us were able to use our computers to
test. In essence, the production server was also our development server. Using
Flask was awesome because it comes with a development server.&lt;/p&gt;
&lt;p&gt;Wes started to work on the UI, I worked on figuring out a way to read people's
emails with Context.IO, and Evan worked on user account management. He started
to use Flask-User, but then we couldn't get it to work on the Azure
configuration we had set up. It was late at night, I wasn't sure what other
library to use, and Wes was starting to hate Python, so we made a hard decision
and switched everything to PHP.&lt;/p&gt;
&lt;p&gt;At this point, we had to set up DeployBot to do continuous integration, and we
went back to the issue of not having servers to do development on. As a result,
the git log got pretty terrible.&lt;/p&gt;
&lt;p&gt;I got a script to check a user's email inbox for emails that looked like someone
trying to schedule a meeting and set it up to run on a cron job while Evan
worked on our SendGrid integration.&lt;/p&gt;
&lt;p&gt;The actual scheduling logic came into play much later in the day than we would
have hoped. Luckily, it didn't turn out to be too challenging, so we were able
to get it implemented and finally create our app's core functionality.&lt;/p&gt;
&lt;p&gt;In the end, our product was definitely more hacky than any of us would have
liked, but it worked well enough to demonstrate.&lt;/p&gt;
&lt;h2&gt;Presentation&lt;/h2&gt;
&lt;p&gt;Once again, presentations were "science fair" style this year, which was great.
Several judges came around and asked about our project. Our pitch was something
like this:&lt;/p&gt;
&lt;p&gt;I've been doing job hunting lately, which involves a good amount of emailing
back and forth to coordinate interviews. This process is a tedious chore; sounds
like a job for computers! We built Schedule Ninja, an awesome computerized ninja
that slices and dices your meetings so you don't have to.&lt;/p&gt;
&lt;p&gt;Users log in with their Google account, which we use to pull in their
availability through Google Calendar. We read their email in order to find
messages that look like someone trying to set up a meeting. From those emails,
we generate a request on the user's dashboard that they can either accept or
deny. If they accept the request, Schedule Ninja emails the requester back with
the user's availability and asks them to click a link to confirm their meeting
time.&lt;/p&gt;
&lt;p&gt;Schedule Ninja can also be used to request a meeting with someone else. The user
types in an email address, and we detect whether that person is on our service.
If they are, we are able to avoid email altogether, compare the two people's
schedules, and set up a meeting for them automatically.&lt;/p&gt;
&lt;p&gt;We got some great feedback from several judges who wanted to sign up for the
service immediately. That felt very validating; we had built something users
actually wanted! Unfortunately, we didn't place overall, but we did end up
winning sponsor prizes from Microsoft and Indeed.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;We had a lot of fun and built a useful product in 24 hours. Big thanks to the
HackTX organizers, Context.IO, Square, Microsoft, and Indeed for all their
feedback and help. If you've not gone to a hackathon, you should definitely sign
up! They're super fun!&lt;/p&gt;
&lt;p&gt;If you want to get in touch for any reason, I can be reached at
sgt@samueltaylor.org. Thanks!&lt;/p&gt;</description><author>Samuel Taylor</author><guid isPermaLink="true">https://www.samueltaylor.org/articles/hacktx-2015.html?utm_source=rss&amp;utm_campaign=st-blog-feed</guid><pubDate>Tue, 29 Sep 2015 01:36:29 GMT</pubDate></item><item><title>The 10 Best Ingredients for Cheap Cooking</title><link>https://www.samueltaylor.org/articles/best-ingredients-for-cheap-cooking.html?utm_source=rss&amp;utm_campaign=st-blog-feed</link><description>&lt;p&gt;The ten most frequently-used ingredients on &lt;a href="http://budgetbytes.com"&gt;Budget
Bytes&lt;/a&gt; are:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Salt&lt;/li&gt;
&lt;li&gt;Garlic&lt;/li&gt;
&lt;li&gt;Olive oil&lt;/li&gt;
&lt;li&gt;Eggs&lt;/li&gt;
&lt;li&gt;Brown sugar&lt;/li&gt;
&lt;li&gt;Oregano&lt;/li&gt;
&lt;li&gt;Water&lt;/li&gt;
&lt;li&gt;Cumin&lt;/li&gt;
&lt;li&gt;Yellow onion&lt;/li&gt;
&lt;li&gt;Pepper&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This information was gathered using a &lt;a href="https://github.com/ssaamm/recipe-scraper/"&gt;scraper I
wrote&lt;/a&gt; with Python 3, BeautifulSoup,
and Requests.&lt;/p&gt;</description><author>Samuel Taylor</author><guid isPermaLink="true">https://www.samueltaylor.org/articles/best-ingredients-for-cheap-cooking.html?utm_source=rss&amp;utm_campaign=st-blog-feed</guid><pubDate>Sat, 12 Sep 2015 15:51:54 GMT</pubDate></item><item><title>Notes: Boston Python User Group, Lightning Talks, 22 June 2015</title><link>https://www.samueltaylor.org/articles/notes-boston-python-22-jun-2015.html?utm_source=rss&amp;utm_campaign=st-blog-feed</link><description>&lt;p&gt;On 22 June 2015, the Boston Python User Group had a night of seven lightning
talks. These are notes I took for personal use; they're not a perfect re-telling
of what each talk was about (or even what each talk was called).&lt;/p&gt;
&lt;h2&gt;#1 Python for making connections in groups&lt;/h2&gt;
&lt;h3&gt;Speaker: John Hess&lt;/h3&gt;
&lt;p&gt;John and a friend distant from him in his social graph each ended up being stood
up by friends at the same bar. They decided to sit down and solve the world's
problems. They ended up enjoying their time, so John wanted to find a way to
automate this sort of process.&lt;/p&gt;
&lt;p&gt;The idea is something like this:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A group of people sign up on &lt;a href="https://www.mavenbot.com"&gt;Maven&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;The service selects a group of about 4-6 people and texts them to see if
  they're available&lt;/li&gt;
&lt;li&gt;As people accept or decline invitations to the event, Maven will text more
  people from the group to get them in on the event&lt;/li&gt;
&lt;li&gt;Maven then puts everyone into a group message so they can organize the event&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;John found that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Building stuff is easy in Python&lt;/li&gt;
&lt;li&gt;Python is a Swiss army knife, but it can't do everything (for example, mobile
  development)&lt;/li&gt;
&lt;li&gt;While building stuff is easy, building stuff that is user friendly is &lt;em&gt;really
  hard&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;John kept iterating, putting the product in front of friends, getting
    feedback, and trying new things&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;#2 Django cloud management&lt;/h2&gt;
&lt;h3&gt;Speaker: Robert Paul Chase&lt;/h3&gt;
&lt;p&gt;I was semi-lost on this one. The project was related to genetics somehow, and I
know nothing about computational genetics.&lt;/p&gt;
&lt;p&gt;He built a cloud management platform that lets biologists and researchers (read:
not developers) easily spin up nodes, install necessary software, run their
code, and kill their cluster when they're done with it.&lt;/p&gt;
&lt;h2&gt;#3 &lt;code&gt;.format()&lt;/code&gt;ing without tears&lt;/h2&gt;
&lt;h3&gt;Speaker: Richard Landau&lt;/h3&gt;
&lt;p&gt;The standard &lt;code&gt;str.format()&lt;/code&gt; method in Python will throw a &lt;code&gt;KeyError&lt;/code&gt; if a name
isn't found in the dictionary. Rick made his own function to avoid that problem.
Here's how it works:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Uses a regex to get the names both with and without braces (e.g. "{foo}" and
  "foo")&lt;/li&gt;
&lt;li&gt;Zips those two arrays together (to get &lt;code&gt;('foo', '{foo}'), ('bar', '{bar}'),
  ('baz', '{baz}')&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Constructs a dictionary from that array (the format of the tuples will work
  with &lt;code&gt;dict()&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;
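&lt;p&gt;The steps above can be sketched roughly as follows. This is a reconstruction
from these notes, not Rick's actual code: the function name
&lt;code&gt;format_keeping_unknown&lt;/code&gt; is hypothetical, and the last step (laying the real
values over the placeholders before formatting) is an assumption about how the
pieces fit together.&lt;/p&gt;

```python
import re


def format_keeping_unknown(template, **params):
    # Names with braces, e.g. ['{foo}', '{bar}', '{baz}'].
    braced = re.findall(r'\{\w+\}', template)
    # The same names without braces, e.g. ['foo', 'bar', 'baz'].
    bare = re.findall(r'\{(\w+)\}', template)
    # dict() accepts (key, value) tuples, so every name maps to itself in braces.
    values = dict(zip(bare, braced))
    # Assumption: the caller's real values then replace the placeholders.
    values.update(params)
    return template.format(**values)


print(format_keeping_unknown('{foo} {bar} {baz}',
                             foo='this is foo', bar='this is bar'))
# this is foo this is bar {baz}
```

&lt;p&gt;Because every name in the template has a fallback value, &lt;code&gt;str.format()&lt;/code&gt;
never sees a missing key; unknown names simply come back out unchanged.&lt;/p&gt;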
&lt;p&gt;At first, it seemed to me like another way to implement this behavior would be
to provide &lt;code&gt;.format()&lt;/code&gt; with a dictionary that, instead of throwing a &lt;code&gt;KeyError&lt;/code&gt;
when encountering an unknown key, would return a modified version of the key
which was asked for. I tried to do that, and it turns out that doesn't work:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;class FancyDict(object):
    def __init__(self, dictionary):
        self.__dictionary = dictionary

    def __getitem__(self, key):
        try:
            return self.__dictionary[key]
        except KeyError:
            return '{' + key + '}'

    def keys(self):
        return self.__dictionary.keys()

if __name__ == '__main__':
    params = { 'foo': 'this is foo', 'bar': 'this is bar', 'baz': 'this is baz' }
    print('{foo} {bar} {baz}'.format(**params))
    # this is foo this is bar this is baz

    params = { 'foo': 'this is foo', 'bar': 'this is bar' }
    print('{foo} {bar} {baz}'.format(**params))
    # KeyError: 'baz'

    params = FancyDict({ 'foo': 'this is foo', 'bar': 'this is bar' })
    print('{foo} {bar} {baz}'.format(**params))
    # KeyError: 'baz'
&lt;/code&gt;&lt;/pre&gt;
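&lt;p&gt;Calling &lt;code&gt;str.format(**params)&lt;/code&gt; unpacks the mapping into ordinary keyword
arguments using only the keys that &lt;code&gt;keys()&lt;/code&gt; reports, so
&lt;code&gt;FancyDict.__getitem__&lt;/code&gt; is never asked for &lt;code&gt;baz&lt;/code&gt;. &lt;code&gt;str.format()&lt;/code&gt; then
throws a &lt;code&gt;KeyError&lt;/code&gt; if any of the strings in curly braces are absent.&lt;/p&gt;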
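&lt;p&gt;On Python 3.2 and later, the dictionary idea can be made to work with
&lt;code&gt;str.format_map()&lt;/code&gt;, which uses the mapping directly instead of unpacking it.
The sketch below is one possible approach, not what was shown in the talk; the
class name &lt;code&gt;KeepMissing&lt;/code&gt; is hypothetical.&lt;/p&gt;

```python
class KeepMissing(dict):
    """A dict that hands back '{key}' for any key it does not contain."""

    def __missing__(self, key):
        # dict lookups call __missing__ when the key is absent.
        return '{' + key + '}'


params = KeepMissing(foo='this is foo', bar='this is bar')
# format_map() looks names up in params itself, so __missing__ gets a chance.
print('{foo} {bar} {baz}'.format_map(params))
# this is foo this is bar {baz}
```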
&lt;h2&gt;#4 Test all the data&lt;/h2&gt;
&lt;h3&gt;Speaker: Eric J Ma&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Testing data is important because you have some assumptions about it that may
  not always be correct&lt;/li&gt;
&lt;li&gt;He talked some more about how to do that using pytest&lt;/li&gt;
&lt;/ul&gt;
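&lt;p&gt;As a rough illustration of the idea (not taken from the talk), assumptions
about a dataset can be written as ordinary test functions that pytest would
collect. The &lt;code&gt;RECORDS&lt;/code&gt; list and its field names are hypothetical stand-ins for
real data.&lt;/p&gt;

```python
# Hypothetical records standing in for a real dataset.
RECORDS = [
    {'sample_id': 'a1', 'read_count': 120},
    {'sample_id': 'a2', 'read_count': 87},
    {'sample_id': 'a3', 'read_count': 0},
]


def test_sample_ids_are_unique():
    # Assumption under test: no sample appears twice.
    ids = [record['sample_id'] for record in RECORDS]
    assert len(ids) == len(set(ids))


def test_records_have_expected_fields():
    # Assumption under test: every record has exactly these fields,
    # and read_count is always an integer.
    for record in RECORDS:
        assert set(record) == {'sample_id', 'read_count'}
        assert isinstance(record['read_count'], int)
```

&lt;p&gt;If the data ever stops matching these assumptions, the test run says so
before any analysis built on them goes wrong.&lt;/p&gt;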
&lt;h2&gt;#5 Visualizing Yeast ChIP-Seq data&lt;/h2&gt;
&lt;h3&gt;Speaker: Luis Soares&lt;/h3&gt;
&lt;p&gt;I was completely out of my league on the domain of this one, which was something
related to biology.&lt;/p&gt;
&lt;p&gt;It looked like a neat web-based visualization project.&lt;/p&gt;
&lt;h2&gt;#6 Payment reform&lt;/h2&gt;
&lt;h3&gt;Speaker: James Santucci&lt;/h3&gt;
&lt;p&gt;I wasn't super familiar with the domain (statistics).&lt;/p&gt;
&lt;p&gt;The big takeaway was that how we measure value affects how much value we
observe. I'm not sure what that means.&lt;/p&gt;
&lt;h2&gt;#7 Hypothesis: property-based testing&lt;/h2&gt;
&lt;h3&gt;Speaker: Matt Bachmann&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Hypothesis is a Python library inspired by Haskell's QuickCheck&lt;/li&gt;
&lt;li&gt;You put a decorator on your test to say what kind of data it takes&lt;/li&gt;
&lt;li&gt;It works with most testing frameworks&lt;/li&gt;
&lt;li&gt;You write a small amount of code, but get a big amount of functionality tested&lt;/li&gt;
&lt;/ul&gt;</description><author>Samuel Taylor</author><guid isPermaLink="true">https://www.samueltaylor.org/articles/notes-boston-python-22-jun-2015.html?utm_source=rss&amp;utm_campaign=st-blog-feed</guid><pubDate>Sat, 27 Jun 2015 20:44:53 GMT</pubDate></item><item><title>Notes: Boston Django Meetup, Intro to Flask, 25 June 2015</title><link>https://www.samueltaylor.org/articles/notes-boston-django-25-jun-2015.html?utm_source=rss&amp;utm_campaign=st-blog-feed</link><description>&lt;p&gt;On 25 June 2015, Ned Jackson Lovely spoke about Flask at the Boston Django Users
Meetup Group. I took some notes and am putting them here so I don't lose them;
they may or may not be useful to others. These notes are not exhaustive.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;Functions that are decorated with &lt;code&gt;@flaskapp.route()&lt;/code&gt; should return:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;a string&lt;/li&gt;
&lt;li&gt;a tuple of (response, status, headers)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;flask.Response&lt;/code&gt; / &lt;code&gt;current_app.response_class&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;a WSGI callable&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;In theory, your function could return another Flask app, or spit out a WSGI
    callable that generated a massive CSV on the fly and streamed it to the user&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;You can do testing:&lt;/p&gt;
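&lt;pre&gt;&lt;code&gt;def test_splash():
    client = app.test_client()
    response = client.get('/')
    assert response.status_code == 200
    # response.data is bytes on Python 3, so compare against a bytes literal
    assert b'form' in response.data
&lt;/code&gt;&lt;/pre&gt;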
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The Werkzeug debugger is awesome&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Defining filters for templates is possible (and ostensibly simple)&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
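&lt;p&gt;To make the "WSGI callable" option above more concrete, here is a minimal
sketch of a bare WSGI callable that streams CSV rows one at a time. It uses no
Flask at all; the name &lt;code&gt;csv_app&lt;/code&gt; and the tiny table it produces are made up for
illustration, and the hand-written &lt;code&gt;start_response&lt;/code&gt; only stands in for what a
real server would provide.&lt;/p&gt;

```python
def csv_app(environ, start_response):
    """A bare WSGI callable that streams CSV rows one at a time."""
    start_response('200 OK', [('Content-Type', 'text/csv')])

    def rows():
        yield b'n,square\n'
        for n in range(3):
            # Each row is produced only when the server asks for it.
            yield '{},{}\n'.format(n, n * n).encode('utf-8')

    return rows()


# Drive the callable by hand, roughly the way a WSGI server would.
captured = {}


def start_response(status, headers):
    captured['status'] = status
    captured['headers'] = headers


body = b''.join(csv_app({}, start_response))
print(captured['status'])
print(body.decode('utf-8'))
```

&lt;p&gt;Because the body is a generator, the whole CSV never has to sit in memory
at once, which is the point of streaming a massive file.&lt;/p&gt;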
&lt;p&gt;Sessions in Flask are interesting&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Session data is serialized to JSON, cryptographically signed, and set in the
  user's browser as a cookie&lt;/li&gt;
&lt;li&gt;Because it's client side, it doesn't matter which datacenter they end up in&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If you write to a "second level" in &lt;code&gt;session&lt;/code&gt;, you need to set
  &lt;code&gt;session.modified = True&lt;/code&gt; for the changes to get written out:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;session['first']['second'] = 'new thing'
session.modified = True
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Flashing is done with sessions, and is useful for displaying those one-time,
  web app-y messages like "Your post was submitted"&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;On a general Python note, context managers (&lt;code&gt;with&lt;/code&gt;/&lt;code&gt;as&lt;/code&gt;) are really cool -- you only
  have to implement two methods, &lt;code&gt;__enter__&lt;/code&gt; and &lt;code&gt;__exit__&lt;/code&gt;, to get the benefits of them&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Some useful libraries: SQLAlchemy, WTForms&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
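&lt;p&gt;The "serialized to JSON, cryptographically signed" idea above can be
sketched with the standard library alone. This is only an illustration of the
concept; Flask's real implementation differs in its details, and
&lt;code&gt;SECRET_KEY&lt;/code&gt;, &lt;code&gt;dump_session&lt;/code&gt;, and &lt;code&gt;load_session&lt;/code&gt; are hypothetical names.&lt;/p&gt;

```python
import base64
import hashlib
import hmac
import json

SECRET_KEY = b'hypothetical-secret'  # assumption: known only to the server


def _signature(payload):
    # HMAC-SHA256 over the encoded payload, as a hex string.
    return hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()


def dump_session(session):
    """Serialize a dict to JSON and append a signature."""
    raw = json.dumps(session, sort_keys=True).encode('utf-8')
    payload = base64.urlsafe_b64encode(raw)
    return payload.decode('ascii') + '.' + _signature(payload)


def load_session(cookie):
    """Verify the signature, then decode the JSON payload."""
    payload, _, signature = cookie.rpartition('.')
    expected = _signature(payload.encode('ascii'))
    if not hmac.compare_digest(signature, expected):
        raise ValueError('session cookie has been tampered with')
    return json.loads(base64.urlsafe_b64decode(payload.encode('ascii')).decode('utf-8'))


cookie = dump_session({'user': 'sam'})
print(load_session(cookie))
```

&lt;p&gt;The data is readable by the client but cannot be changed without the
secret, which is why any server holding the same key can accept the cookie.&lt;/p&gt;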
&lt;p&gt;Blueprints:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Helpful when your app gets bigger&lt;/li&gt;
&lt;li&gt;They're very similar to &lt;code&gt;Flask&lt;/code&gt; objects with an additional "namespace"&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;app.register_blueprint(bp, url_prefix='/counter')&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;To get GET/POST data, use &lt;code&gt;request.values&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Deploying:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;gevent, gunicorn, nginx, ansible  &lt;/li&gt;
&lt;li&gt;supervisor  &lt;/li&gt;
&lt;li&gt;pip freeze&lt;/li&gt;
&lt;/ul&gt;</description><author>Samuel Taylor</author><guid isPermaLink="true">https://www.samueltaylor.org/articles/notes-boston-django-25-jun-2015.html?utm_source=rss&amp;utm_campaign=st-blog-feed</guid><pubDate>Sat, 27 Jun 2015 20:44:53 GMT</pubDate></item><item><title>Hackathon report: Hack@Teal 2015</title><link>https://www.samueltaylor.org/articles/hackteal-2015.html?utm_source=rss&amp;utm_campaign=st-blog-feed</link><description>&lt;p&gt;My friend &lt;a href="http://matttinsley.org/"&gt;Matt Tinsley&lt;/a&gt; and I were both RA's in Teal
Residential College from Fall 2014 to Spring 2015. We are also both big fans of
hackathons. Matt had the idea for and led the organization of a hackathon for
Teal residents. I helped him out with logistics, and we worked on a project in
the time we weren't doing organizer-y things.&lt;/p&gt;
&lt;p&gt;The first two sections of this article relate to organizing the event, and the
third is about the project I worked on.&lt;/p&gt;
&lt;h2&gt;Successes&lt;/h2&gt;
&lt;p&gt;Lots of things went well.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Learning&lt;/strong&gt; -- Many people during demos said something to the effect of,
  "I've never done anything like this project before today. I learned so much."
  Our faculty master was impressed with how much people were able to learn,
  which was good seeing as he funded much of the event. Education was also a
  huge reason we wanted to have the event, so it was great to see our efforts
  pay off.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Team/idea formation&lt;/strong&gt; -- At the beginning of the event, we had some time for
  individuals who came to the event to find a team. Matt and I stood in a circle
  with the individuals, and we all went around and said an idea we had. This
  format worked well; the four people we had formed two teams. I would recommend
  that organizers bring some simple ideas for people to work on; both of my
  ideas ended up getting used.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Venue&lt;/strong&gt; -- Teal allowed us to use the media room, which had ample space. We
  were also able to bring over plenty of tables from a neighboring classroom.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Google forms&lt;/strong&gt; -- good choice for our voting needs. I could imagine that for
  larger hackathons, there would be too many votes to use Google forms, but it
  was just what we needed for how many participants we had.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Food&lt;/strong&gt; -- We had ample food, and people seemed to like it well enough.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hosting&lt;/strong&gt; -- We had one team who wanted to do some stuff with web
  technologies, but didn't have much of an idea what they were doing. I spun up
  a VPS through &lt;a href="https://www.digitalocean.com/?refcode=3dc2cdcca705"&gt;Digital
  Ocean&lt;/a&gt;, and &lt;a href="http://www.wescossick.com/"&gt;Wes
  Cossick&lt;/a&gt; explained a few basic things to them.
  Like that, they were off!&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Hardships&lt;/h2&gt;
&lt;p&gt;A few things were less than perfect.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Staying through the night&lt;/strong&gt; -- I wish more people had chosen to stay through
  the night. Around 4 or 5 am, the room started to feel dead.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Judging&lt;/strong&gt; -- Our faculty master did a great job of judging, but I think it
  would have been cool to have gotten a panel of judges rather than just one
  person.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;My project&lt;/h2&gt;
&lt;p&gt;Matt and I worked on &lt;a href="https://github.com/ssaamm/emoji-predictor"&gt;Emoji
Predictor&lt;/a&gt;. If you've ever used a
smartphone keyboard that suggests words as you type, you'll understand what it
is. Our project suggests relevant emojis for you to use in your text messages.&lt;/p&gt;
&lt;p&gt;We started with a database of all Matt's text messages. This would be our
"corpus", or the body of text we would use to make inferences about which emojis
should be used with which kinds of messages.&lt;/p&gt;
&lt;p&gt;While making an iOS keyboard would have been really cool, we wanted to make a
proof of concept and focus on the part of the project we found interesting:
getting from a string to the emojis most relevant to it. We decided to make a
web UI. I whipped up a simple application using Python, Flask, and JavaScript
(our code can be found on
&lt;a href="https://github.com/ssaamm/emoji-predictor"&gt;GitHub&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;While I was working on the UI side of things, Matt started working on the
recommendation engine using Python's Natural Language Toolkit (NLTK). While he worked on
that, I decided to work on a different implementation of a recommendation
engine. I loaded all of Matt's sent messages which contained emojis into
Elasticsearch and ran a query on that index using user input. This basic
implementation ended up working decently enough.&lt;/p&gt;
&lt;p&gt;Matt ended up having tons of trouble with Python and Unicode, so for demo
purposes we went with my implementation. I thought our product was pretty neat.&lt;/p&gt;
&lt;p&gt;Because it relied on Matt's personal information, a live demo unfortunately
isn't up anywhere.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Despite a few minor problems, Hack@Teal went very well. I was glad to help, and
I'd love to take part in organizing more hackathons. Because it was a small
event, Matt and I were able to hack on our own project, which was fun and
educational.&lt;/p&gt;
&lt;p&gt;If you want more information (especially about other people's projects), please
see the official website for Hack@Teal 2015, &lt;a href="http://hackteal.me/"&gt;hackteal.me&lt;/a&gt;.&lt;/p&gt;</description><author>Samuel Taylor</author><guid isPermaLink="true">https://www.samueltaylor.org/articles/hackteal-2015.html?utm_source=rss&amp;utm_campaign=st-blog-feed</guid><pubDate>Sat, 27 Jun 2015 20:44:53 GMT</pubDate></item><item><title>How to: remove Etsy search ads</title><link>https://www.samueltaylor.org/articles/how-to-remove-etsy-search-ads.html?utm_source=rss&amp;utm_campaign=st-blog-feed</link><description>&lt;ol&gt;
&lt;li&gt;Install Greasemonkey or Tampermonkey&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Firefox users:
   &lt;a href="https://addons.mozilla.org/en-us/firefox/addon/greasemonkey/"&gt;Greasemonkey&lt;/a&gt;&lt;br /&gt;
   Chrome users:
   &lt;a href="https://chrome.google.com/webstore/detail/tampermonkey/dhdgffkkebhmkfjojejmpbldmpobfkfo"&gt;Tampermonkey&lt;/a&gt;&lt;/p&gt;
&lt;ol start="2"&gt;
&lt;li&gt;Install Etsy Ad Remover&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;a href="https://github.com/ssaamm/greasemonkey-scripts/raw/master/Etsy_Ad_Remover.user.js"&gt;Click this link to
   install&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This script simply removes ads from search results on Etsy. If you're curious,
   check out the &lt;a href="https://github.com/ssaamm/greasemonkey-scripts/blob/master/Etsy_Ad_Remover.user.js"&gt;source code on
   GitHub&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Implementation&lt;/h2&gt;
&lt;p&gt;Etsy search ads have children with a CSS class of &lt;code&gt;.ad-indicator&lt;/code&gt;. It's
literally one line to remove those. This was a fun way to figure out how to make
the web less annoying through using browser dev tools and Greasemonkey.&lt;/p&gt;
&lt;p&gt;If you have any feedback, please contact me at sgt at this domain. Thanks!&lt;/p&gt;</description><author>Samuel Taylor</author><guid isPermaLink="true">https://www.samueltaylor.org/articles/how-to-remove-etsy-search-ads.html?utm_source=rss&amp;utm_campaign=st-blog-feed</guid><pubDate>Sat, 27 Jun 2015 20:44:53 GMT</pubDate></item><item><title>Thoughts on Effective Java</title><link>https://www.samueltaylor.org/articles/effective-java.html?utm_source=rss&amp;utm_campaign=st-blog-feed</link><description>&lt;p&gt;I'm reading the second edition of &lt;em&gt;Effective Java&lt;/em&gt; in a group at work and
writing some thoughts/notes about it here.&lt;/p&gt;
&lt;h2&gt;Chapter 2&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Item 1&lt;/strong&gt; is about static factory methods.&lt;/p&gt;
&lt;p&gt;The leader of my group offered a point that static factory methods can be hard
to mock. I don't have a ton of experience at this time with mocking objects, so
I haven't seen that first hand, but I'll trust him and keep it in mind for the
future.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Item 1, advantage 4&lt;/strong&gt; says that static factory methods are good because they
reduce verbosity in creating objects. The example they give is:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Map&amp;lt;String, List&amp;lt;String&amp;gt;&amp;gt; m = new HashMap&amp;lt;String, List&amp;lt;String&amp;gt;&amp;gt;();
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Java 7 introduced the diamond operator, so this can become:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Map&amp;lt;String, List&amp;lt;String&amp;gt;&amp;gt; m = new HashMap&amp;lt;&amp;gt;();
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;thereby negating the verbosity-decreasing benefit of static factory methods.&lt;/p&gt;
&lt;p&gt;I'm not trying to say the book is wrong. It says "Revised and Updated for
Java SE 6" on the cover, and for Java 6, that seems like a valid argument. I
just think it's interesting how new language features can change what constitutes
a best practice. The book even says, "Someday the language may perform this sort
of type inference on constructor invocations as well as method invocations."&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Item 2&lt;/strong&gt; is about builders.&lt;/p&gt;
&lt;p&gt;In the process of talking about builders, the author talks about the JavaBeans
pattern. This pattern seems like a terrible idea to me; forgetting to set a
required parameter is relatively easy and could have disastrous results. The
Builder pattern seems like a better choice because it's a way to give the
compiler more information. I would rather have my IDE yell at me at compile time
that my object isn't instantiated correctly than wrestle with bugs at run time.&lt;/p&gt;
&lt;p&gt;Builders do introduce more code to write/maintain/test, but (as my group leader
pointed out) the IDE can generate the class for you.&lt;/p&gt;
&lt;p&gt;The book has required parameters going in the builder's constructor and optional
parameters being set by additional methods. My question is: what if the number
of required parameters gets large? Then you haven't solved your problem at all.
One option would be to move the required parameters into methods, but then
you're not providing the compiler with the information to know that some of the
parameters are required. Yes, the &lt;code&gt;build()&lt;/code&gt; method can check for them and
perhaps throw an exception, but that only happens at run time.&lt;/p&gt;
&lt;p&gt;I think that if you have so many required parameters, there might be something
wrong with your design. Perhaps some arguments are logically related and should
be combined into an instance of a new class which binds them together. I'm not
sure if this is always the case, but it seems like it often would be.&lt;/p&gt;
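&lt;p&gt;To make the trade-off concrete, here is a minimal sketch of the arrangement
described above. The &lt;code&gt;Pizza&lt;/code&gt; class and its fields are hypothetical; I made them
up for illustration, and they are not the book's example. Required parameters go
in the builder's constructor, optional ones are set through chained methods, and
&lt;code&gt;build()&lt;/code&gt; does the checks that can only happen at run time.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Hypothetical example class; names are illustrative, not from the book.
public class Pizza {
    private final String size;         // required
    private final String crust;        // required
    private final String topping;      // optional
    private final boolean extraCheese; // optional

    public static class Builder {
        private final String size;
        private final String crust;
        private String topping = "none";
        private boolean extraCheese = false;

        // Required parameters: leaving one out is a compile-time error.
        public Builder(String size, String crust) {
            this.size = size;
            this.crust = crust;
        }

        // Optional parameters: each setter returns the builder for chaining.
        public Builder topping(String value) {
            this.topping = value;
            return this;
        }

        public Builder extraCheese(boolean value) {
            this.extraCheese = value;
            return this;
        }

        // The compiler cannot catch a null argument, so check at run time.
        public Pizza build() {
            if (size == null || crust == null) {
                throw new IllegalStateException("size and crust are required");
            }
            return new Pizza(this);
        }
    }

    private Pizza(Builder builder) {
        this.size = builder.size;
        this.crust = builder.crust;
        this.topping = builder.topping;
        this.extraCheese = builder.extraCheese;
    }

    public String describe() {
        return size + " " + crust + " crust, topping: " + topping
                + (extraCheese ? ", extra cheese" : "");
    }

    public static void main(String[] args) {
        Pizza p = new Pizza.Builder("large", "thin").topping("pepperoni").build();
        System.out.println(p.describe());
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With only two required parameters the constructor stays readable. Add five
more and the builder's constructor has the same problem the builder was supposed
to solve, which is where grouping related arguments into their own class starts
to look attractive.&lt;/p&gt;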
&lt;p&gt;&lt;strong&gt;Item 3&lt;/strong&gt; talks about singletons.&lt;/p&gt;
&lt;p&gt;Our group leader told us to beware of the hidden state that often comes along
with singletons. I have definitely run into that issue. When functions use state
that is not from a parameter, things can get tricky. Knowing what state is used
and how that will affect the execution of the function can be difficult for the
developer.&lt;/p&gt;
&lt;p&gt;While the book says that "a single-element enum type is the best way to
implement a singleton," our leader disagreed. He argues that using an &lt;code&gt;enum&lt;/code&gt; is
less readable. I had a similar gut feeling when I first read this part of the
Item; I would not have thought to use an &lt;code&gt;enum&lt;/code&gt; to implement a singleton. In my
mind, an &lt;code&gt;enum&lt;/code&gt; represents an enumeration of a few different kinds of things; a
singleton is something that there will only ever be one of. These two ideas seem
at odds.&lt;/p&gt;
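&lt;p&gt;For comparison, here is a rough sketch of two of the styles the Item
discusses, side by side. The &lt;code&gt;EnumRegistry&lt;/code&gt; and &lt;code&gt;FieldRegistry&lt;/code&gt; names and the
counter they hold are hypothetical, chosen only to show the shape of each
approach. The counter also illustrates the hidden-state concern: what
&lt;code&gt;next()&lt;/code&gt; returns depends on every earlier call, not on any parameter.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Hypothetical names throughout; this is a sketch, not the book's code.
public class SingletonDemo {
    // Style 1: a single-element enum.
    enum EnumRegistry {
        INSTANCE;

        private int counter = 0; // hidden state shared by every caller

        int next() {
            counter++;
            return counter;
        }
    }

    // Style 2: a private constructor plus a public static final field.
    static final class FieldRegistry {
        static final FieldRegistry INSTANCE = new FieldRegistry();

        private int counter = 0;

        private FieldRegistry() { }

        int next() {
            counter++;
            return counter;
        }
    }

    public static void main(String[] args) {
        // Each registry keeps its own state between calls.
        System.out.println(EnumRegistry.INSTANCE.next());
        System.out.println(EnumRegistry.INSTANCE.next());
        System.out.println(FieldRegistry.INSTANCE.next());
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Both versions behave the same from the caller's side. The second one says
"there is exactly one of these" more directly, which I think is the heart of the
readability argument.&lt;/p&gt;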
&lt;p&gt;&lt;strong&gt;Item 4&lt;/strong&gt; talks about noninstantiable classes. These classes are often used for
utility methods.&lt;/p&gt;
&lt;p&gt;Again, apparently static things are difficult to mock. And Java's garbage
collector has apparently gotten good enough that doing something like &lt;code&gt;(new
FileUtility()).getPermissions(file)&lt;/code&gt; will result in the created &lt;code&gt;FileUtility&lt;/code&gt;
being garbage collected very quickly. This all happens fast enough that there is
very little performance impact.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Item 7&lt;/strong&gt; says to avoid finalizers. I learned C++ in school, so the lack of
guarantees with finalizers throws me off. In any case, I've never heard someone
seriously advocate for finalizer usage.&lt;/p&gt;
&lt;h2&gt;Chapter 3&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Item 8&lt;/strong&gt; is about the general contract for equals. The terms used in the
contract are familiar from discrete math.&lt;/p&gt;
&lt;p&gt;For value objects, you want to override &lt;code&gt;equals()&lt;/code&gt;. &lt;code&gt;equals()&lt;/code&gt; gets tricky when
inheritance comes into play, though. One solution to that problem is to not use
inheritance with value objects -- make your value objects &lt;code&gt;final&lt;/code&gt;. Inheritance
can be useful for business logic, so feel free to use inheritance in that case.
For classes that implement business logic, though, it doesn't make much sense to
implement &lt;code&gt;equals()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;An interesting thing I hadn't thought about is that &lt;code&gt;instanceof&lt;/code&gt; doubles as a
null check; it evaluates to &lt;code&gt;false&lt;/code&gt; when its left operand is null:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;public class Main
{
    public static void main(String[] args)
    {
        String nullString = null;
        System.out.println(nullString instanceof String);
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Output:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;false
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;One suggestion was to use an &lt;code&gt;EqualsBuilder&lt;/code&gt;, like the one supplied in Apache
Commons. Apparently, equals builders will help you avoid NPEs. To me, this
seems like a cop-out. I don't think it's too terribly difficult to write an
&lt;code&gt;equals()&lt;/code&gt; method which won't throw an NPE; perhaps I haven't written enough
Java.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Item 9&lt;/strong&gt; is about &lt;code&gt;hashCode()&lt;/code&gt;. Equal objects must have equal hash codes.&lt;/p&gt;
&lt;p&gt;The book contains an overview of how to write a &lt;code&gt;hashCode()&lt;/code&gt; method that's good
enough. Ground-breaking, cutting-edge, crazy high-performance hashing functions
are going to be class-specific. Sometimes, this is fairly obvious; if your class
has a unique ID, you can just return that as your hash code.&lt;/p&gt;
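&lt;p&gt;Here is a small sketch that pulls Items 8 and 9 together for a &lt;code&gt;final&lt;/code&gt; value
class. The &lt;code&gt;Tag&lt;/code&gt; class is hypothetical. Note that it leans on
&lt;code&gt;java.util.Objects&lt;/code&gt; (Java 7 and later) for the null-safe comparison and the hash
code, which is my substitution rather than the recipe the book walks through.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import java.util.Objects;

// Hypothetical value class, used only for illustration.
public final class Tag {
    private final int id;
    private final String name; // may be null

    public Tag(int id, String name) {
        this.id = id;
        this.name = name;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        // instanceof is false for null, so this also rejects a null argument.
        if (!(o instanceof Tag)) {
            return false;
        }
        Tag other = (Tag) o;
        if (id != other.id) {
            return false;
        }
        // Objects.equals handles a null name on either side without throwing.
        return Objects.equals(name, other.name);
    }

    @Override
    public int hashCode() {
        // Uses the same fields as equals(), so equal objects share a hash code.
        return Objects.hash(id, name);
    }

    public static void main(String[] args) {
        Tag a = new Tag(1, null);
        Tag b = new Tag(1, null);
        System.out.println(a.equals(b));                  // true
        System.out.println(a.equals(null));               // false
        System.out.println(a.hashCode() == b.hashCode()); // true
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Because the class is &lt;code&gt;final&lt;/code&gt;, the inheritance problems with &lt;code&gt;equals()&lt;/code&gt; never
come up, and the null handling fits in two lines without an extra library.&lt;/p&gt;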
&lt;p&gt;&lt;strong&gt;Item 10&lt;/strong&gt; is about &lt;code&gt;toString()&lt;/code&gt;. This method should only be used for debugging
or logging purposes.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;StringBuilder&lt;/code&gt; class gets this wrong--its &lt;code&gt;toString()&lt;/code&gt; method is used for
programmatic access to the string that is being built. It should probably have
another method called &lt;code&gt;build()&lt;/code&gt; to provide programmatic access, and leave
&lt;code&gt;toString()&lt;/code&gt; for logging purposes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Item 11&lt;/strong&gt; is about &lt;code&gt;clone()&lt;/code&gt;. To be frank, I find the &lt;code&gt;Cloneable&lt;/code&gt; interface
confusing, and I haven't run into a good use for it.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Item 12&lt;/strong&gt; is about &lt;code&gt;compareTo()&lt;/code&gt;. The hard thing with &lt;code&gt;compareTo()&lt;/code&gt; is that
it's not particularly explicit about what the "natural" ordering means. By
contrast, a &lt;code&gt;Comparator&amp;lt;&amp;gt;&lt;/code&gt; can have a name which gives developers more
information about how the comparison is done. This explicit information is
probably good.&lt;/p&gt;
&lt;p&gt;As of Java 7, if you break the general contract for comparisons, sorting can
detect the violation, in which case an &lt;code&gt;IllegalArgumentException&lt;/code&gt;
will be thrown.&lt;/p&gt;</description><author>Samuel Taylor</author><guid isPermaLink="true">https://www.samueltaylor.org/articles/effective-java.html?utm_source=rss&amp;utm_campaign=st-blog-feed</guid><pubDate>Sat, 27 Jun 2015 20:44:53 GMT</pubDate></item><item><title>How to: get free WiFi at coffee shops</title><link>https://www.samueltaylor.org/articles/how-to-get-free-wifi.html?utm_source=rss&amp;utm_campaign=st-blog-feed</link><description>&lt;p&gt;Some coffee shops place time limitations on their WiFi. For example, I recently
went to a Panera that had a 30-minute time limit on their WiFi during lunch
hours.&lt;/p&gt;
&lt;p&gt;Getting around such limits isn't difficult. I'm not sure how ethical it is to do
so, so consider this all merely educational information.&lt;/p&gt;
&lt;p&gt;It seems these kinds of systems track you based on your MAC address. If your MAC
address changes, the system thinks of you as a new user. Changing your MAC
address is easy enough (on Linux):&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Figure out what interface you're using.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Run &lt;code&gt;ifconfig&lt;/code&gt; and look for the one that looks right. Mine was &lt;code&gt;wlan0&lt;/code&gt;.&lt;/p&gt;
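&lt;ol start="2"&gt;
&lt;li&gt;
&lt;p&gt;Change your MAC address (as root):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ifconfig wlan0 down
ifconfig wlan0 hw ether a2:b2:c3:d4:e5:f6
ifconfig wlan0 up
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The first octet of the new address should be even; an odd first octet marks a
multicast address, which the interface will reject.&lt;/p&gt;
&lt;p&gt;For ease of changing, you can make a script that looks something like:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
# Usage (as root): ./change_mac.sh NEW_MAC_ADDRESS

ifconfig wlan0 down             # take the interface offline first
ifconfig wlan0 hw ether "$1"    # set the new MAC address
ifconfig wlan0 up               # bring the interface back up
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And then run it like &lt;code&gt;sudo ./change_mac.sh a2:b2:c3:d4:e5:f6&lt;/code&gt;.&lt;/p&gt;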
&lt;p&gt;If you want to get in touch, email me at sgt@samueltaylor.org. Thanks!&lt;/p&gt;</description><author>Samuel Taylor</author><guid isPermaLink="true">https://www.samueltaylor.org/articles/how-to-get-free-wifi.html?utm_source=rss&amp;utm_campaign=st-blog-feed</guid><pubDate>Sat, 27 Jun 2015 20:44:53 GMT</pubDate></item><item><title>Hackathon report: HackTX 2014</title><link>https://www.samueltaylor.org/articles/hacktx-2014.html?utm_source=rss&amp;utm_campaign=st-blog-feed</link><description>&lt;h2&gt;Idea&lt;/h2&gt;
&lt;p&gt;Most college students are aware that registering for classes can be a real pain.
At Baylor, we don't have a waitlist on many of our classes. If you fail to get
into a class, you are doomed to logging in several times a day and hoping a seat
is open. This process is a waste of time; we wanted to automate it.
&lt;a href="http://coursewatch.me/"&gt;CourseWatch&lt;/a&gt; makes registering for classes suck less.&lt;/p&gt;
&lt;p&gt;We thought about creating a mobile app that would run in the background on
users' phones. Such an app would require no infrastructure on our end, which
would be great. However, it would require users to download the app and enter
their login credentials. Instead, we decided to go with an SMS-based solution. Users
would text us a course number, and we would text them once their class opened
up. Because signing up was as simple as sending a text, this option would have
low friction, which is good for user acquisition.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.hackerleague.org/hackathons/hacktx-2014/hacks/course-watch"&gt;Here's the project page on
HackerLeague&lt;/a&gt;,
which has some basic information about the project.&lt;/p&gt;
&lt;h2&gt;Project&lt;/h2&gt;
&lt;p&gt;I set up a private GitHub repository on my account and added Wes and Evan as
contributors.&lt;/p&gt;
&lt;p&gt;After we found a work space and set up our laptops/devices, Wes started
working on creating the screen scraper, Evan started researching other colleges'
registration systems, and I started setting up our server. We decided
to use Microsoft Azure because the Azure team was offering free hosting. I set
up an Ubuntu server with Apache, MySQL, and PHP.&lt;/p&gt;
&lt;p&gt;Wes helped me set up &lt;a href="http://dploy.io/"&gt;dploy.io&lt;/a&gt; to deploy our code
automatically from the master branch of our repository. I'd not used a
continuous deployment service before and was pleased by the ease with which it
let us deploy our code. A pitfall with using this service was that because our
changes to master were automatically deployed, it was easy to get sloppy and
push untested code to master in order to test it on our server. This problem is
our own fault and only requires more discipline to fix. Within a hackathon,
though, it was not a major issue.&lt;/p&gt;
&lt;p&gt;Once the server was set up, I got to work handling inbound SMS messages. I had
experience with using Twilio for SMS from HackTX 2013 and chose to use it again
because it's easy to use and inexpensive.&lt;/p&gt;
&lt;p&gt;After a few minutes of work, Evan finished with his research and wanted to
continue helping. We didn't have a clear design at that point, so I didn't know
what he could work on. As Wes continued to work on the screen scraper, I spread
out a napkin and started drawing the system out. Through this process, I
identified three main areas: notification subscription (which I was working on),
screen scraping (which Wes was working on), and notifying users. Since
nobody was working on notifying users, Evan started working on that.&lt;/p&gt;
&lt;p&gt;Evan had never used PHP before, so he needed some direction. He sat at the
keyboard, and I sat next to him. Within an hour or two, he had figured out
enough PHP to be dangerous, so I moved to work on subscription through SMS.&lt;/p&gt;
&lt;p&gt;Wes sort of reverse engineered BearWeb, which was an interesting and tedious
process. He opened up &lt;a href="http://www.charlesproxy.com/"&gt;Charles&lt;/a&gt; and clicked around
Baylor's internal course registration website. He would then perform the same
requests in PHP using curl. After several moments where everything seemed
hopeless, he eventually got everything figured out.&lt;/p&gt;
&lt;p&gt;A little after midnight, we had version one done. You could text in a course
number and would get notified a few minutes later that it was open (registration
wasn't open yet, so all seats were open). Wes then set to creating a website for
the product while I worked on adding SendGrid integration so that users could
sign up for notification through email. The SendGrid API was also a pleasure to
work with.&lt;/p&gt;
&lt;p&gt;By around 3:00, the website was looking good enough and we had gotten SendGrid
integration working. At this point, we decided to get some rest. We napped in
the hallway for a few hours then got up to figure out our presentation.&lt;/p&gt;
&lt;h2&gt;Presentation&lt;/h2&gt;
&lt;p&gt;HackTX 2014 did "science-fair style" presentations, which I liked. Each team had
a spot on a table, and a number of judges came around to check out each project.
During this time, participants were encouraged to walk around and check out each
other's projects. I was able to check out a few of the projects near us, but did
not spend much time looking around; I wanted to get feedback on what we had
made.&lt;/p&gt;
&lt;p&gt;We received a lot of positive feedback from judges and other participants.
People liked the high level of polish in our product and presentation. They also
believed we were solving a real problem in a good way. Unfortunately, we
weren't chosen to present during the closing ceremonies. On the plus side, we
won three sponsor awards:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;TripAdvisor prize -- 3-day trip to Boston&lt;/li&gt;
&lt;li&gt;Best use of Microsoft Azure -- Dell XPS U12 laptop&lt;/li&gt;
&lt;li&gt;Best use of SendGrid -- Jawbone MINIJAMBOX&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Summary&lt;/h2&gt;
&lt;p&gt;HackTX 2014 was great. We had fun, built a great product, and won some awesome
prizes.&lt;/p&gt;</description><author>Samuel Taylor</author><guid isPermaLink="true">https://www.samueltaylor.org/articles/hacktx-2014.html?utm_source=rss&amp;utm_campaign=st-blog-feed</guid><pubDate>Sat, 27 Jun 2015 20:44:53 GMT</pubDate></item></channel></rss>