Saul Pwanson - It's not hoarding if it's organised

Saul Pwanson became an accidental investigator when his attempts to digitise crossword puzzles into a unified format uncovered a plagiarism scandal in the crossword puzzle community.

“I think that it's really important that we have more independent investigators, or you call them ‘citizen investigators’, people who have the time and the space, like I did. I was willing to put months into something that any actual journalist, who's being paid, there's no way that their editor would approve them to be paid for months to do this kind of work.”

About the speaker

In 2016 Saul Pwanson designed a plain-text file format for crossword puzzle data, scraping tens of thousands of crosswords from various sources. Having access to this data he discovered plagiarism by a major crossword editor that had gone on for years. The scandal got coverage in the crossword puzzle community.

Transcript

Something smells funny. That's the clue to dig on it. If it doesn't seem right, something isn't right. Or at least, when you do the exploration or the investigation, you'll learn something – whether it's because there's literally something going on there, or because your processes or your own understanding of the subject matter is incorrect.

Hi, my name is Saul Pwanson. I do data analysis and data tooling kind of stuff. Couple of years ago, I stumbled across a crossword scandal, accidentally. And I became an accidental investigator. So there was an uproar for, probably a week or so actually, with lots of activity on Twitter; it was #gridgate. It was kind of an open and fun scandal. I think that everyone loves a scandal, and especially when it's a trivial scandal, it's really easy to get behind. It's like, no one's really been hurt by the plagiarism of crosswords. And a lot of people do crosswords, from all over the country and all over the globe, and so it went far and wide.

I mean, I've liked the crosswords for a very long time. I used to do the crosswords with my grandmother and I started getting into crosswords more seriously, started making them for myself. And the more that you dive into crosswords, the crossword world, the more you realize that it's an art form. So I started creating them myself and wasn't very good, of course, because that's how it goes. And I decided to try to get better at it. And so I wanted to investigate, or I just wanted to gather some data and try to figure out how I could be a better crossword constructor. And the thing that stymied me was that there was a lot of data out there in the world, and you could see it – you can see that it's out there, but you can't access it yourself. And that kind of thing frustrates me. And so I decided to take it upon myself to organize the data, to get it into a form that I could use it, and was readily downloadable and explorable. I was actually developing a thesis that proper organization of data was the key to data analysis and exploration – that if you get data into the right form, then people can run with it and they can discover things they never would have thought possible. It's really, it's a lot easier to find things when you do undirected exploration of a quality and well-formed dataset than trying to do directed exploration on a difficult or ungroomed dataset.

I drove myself thinking that this would be a gift to future crossword scholars. Everybody who wants to do their independent analysis shouldn't have to do the same amount of work every time. We should pool our resources, if you will. And to some extent, this has happened. Like there was a person who led a crossword litzing effort, which is taking old crosswords that were printed only in papers and digitizing them, turning them into an electronic format. And that was thousands of hours of work right there, and by virtue of their efforts they got all the New York Times crossword puzzles since the very beginning, since the 1940s into digital format. And I would not have been able to do that work – the work that I did myself – if it hadn't been for their work. And so I feel like my work was just one layer on top of that, which was again, taking that and instead of having it be in a silo where nobody else could use it, putting it into a more usable and centralized form. There's a main site that's got the New York Times crossword puzzles on it. And so you can go there, and you can find any of the New York Times crossword puzzles. But it's kind of like drinking through a straw in terms of getting the data. Like, I want the firehose. I want the full–– I want the jug and not just have a sip one piece at a time.

I have this belief that it's not hoarding if it's organized. And so, given that I had all this data, I feel like all I have to do in order to justify that is to organize it and make it into this clean arrangement. And then I can consider it a museum or an archive of some kind. Webpages are no fun to work with. Well, to view, they're great. But if you want to actually process them, they're kind of a pain. And so you have to parse them. And it took quite a bit of work to take the webpage and convert it into a plain text file. I designed a plain text file for it and I thought it was very important to have it be excruciatingly simple. It's like, you've got these papers that you've organized and you kind of lay them out on the floor, and you kind of rearrange them, and you kind of shuffle them a certain way, and then if they're organized well you can put them in a certain configuration and then you kind of just rifle through them. And when you see the flipbook that's on the margins, for instance, and it's out of order, and you're like, ‘well, that seems like there might be something very’ . . it's kind of compelling to bring you to try to figure it out and sort it out further. It was actually very interesting once I had done that. I hadn't actually done any investigation until I had that process largely figured out. But then once I had that collection of crosswords like that, the investigation itself took basically a weekend.

The collection itself has about eighty thousand crosswords at the moment. And at the time it only had – well, ‘only’ had – about fifty thousand crosswords. I did some back of the envelope calculations at the time, and I figured that there are on the order of a million crosswords that have ever been published. And maybe I'm way off, maybe it's two or three million, but it's far fewer than ten million. And if you think about a couple of million, that's a lot of crosswords, but eighty thousand is actually a substantial percentage of those. If there are a million crosswords that are published, this is some 8 percent. I actually think that it's feasible that we could collect all crossword puzzles that have ever been published. I mean, you know, there may be some crosswords that have been lost to the sands of time, but in general, I think that's actually a feasible thing to approach that limit. It's not important with a capital ‘I’, but I think it's kind of important from a data perspective to try to be completionist.
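The back-of-the-envelope estimate above can be checked with a few lines of Python. This is only an illustrative sketch: the collection size and the one, two and three million totals are the rough figures mentioned in the interview, and the function name `coverage_share` is made up for this example.

```python
def coverage_share(collected, total_published):
    """Return the fraction of all published crosswords that have been collected."""
    return collected / total_published


collected = 80_000  # approximate size of the collection, as stated in the interview

# Try the speaker's central guess and his two higher guesses for the total.
for total_published in (1_000_000, 2_000_000, 3_000_000):
    share = coverage_share(collected, total_published)
    print(f"{total_published:>9,} published: {share:.1%} collected")
```

With a total of one million, this gives the 8 percent quoted above; if the true total is two or three million, the share falls to roughly 4 or 2.7 percent.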

My partner does not understand. She thought that I was kind of nuts for spending all this time with crossword puzzles. You know, crossword puzzles are a complete diversion. Like what is the point of this? And I can't really explain it. I don't know, except for, something got under my skin. And I think that's maybe the key, is that it's really hard to motivate yourself if something isn't under your skin like that. But if it is, all you can do is not do it, and that's disappointing. Like you can't force yourself to do something that you're not. . . that doesn't irk you, doesn't get you like that. And this one got me and I decided to run with it because that's the only way I can actually get motivated on things. So I kind of find the ones or listen to the ones that aggravate me, if you will.

The first inkling at all was, I had done some very simple queries over it to try to see if there's any... how common were duplicate patterns in the grids? And I was just kind of screwing around just to kind of get a sense of what this looked like. There actually are some really interesting examples, besides the scandal that emerged from this, of outright plagiarism, people who had taken puzzles from thirty years ago and re-clued them – basically submitted the exact same grid, but with different clues. I basically tugged on that thread, and one editor kept coming up very frequently. And you kind of assume that the plagiarists are going to be individual constructors. The way that the crossword world works is that the editor–crossword publications publish something either once a week or once a day in their publication. And that's a huge amount of work, and one person couldn't create all those puzzles. And so what they do is they invite the community to submit puzzles and they pay them for their work, some token amount of money. And so there's constantly a queue of people submitting crosswords into the system. And those people, in modern times anyway, get a byline on there and that's part of the metadata. So I expected, because of the ones I saw from the New York Times, for instance, they were individuals. And I saw one person that had done this three or five times and like, ‘okay, I can see that person's, I can see what they're aiming at’. And what was interesting about this other one was that it was an editor that was consistently being involved here. And actually, I didn't know very much about the crossword industry at this point. And the guy's name is Timothy Parker. And Timothy Parker is kind of a generic name. I didn't even know he was a real person. He's edited a lot of crosswords and he was on several publications. And so it seemed like maybe it was the pseudonym for a cabal of constructors, or something like that. 
And the thing that really made me tingle, if you will – and I remember that moment. It was when I saw that there were two puzzles and I had them side-by-side, that were very, very similar. It was clear evidence of somebody who was reconstructing or reusing an old puzzle. And the bylines, the names on the puzzles, were different. And the second name was Elizabeth Gorski. And she's a well-known constructor in the crossword community, and well loved. Just didn't seem right. It was like, well why would. . Elizabeth Gorski is not a plagiarist. Or if she is, this is a huge deal! Like, I couldn't even figure it out. But then there were so many other examples that involve this person that didn't involve Elizabeth Gorski – that's something else that also seemed funny – and the original author was Tim Burr, which is a pseudonym, kind of a funny name, right. It just really didn't sit well with me. Every explanation I came up with was–– it just doesn't work. But what was very interesting about this, if we skip to the end of how this turns out, I actually didn't discover – or this first clue wasn't evidence of the actual plagiarism – as always, it was evidence of the coverup. This was actually the last in a long line of puzzles that started with her original puzzle. And he had ripped it off three or four times. And then finally was running reprints of older puzzles, legitimate older puzzles, just as they were, with their original name. It was, like I said, valid. But I had the previous, um, plagiarized puzzle in the collection and then this, newer reprint in the collection, and that completely set off my alarm bells.
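The "very simple queries" for duplicate grid patterns described above could look something like the following sketch. It is not the code used in the actual investigation: the grid representation (a list of strings with `#` for black squares), the puzzle identifiers, and the function names are all assumptions made for illustration.

```python
from collections import defaultdict


def grid_pattern(rows):
    """Reduce a filled grid to its black-square pattern ('#' = black, '.' = any letter)."""
    return "\n".join(
        "".join("#" if ch == "#" else "." for ch in row) for row in rows
    )


def find_duplicate_patterns(puzzles):
    """Group puzzle ids that share exactly the same black-square pattern."""
    groups = defaultdict(list)
    for puzzle_id, rows in puzzles.items():
        groups[grid_pattern(rows)].append(puzzle_id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


def letter_overlap(rows_a, rows_b):
    """Fraction of white squares holding the same letter in two same-shaped grids."""
    total = same = 0
    for row_a, row_b in zip(rows_a, rows_b):
        for a, b in zip(row_a, row_b):
            if a == "#" or b == "#":
                continue  # skip black squares
            total += 1
            same += a == b
    return same / total if total else 0.0


# Tiny made-up grids; real puzzles are typically 15x15 or larger.
puzzles = {
    "pub-a-1985": ["CAT", "A#O", "BED"],
    "pub-b-2010": ["CAT", "A#O", "BEG"],
    "pub-c-2011": ["#AT", "ARE", "TE#"],
}

for ids in find_duplicate_patterns(puzzles):
    print(ids, letter_overlap(puzzles[ids[0]], puzzles[ids[1]]))
```

A shared pattern on its own proves little, since common grid shapes get reused legitimately. It is the combination of an identical pattern, a high letter overlap and a different byline that would be worth a closer look by hand.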

I knew that I had something. I knew that I wanted to present something to the crossword community. And this is where I was talking about organization, and I think of things in terms of organizing data. But organizing people is – and being in touch with people or the subject-matter experts – is key I think to this exact kind of investigation. This would not have gone anywhere if I was not connected, at least to the mildest degree, to the crossword community. I didn't know what they would think about it, but I thought they might be interested. And it was actually kind of a way in to the crossword community. It's like, ‘Hey, I found this neat result, right? This is interesting stuff’. And I thought it was just kind of like more of a curiosity. And then instantly though, after I posted that to the list, I got some high-profile constructors saying, ‘oh, hey, wait a second. Just, you know, go easy because this may be a big deal, and we don't want to launch accusations unnecessarily’. But this smelled funny to them too. Everybody was interested in how their own puzzles had been plagiarized. And so there was kind of like, people were indignant or offended that they were involved or, you know, that their puzzles were involved in this. And one of those people, Evan Birnholz, posted to Twitter about it. And it kind of just had an aggravated, like, ‘ah, what's this guy doing then’. And somebody – Ollie Roeder from FiveThirtyEight – saw that and then reached out. And he's the one that drove the actual scandal, both the narrative and the story, and did some, a lot, of the groundwork, the legwork to validate the stuff that I had seen, and to confront Timothy Parker, and to gather a lot of quotes, and then ultimately publish the story, which is what got all the attention.

There was a moment during that first weekend, before I had said anything to anybody where I discovered this crossword editor, and I discovered that he had a Wikipedia page and that he was a real person. And there was this picture looking back at me. And I realized that if what I was thinking was true, that I was potentially going to ruin this person's career, that the work that I had done was going to do that. It turns out that Timothy Parker is a well-known crossword editor and has been editing and publishing crosswords for about twenty-five years and is not a well-loved person in the crossword community. He's had run-ins with many people and doesn't come off looking all that great. So it wasn't actually a surprise to anybody in this, that for years he'd been taking puzzles that he actually had licensed to, to use. That was the very interesting thing – this is called self-plagiarism. He actually took puzzles that he had edited previously, and then republished them, changing the byline and changing small aspects of the puzzle, to in some cases maybe bring it a little more up-to-date, but in some cases just trying to cover up his own tracks, and then republishing them. And so he logged them–– he had logged himself as having the Guinness World Record for the most syndicated constructor, for instance. So he had a big ego about how many crosswords he had managed to publish. And it turned out that for a period of time, about 2008 to 2012 or so, some hundreds of puzzles that he had published were exactly like this. And so it calls into question his entire crossword construction or crossword editing credibility. I think that if he had just chosen to republish the crosswords, done mild edits but kept the same by-lines, that he basically would be in the clear. You know, it's maybe not great, he's kind of a hack, people take issue with the kinds of crosswords he constructs because they're not very well done in some sense. But I don't know, this is just business, right? 
And it's simply the act of changing the byline in the first place is where the line gets drawn, where people have put dozens of hours into a single crossword often, and then submitted it and they’re not going to get paid anymore if he republishes it – it's not about the money. It's not about that. It's about artistic integrity and taking credit for other people's work. And this is where, if you think of crosswords as disposable and not for, and just like a Sudoku puzzle, for instance, then there's nothing really here at all, but they're not Sudoku puzzles. There’s more of an art here.

Of course, everybody denies it. And Timothy Parker says, this is not. . . must just be an accident – which is laughable, on the face of it, when you look at the actual data. Ultimately he was. . . he stepped down from being the editor of USA Today. But what's interesting though, is that there was a big scandal and he did have to step down from that, but his own crossword publishing company still went fine and he's still publishing and editing crosswords. It really did not even destroy his career, because really, you know, a couple of months went by and it kind of blew over. I maintain a website, xd.saul.pw, on the crossword thing, and the front page just shows the crosswords that I’ve collected and shows duplicates or potential duplicates in a certain colour format. And you can see very loud and clear where this scandal happens. And the idea is that if somebody was to do something like this in the future, that it would show up on this grid very quickly. That's the idea – it could become, could be a kind of watchdog in this sense.

I don't like working under time pressure. I feel like I don't do my best work under time pressure. This article was coming out in basically a week from the time that I gave the information to the crossword list. I doubled down on it and really had to – I was going to – I did a batch of crosswords in the morning and I went to work, I had another batch in the evening. I had a whole process and it kind of consumed my life for that amount of time. So there's a buzz, it's kind of a, yeah, a mania. When I'm the only one that's feeding the mania, I can kind of calm myself down and say, okay, this is just, you know, I'm kind of used to that, and I kind of take it a little easier and take a break or whatever. But when the world is feeding into that it becomes a lot harder, and I'm not even sure if you should not go into that as much as possible, but it definitely consumed my life for a week or two. I think I did a pretty good job, in general, but I learned a lesson from this: don't release your results until you're. . . until you have results that you're comfortable with the press having access to, the press publishing themselves, because you're not going to get a chance to do it better. Everything only has its one chance in the spotlight. There's not going to be a second gridgate or a second revisiting of this. I've actually even tried talking to Ollie and saying, ‘can you do an update on it?’ and there's just no interest. And so you have your one shot and so you better make sure your ducks are in a row and that you've gotten your stuff right and you've got all the information that you can get out of this or that you think is important out there in the first place.

Once people smell blood in the water, there's a tendency to claim it for themselves, if you will. And definitely in the case of journalists, I think that's – I'm sorry, I know actual published journalists, that's their whole bread and butter is to become famous for whatever article that they wrote – and so I think it was more taken from me than yielded in that sense. But I definitely did and do feel an ownership of it. Like if it hadn't been for the stuff that I had put together, this never would have come up.

I don't really have the heart or the stomach for frontline investigative work. And so actually the stuff that I think of myself as doing is making tools that enable investigators. And so in the case of the crossword puzzles, the tool that I was making was a crossword corpus, a collection of crossword puzzles that would be valuable to people who wanted to do their own investigations into it. And it just turned out that I had done such a good job making this, that I kind of nerd-sniped myself into doing the actual investigative work. It was so. . . it was right there, and it was easy enough, and then I just kind of fell into it because there was low-hanging fruit sitting right there. I don't think of myself as an investigator. And yet when other people claim credit for the investigation, I felt myself getting a little aggravated at that. Like, ‘no, I did actually the most of the work here’. I mean, to Ollie’s credit, he did a lot of the verification. He did confront Timothy Parker and he did publish the article, but the actual discovery of this thing was me. And I wanted credit for that.

I have a little more cynicism or skepticism of the media because I saw how they took actually a fairly nuanced situation and blew it up into a scandal that was very clear-cut. There were bad guys, an all-bad guy and they want to take him down, and this whole other narrative emerged. And I don't even think that's strictly wrong, but there's very little effort – and this is just how the media is structured – very little effort to explore the nuance and try to figure out what really is the line here. There was not very much. . . I feel like a true investigative journalist would have tried to really figure out the meat of this, instead of just exposing a scandal. That's been kind of disappointing. When I first saw the article, I had that opinion. And then other, of course, journalists reached out to me in the days following the article being published, but they weren't interested in this, the nuances, either. They wanted the little sound bites, so they could do a rehash of the original article. Everyone's on deadline, everyone’s got three articles to produce this week, and so, you know, just please write back with a little cute snippet and then we'll publish that. And so, yeah, I feel like a little bit more world weariness, unfortunately – and maybe I shouldn't say unfortunately, this is just how the world works. And so I want to make sure that I'm more diligent and more careful and more ready to deal with the world coming with their own preconceptions, when they do. So I think I'm less naive and hopefully more rigorous or more – definitely more empathetic when I read articles that are trying to present a certain worldview, like realizing that there’s a whole deep story and 99 percent that's invisible underneath the surface. And this is true of many fields. 
Um, for instance, software is very similar to this, where there is being a software developer and there's being a professional software engineer, and you're not allowed often to do your best work or even decent work sometimes. You're constrained by the forces in the market and the situation that you find yourself in. I think the same thing is true of journalists. They aren't given the space and the time to really do the proper investigations. So they just can't. I think that it's really important that we have more independent investigators, or you call them ‘citizen investigators’, people who have the time and the space, like I did. I was willing to put months into something that any actual journalist, who's being paid, there's no way that their editor would approve them to be paid for months to do this kind of work. It's not going to happen if someone like me doesn't do it. And then I think it's important for us to try to get journalists to do better in presenting the information that they come across. But yeah, I feel like it's rare these days and, because of the constraints of actual journalism, for journalists to be doing sincere, deep investigations. There is a way of presenting data, or whatever thing you're looking at, and presenting data such that it becomes obvious what you're looking at. That's a huge skill and super important. But it's really easy to massage statistics and numbers to prove whatever point you want to do. This is kind of well-known, it makes people distrust numbers and statistics. But if you do it right, it doesn't even–– even with that cynicism, unless you're actually making up the data wholesale, which should be its own scandal, if that ever happens, whenever that happens – but aside from that case, there is a way that you can present data that it becomes just blindingly obvious that something is going on, this is what's going on. And then you can let people draw the conclusions for themselves. 
In which case, you’re kind of leading people into the investigation as opposed to giving them the conclusion and telling them, ‘well, the P value is this, and so therefore this must be true’. I think we're still in the very early days of journalists using data in this kind of way. I think that probably the best way that this happens is that journalists become better at expressing the data, rather than finding the data or interpreting the data in the first place. And that, given that there is this conclusion that someone who has delved deep into it has come up with, ‘okay, how can we share this in a way that makes it clear to people who aren't data people’. So journalists being the bridge between the hardcore data wonks and the general public who either doesn’t care about statistics, or just doesn't have the background to understand it.

I believe that investigation is a fundamental skill, like reading or math, that everybody is an investigator or should be an investigator. And the fact that we don't think of ourselves that way is . . . an investigation is, in my mind, listening without bias, trying to seek the truth, independent of an outcome. I'm a big fan of undirected exploration. I don't want to be trying to nail somebody or find the dirt on this thing. It's like, no, let's really look at the evidence we've collected, the mass of data and facts, and see what there is there. Discern the relevant from the trivial. Try to ask the right questions, the ones that the right people – which are probably not the ones that you're thinking of – and asking them at the right time, once you have a solid understanding of what you've got, what you need and with the right attitudes that they join you in your quest for truth. I think persistence in investigation is really important, that if it smells funny, you need to investigate and not just write it off, that you'll learn something either way. So I feel like this is actually a very important thing and not just out in the outer world, but in terms of your innermost mind and thoughts and your personal life, and just kind of everything would benefit from more of a sincere investigation into what's literally there.

Credits

The Exposing the Invisible podcast series is produced by Tactical Tech. 

Interview, production and sound design by Jo Barratt.

Tactical Tech's Exposing the Invisible team includes Laura Ranca, Wael Eskandar, Marek Tuszynski and Christy Lange.

Music by Wael Eskandar.

Additional music is November by Kai Engel used under a Creative Commons 4.0 Attribution licence and Waltzing in the Rye by Kai Engel used under a Creative Commons non-commercial 4.0 Attribution licence.

Illustration by Ann Kiernan


This podcast episode is part of a series of resources and publications produced by Exposing the Invisible during a one-year project (September 2020 - August 2021) supported by the European Commission (DG CONNECT)

This content reflects the author’s view and the Commission is not responsible for any use that may be made of the information it contains.

