Archive for the ‘code’ tag
450M lines of code say large open source and small closed source software projects are worst quality
The good news is that software keeps getting better, with fewer than one error per thousand lines of code. The bad news is that both large open-source projects and small proprietary software projects tend to have worse quality than average.
Development testing service Coverity’s annual scan report, which is based on data from almost 500 software projects with a total of over 450 million lines of code, says that almost 230,000 defects were found and fixed. And while the average defect density per thousand lines of code was almost identical between open source and proprietary, there was an interesting divergence in the results.
Open source projects, Coverity says, tend to have .69 bugs per thousand lines of code, virtually the same as proprietary software, which tends to have .68 errors per thousand lines. But large closed-source projects (over one million lines of code) tend to have 33 percent fewer errors than small closed-source projects, with .66 errors per thousand lines versus .98 in smaller projects. And large open source projects have a massive 70 percent more errors than small open source projects, with .75 defects per thousand lines versus only .44.
The difference, according to Coverity, is that small open source projects are labors of love by individual developers or small teams, who carefully comb through their code to reduce errors. Large open source projects, on the other hand, tend to lack standardized processes to ensure code quality, and so the error rate increases.
In commercial or closed-source software, developers experience almost opposite conditions. Large projects tend to have well-defined formal testing processes, which ensure higher code quality, and small projects tend to be hasty, quick endeavors that show the effects of growing pains, as no standardized testing is in place.
In other words, if you are looking for bug-free apps, look for a small open source project or a large proprietary piece of software, because those have the best chance of having few defects and high overall code quality.
All of the data in infographic form:
photo credit: gui.tavares via photopin cc
Filed under: Big Data, Dev, Security
what the crapcha is this?
Oh CAPTCHA – the bane of every avid sweepstakes gamer. Now you can really annoy people with impossibly impossible codes with CRAPCHA. Maybe not on any sweepstakes, but it is embeddable on any website.
RCC: NLE ad about creation not offensive
The Reclame Code Commissie (the Dutch Advertising Code Committee) has ruled that the commercial ‘Ik zeg zon’ by the Nederlandse Energie Maatschappij is not needlessly offensive or contrary to good taste.
Bicycle Commute with a Suit [Commuting]
Commuting to work via bike is a great way to get exercise and save on gas money, but if your office has a strict dress code it may seem like you can’t both ride a bike and wear a suit. If you’re willing to take a few minutes and pack a more formal change of clothing, you can look sharp at the office and still bike commute.
How to: Mine server logs for broken links
I’ve broken this out into lots of steps. You could do it all in one or two steps with a shell script or other geekery. I wrote this to keep each step simple, and get you into Excel as quickly as possible, instead.
I’ve railed about fixing broken links for years now. I’ve presented webinars about it, talked about all manner of fancy tools and generally made myself a pest.
What I’ve never done, though, is shown folks how they can quickly find those busted external links using basic tools. So, here goes:
Why bother?
With a log file, you can find broken external links that Google hasn’t. Google Webmaster Tools only shows you broken links found by Google. GWT ignores:
- Old broken links that Google assumes are no longer relevant;
- New incoming links that are broken, from sites like bit.ly;
- Broken social media links, if they’re not driving many clicks.
Don’t you want all those links from Twitter? How about all the old .edu links you used to have, but lost when you took down the target pages?
Hell yes. Here’s how you can find them using your log files:
Get your tools together
If you’re using OS X or Linux, you have everything you need except, possibly, a spreadsheet program. Google Docs will work, or OpenOffice for big files, or Excel for the coolest stuff (like pivot tables).
If you’re on Windows, you’ll want to install CYGWIN—that gives you all of the command-line tools I talk about in this post.
1: Get access to the log files
If you run your own site, you can download the log files yourself. Otherwise, though, you’re going to have to ask someone else to get ‘em for you, and that’s rarely popular. Here’s how you can make the process less painful:
- Explain why you need them: To improve sales. Log files will give you the best potential linking ‘wins’, and reveal the biggest site indexation problems.
- Explain the value: The log files will let you more accurately spot ‘big two’ issues (links and indexation) than any other method. Both have huge implications for site traffic. Which has huge implications for sales.
- Explain exactly what you need: Don’t just ask for ‘the log files’. Let them know you just need a 5-10 day slice of the files or, if the site’s really busy, just a day or two.
- Provide them an easy secure location to upload the zipped files. An FTP or Dropbox folder should work fine, and it saves them a step.
- Assure them you’ll delete the logs the moment you’re done.
The key here: Make this an easy process. The first concerns of whoever you ask for the files will be: “Is this a lot of work for me?” and “Is this a security issue?” Answer those concerns before they’re raised.
1b: If you can’t get the files
I’ve spent weeks, literally, trying to get log file access from a client. Usually, that’s because no one knows what I’m talking about. If you run into this, try these steps, in this order:
If the site’s located with a hosting company:
- Read the company’s tech support docs. You may find the information you need there.
- Check the site’s control panel. It probably has an area for log file management, or a file manager where you can click around and find the log file folder.
- If all else fails, contact the hosting provider’s tech support team. Pick up the phone. Talk to a human being. You’d be amazed how well that works.
If the site’s self-hosted or managed by an internal team:
- Get in touch with whoever manages the server day-to-day. Whether they know it or not, they’ll have the info you need to get the logs.
- If they can’t find the files, but they’re willing to let you get access, get SSH or Remote Desktop permissions on the server. You can then click around and find the log files, or go directly into the IIS control panel/Apache configuration file and find the log file location there.
- If they can’t find the files and they won’t give you access, find out their server platform. Then research possible log file locations on that platform, and ask them to look there.
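If you do end up with shell access to an Apache server, a search like this can surface the log file location from the configuration. It's a minimal sketch: `find_log_directives` is a helper name made up for this example, and the directories in the usage comment are common defaults, not a guarantee for your server.

```shell
# Hypothetical helper: list the CustomLog and ErrorLog directives found
# anywhere under an Apache configuration directory.
find_log_directives() {
    conf_dir=$1
    # -R recurses, -i ignores case, -h hides file names, -E allows (a|b)
    grep -RihE '^[[:space:]]*(CustomLog|ErrorLog)[[:space:]]' "$conf_dir" 2>/dev/null
}

# Usage (assumed default locations; yours may differ):
#   find_log_directives /etc/apache2
#   find_log_directives /etc/httpd
```

The path printed after CustomLog is where the access logs live.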
Got the files? Great! Time to get to work.
2: Extract the log files
Now, you can go download the log files. You probably have a bunch of compressed files up on a server somewhere. They’ll look like the right-hand side of my FTP window:
Transferring files via FTP
Download them to your machine. Decompress them using whatever utility makes sense. If these are .gz files, you can extract them using the GUNZIP command:
gunzip *.gz
That will extract every file in this folder with a .gz on the end, and leave you with something like this:
Extracting files with GUNZIP
Log files may be compressed using ZIP, or something else. You can find the right extraction tool using, I dunno, Google?
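For the most common cases, something like the following may save you the search. It's a rough sketch that picks a tool by file extension; `extract_logs` is a made-up helper name, and it assumes gunzip, bunzip2 and unzip are installed.

```shell
# Hypothetical helper: decompress every archive in a folder, choosing the
# tool by file extension. Files with other extensions are left alone.
extract_logs() {
    for f in "$1"/*; do
        case "$f" in
            *.gz)  gunzip "$f" ;;            # gzip archives
            *.bz2) bunzip2 "$f" ;;           # bzip2 archives
            *.zip) unzip -q "$f" -d "$1" ;;  # ZIP archives, unpacked into the same folder
        esac
    done
}

# Usage: extract_logs ./logs
```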
3: Combine the log files
Ideally, you need a single log file. To combine the log files, use the CAT command:
cat access_log* > biglog.txt
The above command will:
- Read each file that has a name starting with ‘access_log’.
- Write the contents of all of those files into a file named ‘biglog.txt’.
- The single ‘>’ tells the shell to erase any pre-existing file named ‘biglog.txt’ and start over. If you use ‘>>’ instead, the output gets added to the end of the existing file.
If the files are really huge you may have to keep them separate. But that’ll only be an issue if, once combined, the final file is multiple gigabytes in size. GREP is really good at processing huge files.
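If a combined file would be too big to keep around, one option is to skip it and stream the compressed logs straight into GREP. A minimal sketch, assuming the logs are still gzip files; `search_compressed_logs` is a name invented for this example.

```shell
# Hypothetical helper: search gzip-compressed logs without extracting or
# combining them first.
search_compressed_logs() {
    pattern=$1
    shift
    # gzip -dc writes the decompressed data to standard output and leaves
    # the .gz files untouched; grep then filters the stream.
    gzip -dc "$@" | grep -e "$pattern"
}

# Usage:
#   search_compressed_logs '[[:space:]]404[[:space:]]' access_log*.gz > errors.txt
```

The trade-off is that every new search has to decompress the files again.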
Interlude: What you need from this file
You need to find all of the broken external links. So, you’ll need four pieces of data:
- The response code. A web server responds to a request for a broken link with a 404 error code, which then gets stored in the log files you just combined. The response code will let us filter for broken links.
- The referrer. It also stores the referrer—the URL of the linking page. We’ll use this to figure out the value of the broken link.
- The request. It stores the request—the URL of the linked page. The request will tell us which pages we need to replace or redirect.
- The user agent. Finally, it stores the user agent—the type of browser or bot that made the request. This will let us exclude Googlebot visits.
With those four items, you can find all of the external broken links visited by browsers other than Googlebot.
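To see where those four items sit in a log line, you can pull them out with AWK. This sketch assumes the server writes Apache's 'combined' log format, where the request, referrer and user agent are each wrapped in double quotes. The `show_fields` name and the sample line in the usage comment are made up for illustration.

```shell
# Hypothetical helper: label the four fields of each combined-format line.
show_fields() {
    awk -F'"' '{
        split($3, status, " ")   # $3 holds the status code and byte count
        print "request:    " $2
        print "response:   " status[1]
        print "referrer:   " $4
        print "user agent: " $6
    }'
}

# Usage: head -n 1 biglog.txt | show_fields
# A made-up line in this format looks like:
#   203.0.113.7 - - [10/Aug/2012:13:55:36 -0700] "GET /old-page.htm HTTP/1.1" 404 512 "http://example.org/links.html" "Mozilla/5.0"
```

If your server uses a different log format, the field positions will differ.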
4: Use GREP to find the 404 errors
Now to the good stuff. You’ve got one gigantic log file. You can use the GREP command to search through that file at super speed.
Use this command, changing the htm and file names as relevant:
grep "\.htm.*[[:space:]]404[[:space:]]" biglog.txt > errors.txt
This command will:
- Find every line in the log that includes ‘.htm’ and ‘ 404 ’. It uses a regular expression, or regex. I kinda suck at regex, so go to this site if you want to learn more.
- Write that to a file called errors.txt.
This can take a minute or two.
You may need to change the .htm. We’re using it to exclude all of the requests for .gif, .png and other non-html files. We only care about pages this time around. If your site uses php, and all of the URIs end with .php, you’ll have to change .htm to .php.
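GREP matches ‘ 404 ’ anywhere in the line, so a byte count that happens to be 404 can sneak in. If your logs use Apache's 'combined' format, a stricter alternative is to check the status field itself with AWK. A sketch under that assumption; `find_404_pages` is a made-up name.

```shell
# Hypothetical helper: keep only lines whose status code is exactly 404
# and whose requested URL contains ".htm".
find_404_pages() {
    # Split on whitespace, a combined-format line has the requested URL
    # in field 7 and the status code in field 9.
    awk '$9 == 404 && $7 ~ /\.htm/'
}

# Usage: find_404_pages < biglog.txt > errors.txt
```

Check a few lines of your own log before trusting the field numbers.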
5: Get rid of Googlebot
We need to remove all 404 errors generated by Googlebot. GREP can do the job, again. Use this command:
grep -v "Googlebot" errors.txt > errors-no-google.txt
This command will:
- Search through the file you generated in step 4.
- Find every line that does not include “Googlebot”. The -v inverts the search, so GREP finds all lines that don’t match the search criteria.
- Output those lines to a new file called errors-no-google.txt. If the file exists, it’ll wipe that file and create a new one. Use >> if you want to append to the existing file instead.
Notice how fast GREP ran that command? Pretty nifty, huh?
When I ran through this exercise on my laptop, I took a .5 gigabyte biglog.txt file and trimmed it down to a 904kb file that just contained the errors I needed. It took a total of 5 minutes, start to finish. Try this in Excel and you’ll see smoke rising from your computer. GREP is so cool that I’ve written about it before.
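If you'd rather not keep the intermediate file, the two searches can be chained with a pipe. This is a sketch of that idea, not a replacement for the steps above; `errors_without_googlebot` is a made-up name, and the pattern allows any text between ‘.htm’ and the ‘ 404 ’.

```shell
# Hypothetical helper: find 404s on .htm pages, then drop Googlebot lines,
# in one pass over the log file given as the first argument.
errors_without_googlebot() {
    grep '\.htm.*[[:space:]]404[[:space:]]' "$1" | grep -v 'Googlebot'
}

# Usage: errors_without_googlebot biglog.txt > errors-no-google.txt
```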
6: Prepare your spreadsheet
Using whatever spreadsheet software you prefer, import the errors-no-google file as a space-delimited text file:
A space-delimited import in Excel
You won’t need most of the columns. Only three columns really matter:
- The column that includes GET or HEAD and a URL. That’s the request—the page on this site that someone tried to load.
- The column that includes a three-digit number. It usually comes right after the request. That’s the response code—the server’s reply to the request. If GREP did its job, the response should be 404 for every line in the sheet. Sometimes it goes wrong, though, because of a ‘404’ somewhere else in the row. Poop happens.
- The next column should be a URL, or a dash. That’s the referrer. If someone clicked a link on another page, that other page’s URL is the referring URL. It’s shown in this column. If they typed in the page address, or if their browser is set up to hide the referrer, the referrer is ‘-’.
You can delete the rest of the columns. Then insert a new row at the top of the page and label the columns (for example, ‘Request’, ‘Response’ and ‘Referrer’).

That’ll let you indulge in some data processing niftiness later on.
Oh, and save the damned spreadsheet. Nothing sadder than losing all your data because your cat strolled across the keyboard.
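If deleting columns by hand gets tedious, you could have AWK write a file with just the three columns and a heading row, ready to import as tab-delimited text. A minimal sketch, assuming Apache's 'combined' log format; `to_tsv` and the column labels are my own choices.

```shell
# Hypothetical helper: turn combined-format log lines into a tab-separated
# file with a heading row and only the request, response and referrer.
to_tsv() {
    awk -F'"' 'BEGIN { OFS = "\t"; print "Request", "Response", "Referrer" }
    {
        split($3, status, " ")   # $3 holds the status code and byte count
        print $2, status[1], $4
    }'
}

# Usage: to_tsv < errors-no-google.txt > errors.tsv
```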
7: Set up filtering.
Put your cursor in the heading row you created in step 6 and click the filter button:
The filter button in Excel
Now you can sort and/or filter out stuff you don’t need. For example, I may not want to see all of those ‘-’ referrers:
Filtering out ‘dash’ referrers in Excel
And I probably only want to see external broken links, so I can filter out all referrers that include this site’s domain name:
Filtering out a domain name in Excel
Note that I used ‘does not contain’ for the second filter. Read up on Excel’s filter tool. It’s your friend.
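The same two filters can be applied on the command line before the file ever reaches the spreadsheet. This sketch assumes Apache's 'combined' log format; `external_referrers` is a made-up name, and www.example.com stands in for your own domain.

```shell
# Hypothetical helper: keep lines whose referrer is neither a dash nor a
# URL containing the site's own domain (passed as the first argument).
external_referrers() {
    # Split on double quotes, the referrer is field 4.
    # index() returns 0 when the domain does not appear in the referrer.
    awk -F'"' -v site="$1" '$4 != "-" && index($4, site) == 0'
}

# Usage: external_referrers www.example.com < errors-no-google.txt > external-errors.txt
```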
8: Find the broken external links
Phew. Finally. We can find some external links. Take a look at the result:
The final spreadsheet – a link goldmine!
It’s a link goldmine!!! Every row represents a broken link from another site.
Now you can use a pivot table or other spreadsheet awesomeness to find the biggest problems:
Pivot table report showing most-requested broken links.
Or, you can just browse through the raw data. Either way, you’ll find great, easy incoming links.
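If you don't have pivot tables handy, a rough command-line stand-in is to count how often each broken URL was requested. A sketch assuming Apache's 'combined' log format, with `top_broken_urls` as a made-up name.

```shell
# Hypothetical helper: count requests per URL, most-requested first.
top_broken_urls() {
    # Split on whitespace, the requested URL is field 7.
    # uniq -c needs sorted input; the final sort puts the biggest counts on top.
    awk '{ print $7 }' | sort | uniq -c | sort -rn
}

# Usage: top_broken_urls < errors-no-google.txt | head -n 20
```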
9: Prioritize the links
Prioritize broken links like this:
- Broken links from high-authority sites get fixed first. These links could really give you a rankings boost.
- Broken links with a high number of requests get fixed next. A lot of people are still clicking them.
- Everything else.
10: How to fix the links
None of this work means a thing if no one fixes the links! Here are the ways to fix them, from best to worst:
- Rebuild the missing page. If the broken link points at a deleted page, replace that page. If the site’s an online store and the link points at a product that’s out of stock or no longer available, put up a page, at that URL, that says ‘This product is out of stock’ or ‘This product is no longer available’. Then provide links to other relevant pages, or to customer support, or to the category page.
- Build a new page. If the broken link points at a page that never existed or had to be deleted, create something new (but relevant) there.
- Build a detour page. Create a page that summarizes what the old page said and then says ‘But this page is gone now. Sniff. Instead, go over here.’ Then link to an alternative.
- Use a permanent redirect. Create a 301 redirect from the broken link URL to a relevant page. Do not simply redirect to your home page! That just confuses your visitors.
Always use options 1-3 before 4. A permanent redirect is a very imperfect solution, and best applied when you have no other options. 301 redirects will reroute authority for a while, but eventually the authority ‘decays’. Plus, a high number of 301 redirects on a site can wreak havoc with Google and Bing. Both search engines’ crawlers will give up if they see too many redirect ‘hops’.
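When option 4 really is all that's left, you can at least avoid typing the rules by hand. This sketch assumes an Apache server with mod_alias, whose Redirect directive takes a status, an old path and a new URL. The `make_redirects` name and the two-column mapping file are made up for this example.

```shell
# Hypothetical helper: turn "old-path<TAB>new-url" lines into Apache
# permanent-redirect rules. Lines without exactly two columns are skipped.
make_redirects() {
    awk -F'\t' 'NF == 2 { print "Redirect 301 " $1 " " $2 }'
}

# Usage: make_redirects < redirects.tsv > redirects.conf
```

Review the generated rules before adding them to the server configuration, and point each old URL straight at its final destination so you don't create extra hops.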
Put away that letter opener…
This post has over 1800 words. At this point you’re probably ready to stab me. Please don’t. I like my insides in.
And, this isn’t nearly as hard as it seems. With practice, you’ll be zipping through all these steps in under an hour. It’s by far the quickest, easiest way to improve site authority.
Google Gets Slammed with the Biggest FTC Fine Ever
Google has been hit with the biggest fine in the history of the Federal Trade Commission: $22.5 million. It has to do with cookies, bits of computer code placed on your browser when you visit a website.
Another Big Social Marketing Exit: Gannett Will Buy BLiNQ Media For Up To $92M
On the heels of Google buying Wildfire and Salesforce nabbing Buddy Media, we have heard from two very reliable sources, plus a third anonymous source, that Gannett Co., the media giant that owns USA Today and other properties, is buying BLiNQ Media. The price for the Facebook advertising software and service is up to $92 million over a period of three to four years, with a quarter of that amount, $23 million, coming up front.
We hear the purchase agreement has been signed and the pair are now marching towards a close at the end of this month. The rationale behind the deal is clear: when brands buy ad placements on Gannett properties, it could use BLiNQ to also sell them ads on social sites and collect a solid margin.
Gannett is looking to BLiNQ, which has built up a profitable Facebook ads API business, to become Gannett’s equivalent of the Washington Post Company’s SocialCode, its social media marketing and analytics agency (which picked up 15 Digg engineers in May). Gannett and BLiNQ, TechCrunch understands, have already been working together for about a year on ad campaigns for advertising clients, primarily via those brands’ agencies. This will bring more of that expertise in house.
Digital was one of the bright spots for Gannett in its Q2 earnings, reported in the middle of July. Overall revenues of $1.3 billion were down 2% on print advertising pressures, but digital revenues were up by 29.3% in its publishing segment, 33% in its U.S. Community Publishing division, 37% at USA Today, and 10% at the company’s Newsquest UK division.
BLiNQ has only taken in about $3 million in funding, none from VCs, since launching in 2008. It was profitable in its Facebook marketing business from early on, and so it hasn’t needed to seek outside investment. More recently, it has been expanding into marketing on LinkedIn and Twitter, as well as Facebook’s mobile advertising efforts.
Among Gannett’s assets are digital marketing agency PointRoll and online circular company ShopLocal. You can see where a BLiNQ Media acquisition could position it very well to offer social ad buying services and tools to its advertiser and local business clients.
TechCrunch understands that in addition to David Nicol Williams, the co-founder and CEO of BLiNQ, Gannett was also interested in the startup’s engineering team, led by CTO Luis Caballero, who had also built up the engineering team at Vitrue. (Both companies are based in Atlanta, Georgia, which it turns out is something of a hotbed for social media marketing. Who knew?) TechCrunch understands that, like Vitrue, BLiNQ has some IP that it is patenting. BLiNQ’s is centered around media optimization algorithms.
There is still “tons of innovation” that BLiNQ has left to roll out, we understand, so this could be the start of something interesting. With big media properties and advertising clients to test its products, it could crack the social code before SocialCode, and help Gannett stay profitable as paper print media ends up in the wood chipper.
Delicious founder wants you to build micro-apps with this new framework

“What the heck is a micro app,” you might ask? It’s the thing that Delicious founder Joshua Schachter (pictured) wants you to develop for smartphones.
It’s not a full application; rather, it runs on an app (Human.io for iOS and Android), and it allows you to quickly roll out a feature (conducting a poll, posting a photo, etc.) to users. You use the Human.io API; you adhere to simple design principles; and you gather whatever data you like from Human.io users, who are basically data-whore volunteers.
And since it’s a smartphone micro-app, you can take advantage of smartphone features and hardware, such as cameras, GPS, etc. You can also choose to keep the micro-app a micro-secret, only giving access to users who have a special code.
It sounds easy and fast, and Schachter emphasizes that it doesn’t take too many lines of code to get up and running.
“The framework is entirely API-based,” we read in the documentation. “In a few dozen lines of code, you can render UI, upload items to the human.io server, and retrieve responses from users as they complete activities.”
Suggested activities include collecting wait times in lines at restaurants, amusement parks, etc.; photo-sharing around a theme or product (blatant brand pandering alert!); collecting ratings (think Instagram meets Hot or Not — really humanitarian stuff); or something as practical as a survey.
From the end user’s POV, the Human.io app is a way to participate in crowdsourced tasks, either for the common good or for a kickback from brands. From the app’s description in the Google Play store:
Human.io connects you to organizations that need your help with all kinds of tasks, such as completing surveys, taking opinion polls, rating the quality of different images, and more. You can choose to browse only tasks near you (e.g., helping out a local restaurant or charity), or you can lend your help to tasks anywhere. Every day you’ll find new tasks to complete.
“The code runs on your server, but the UI runs on the device. The events are gatewayed back and forth,” Schachter explained to the army of nerds over on Hacker News. “The ultimate vision is to make a way for passive audiences into active participants. We combined things we love: mobile, Mechanical Turk, MapReduce, and Twilio.”
Here are some sample screenshots:

“We need a lot of polish still,” Schachter admitted on Twitter.
Human.io is a product of Tasty Labs, Schachter’s software shop, which is fleshed out by staffers formerly of Google, Dreamworks, and Mozilla.
Filed under: dev
Malware related to Stuxnet and Flame found stealing bank information
Kaspersky Lab announced a new piece of malware that specializes in obtaining login information for bank accounts in the Middle East. It’s called Gauss and is linked to Flame, Stuxnet, and Duqu.
“Gauss is a complex cyber-espionage toolkit, with its design emphasizing stealth and secrecy; however, its purpose was different to Flame or Duqu,” said Kaspersky Lab chief security expert Alexander Gostev in a statement. “Gauss targets multiple users in select countries to steal large amounts of data, with a specific focus on banking and financial information.”
Kaspersky found the malware after digging deeper into Flame, a virus uncovered in May that was billed as one of the most advanced cyber espionage tools to date. Researchers said the malware has “striking resemblances” to Flame in the way it was designed. It seems Gauss shares the same source code from which Flame was built. But its actions are slightly different. While Flame installed a keylogger, turned on the computer’s microphone to record audio, and monitored “communications apps” such as IM, Gauss is focused on obtaining financial information.
Gauss is tailored to steal “access credentials” to Lebanese banks, which include Bank of Beirut, EBLF, BlomBank, ByblosBank, FransaBank and Credit Libanais. Non-Lebanese entities that are also targets include Citibank and PayPal. This information along with browser history, cookies, passwords, system configurations, and more is sent back to the command and control servers. The malware, however, is in a veritable holding pattern since the command and control servers were shut down in July.
Kaspersky estimates that the number of infections is in the tens of thousands, but as of May around 2,500 infections were recorded. This is lower than Stuxnet, but higher than Flame, which Kaspersky says had around 700 infections.
In June, Kaspersky linked Flame to Stuxnet, the famous malware that hit Iran’s nuclear infrastructure in 2010. Many of Flame’s functions looked identical to those of Stuxnet, spurring Kaspersky to dig deeper into the connection. Now the research firm says the two may have had creators that worked closely together, even sharing some of the same source code.
Gauss is the latest member of the family.
hat tip Wired; Image via Shutterstock
Filed under: VentureBeat
How to Listen with Free and Paid Social Media Monitoring Tools – Pharma and Healthcare
Yesterday I spoke at the 2012 Digital Pharma Seminar hosted by Princeton Digital and VIVA!Communications. The topic was “building a digital roadmap” … or more precisely, where to start with digital strategy for pharmaceutical companies.
In highly regulated and competitive industries like pharmaceuticals, there is, however, a greater level of attention paid to the area of “listening and monitoring”. And the Medicines Australia code of conduct, edition 16, explicitly addresses social media in section 12.9 as follows:
Information provided to the general public via any form of social media must comply with the provisions of Section 12.3, 12.4, 12.5, 12.6 and 12.7 of the Code.
The focus, essentially, is on the provision of accurate and scientifically reliable information – not promotion. So I thought it may be useful to modify my How to Listen Infographic specifically for pharma. It also includes a small selection of tools that can be used to support your monitoring efforts – including free sites like SocialMention.com and Google Alerts – as well as “for fee” and “freemium” platforms like Radian6, NetBase and HootSuite.
Be sure to let me know if you have improvements or other suggestions!









