HTTP Archive: past and future

October 12, 2015 4:19 am | 6 Comments

Background

The HTTP Archive crawls the world’s top 500K URLs twice each month and records detailed information like the number of HTTP requests, the most popular image formats, and the use of gzip compression. We also crawl the top 5K URLs on real iPhones as part of the HTTP Archive Mobile. In addition to aggregate stats, the HTTP Archive has data (including waterfalls, filmstrips and video) for individual websites, for example, Apple, CNet, and YouTube.

Growth of websites Oct 2012 to Oct 2015

Pat Meenan and I started the project in 2010 and merged it into the Internet Archive in 2011. The data is collected using WebPagetest. The code and data are open source. The project is funded by our generous sponsors: Google, Mozilla, New Relic, O’Reilly Media, Etsy, Radware, dynaTrace Software, Torbit, Instart Logic, Catchpoint Systems, Fastly, and SOASTA mPulse.

Past Year

Pat and I are always working to improve the HTTP Archive in terms of reach, metrics, and features. Here are some of the major achievements over the past year.

500K URLs: One of the most apparent changes in the last year is the increase from 300K to 500K URLs tested on desktop (using IE9). These are the world’s top URLs based on the Alexa Top 1,000,000 Sites. Our goal is to reach 1 million on both desktop and mobile.

New Hardware: Adding more URLs was possible when we upgraded our hardware adding a Supermicro 2U quad node server and ten Samsung 850 Pro 1TB SSDs. This added enough test capacity and storage to increase to 500K URLs. This new setup should be enough to allow us to go to 1 million URLs on both desktop and mobile once we settle on test environments. (More on that later.)

New Sponsors: In the past year Fastly and SOASTA mPulse joined our illustrious array of sponsors. This was critical in terms of finances since the new hardware greatly reduced our savings. Just last month we had to replace the SSDs on our original server and the funds from these new sponsors made that repair possible. Many thanks to all of the HTTP Archive sponsors for making this project possible.

Better Content Types: We improved how responses are classified into content types. The Content-Type response header is incorrect about 5% of the time, and these used to be classified as “other”. Now most of the “other” responses are given one of these new content classifications: video, audio, xml and text. As a result, the number of “other” responses dropped dramatically when this change was rolled out in April 2015.

New Formats: In addition to improving content type classification, we added new code to detect the format for images (gif, jpg, png, webp, ico, & svg) and video (flash, swf, mp4, flv, & f4v). This allows us to do deeper comparisons of new formats such as WebP as shown in the chart below.

Chrome: From the inception of the HTTP Archive the goal was to do our testing on the browser version used by the most people in the world. Back in 2010 we used IE8 for all testing. A few years back we switched to IE9 when it became the most popular browser. Today, Chrome is the most popular browser so we need to switch to Chrome. In discussions with Pat and others in the HTTP Archive forum, we decided to run Chrome and IE9 side-by-side so we could compare the results before making the transition. Therefore, we created HTTP Archive Chrome back in May 2015.

Android: HTTP Archive Mobile processes 5K URLs using real iPhones, but it’s not feasible to scale this up to our goal of 1M URLs. Therefore, we’re evaluating using Chrome to emulate Android devices. In May 2015 we rolled out HTTP Archive Android to start gathering data in order to evaluate making this transition.

Looking Ahead

Our goal from the beginning has been to analyze the world’s top 1 million URLs. In order to do that we need to transition to using Chrome for desktop analysis and Chrome emulation for mobile. Now that we have months of data in HTTP Archive Chrome and HTTP Archive Android, the next step is to validate the results by comparing to the previous IE9 and iPhone results. It’s likely there will be differences, and we need to identify the causes for those differences before making the switch.

In addition to changing our test agents and increasing the number of URLs, we need to upgrade the charting software. We’re also going to focus on some new custom metrics that focus on identifying critical blocking resources.

Please check out these new changes in the HTTP Archive, HTTP Archive Mobile, HTTP Archive Chrome, and HTTP Archive Android. Make sure to search for your website to see your own performance history. If you have your own questions you’d like answered then try using the HTTP Archive dumps that Ilya Grigorik has exported to Google BigQuery and the examples from bigqueri.es. And send any other questions or suggestions to the HTTP Archive forum.

 

6 Responses to HTTP Archive: past and future

  1. Hi Steve!

    These are great news & I’m very much looking forward to this bright future for HTTPArchive. It has been a big help in the past already & with those improvements, it will certainly be an even greater resource. Thank you for making it happen!

    One question: will the data from HTTP Archive Chrome and later also the HTTP Archive Android make it into Bigqueri.es?

  2. Hi, Tobias! Yes, starting this month the MySQL dumps will be available for download as well as in Big Query.

  3. Thank you for all the work that goes into HTTP Archive. A couple of quick questions.

    What sources are you using to determine the most popular browser?

    we’re evaluating using Chrome to emulate Android devices

    Is this the Chrome mobile device emulation? Or are you looking to come up with some way to emulate actual Android hardware?

    Mobile testing seems to be the next big hurdle in general for perf testing. Chrome mobile emu is helpful, but doesn’t provide the impact of running on actual mobile hardware ( vs. desktop systems ).

  4. Joseph: Great great questions.

    1. I look at various browser stat sources. All of them show that Chrome stable is the most popular single browser worldwide. IE is very popular if you add up usage across all version (6, 7, 8, 9, 10, 11, edge) but we can only test on a single version, and Chrome stable capture that.

    2. We’re using WebPageTest’s features which allow us to set viewport size and User-Agent string, as well as bandwidth settings.

    3. We are not looking to emulate hardware. It’s true that our tests might not reflect the performance of mobile hardware, but it’s important to keep in mind that HTTP Archive is mostly about non-time metrics (# of requests, image formats, use of gzip). In fact, you’ll notice we do not show any time charts. We’re only doing 3 page loads from one browser in one location with one browser. A much larger (orders of magnitude) sample set representing a wider range of settings would be necessary to have confident time measurements.

  5. They’re obviously meant to be a time sequence but what exactly are numbers under the x-axis.

  6. D: In the first three charts, the x-axis is dates in MM/DD format. In the final chart the x-axis is size in KB.