Sunday, 15 March 2009

DVLA Bites

Ok so I'm at work and I get a call from my housemate saying "There are some guys out the front of the house loading your bike onto a truck". So the long and the short of it is that I hadn't paid my road tax, because I never received the notice. This really bothers me, as the DVLA clearly knows where the bike is, what my name is, what my email address is, what my phone number is and so on. Instead of trying - just once - to get hold of me, they put the bike on the back of a truck, hauled it halfway across London and then expected me to pick it up.

Why not just give me a parking ticket?

Anyway I pay the tax and then go to pick the bike up, and guess what - the guy serving me has so much attitude that I actually have to say "you don't need to be so hostile towards me". No doubt most people who come to collect their vehicle are pretty pissed off. And so was I, though I was still polite (at that point).

I get to the bike and see they have cut the auxiliary chain, which wasn't securing the bike to anything other than my other bike. Why? No idea - it wasn't stopping them from moving it in any way, and my housemate would have given them the key if they had asked. They also seem to have damaged the rear tire, which I noticed quite abruptly as I turned a corner on the way out of the impound lot and nearly went tits up. Jackasses.

So after all this I'm down a huge fine, a very expensive lock, and a rear tire. What for?

Not paying my £15 - yes, only £15 - road tax within two months, when the bike had not been ridden over winter. Maybe if I had been riding it at the time I might think differently, but really this is too much.

The punishment does not fit the crime, in my opinion anyway.

Credit Crunch Explained

I just came across this great video explaining the credit crunch - very informative!


Thursday, 11 December 2008

HTML/URL/JavaScript Encoding Made Easy

When developing a web application, one of the key security issues to remember is to apply the correct encoding to any text that is written to the page - whether it comes from a resource file, a content management system or user-generated input. Without this, two problems can generally occur. Firstly, any external content can produce invalid markup that, at its worst, breaks the layout of your page or prevents JavaScript from executing. For example, forgetting to escape a single quote in a string used in a JavaScript function - such as the apostrophe in "won't" - will prematurely start or end a string. That sounds pretty bad, right? Your JavaScript will throw a syntax error and not run.

The second problem you can face is much more sinister, however. By exploiting it, a malicious attacker can post content to your site that will in turn be viewed and executed by unsuspecting users. There are two prominent forms of attack that developers need to be aware of - XSS (cross-site scripting) and XSRF or CSRF (cross-site request forgery). To summarise very briefly, XSS involves the attacker injecting script aimed at stealing a user's details from your website or otherwise manipulating the user's browser. XSRF has been getting a bit of media attention lately; in the context of encoding it is a close relative of XSS whereby the attacker exploits your website to execute requests against another site that the user may be a registered member of. In this sense your site becomes a vehicle for malicious attacks, and in much the same way that an open SMTP relay will get your mail server blacklisted, you may start to find your site flagged as potentially dangerous. Not good!

Anyway, I'm not going to dwell on these topics as there is plenty of good information on the net already - although the statistics on the number of vulnerable sites and the time taken to patch aren't great. The sites most at risk are the ones that allow users to submit information that other readers can then view - which is pretty much all the good ones!

So we need an easy way to prevent this, and the first level of protection is to encode all of your user input. Unfortunately ASP.NET is incredibly inconsistent in the way text is encoded. In general my team and I stay away from ASP.NET web controls (such as Label, LinkButton etc.), and instead favour clean HTML using the ASP.NET HtmlControls namespace (any standard tag with runat="server") and the odd <asp:Literal /> control. This works well and produces markup that is very similar to the final output, which always helps when comparing rendered HTML to an ASPX page, or when writing some CSS.

This takes us back to manual encoding (or using the InnerText of an HtmlControl). I was looking at the String.Format method and realised that a good way to implement encoding would be an implementation of IFormatProvider for each of the encoding types. IFormatProvider allows the formatting of an object to a string - the standard ones are NumberFormatInfo, DateTimeFormatInfo and CultureInfo. You can build your own by implementing ICustomFormatter - which only requires one method, Format() - alongside IFormatProvider's GetFormat() so that String.Format can find it.

My basic implementation makes use of a modified version of the AntiXSS 2.0 library (I have added extra Unicode characters that are safe). After getting it all up and running it is as simple as String.Format(EncodingInfo.JavascriptEncoder, "function() {{ alert('{0}'); }}", myString) - note the doubled braces so String.Format treats them as literals - and voila, all your text is encoded.
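
For anyone wanting to roll their own, the shape of it is roughly as follows. This is only a minimal sketch - it uses HttpUtility.HtmlEncode as a stand-in for the modified AntiXSS encoder, and an illustrative HtmlEncodingFormatter class rather than my actual EncodingInfo encoders:

using System;
using System.Web; // needs a reference to System.Web for HttpUtility

// Illustrative only - the real thing would call the (modified) AntiXSS encoders.
public class HtmlEncodingFormatter : IFormatProvider, ICustomFormatter
{
    // IFormatProvider: String.Format asks for an ICustomFormatter, so hand back this instance.
    public object GetFormat(Type formatType)
    {
        return formatType == typeof(ICustomFormatter) ? this : null;
    }

    // ICustomFormatter: encode each argument as it is formatted into the output string.
    public string Format(string format, object arg, IFormatProvider formatProvider)
    {
        return arg == null ? string.Empty : HttpUtility.HtmlEncode(arg.ToString());
    }
}

// Usage: every argument gets HTML encoded on its way into the string.
// string safe = String.Format(new HtmlEncodingFormatter(), "<p>{0}</p>", userInput);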

There is only one problem, however: String.ToString(IFormatProvider) is a no-op! This makes sense when you think about it - why would you need to format a string as a string? Encoding text is the only case I can think of, so that's fair enough, but it would have been great to write "unsafe <script> string".ToString(EncodingInfo.HtmlEncoder)! So I'm probably left with creating an extension method for that case - along the lines of "unsafe <script> string".Encode(EncodingInfo.HtmlEncoder) - which is not too bad either.
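
Something along these lines - again just a sketch, assuming an IFormatProvider/ICustomFormatter pair like the one above:

using System;

public static class StringEncodingExtensions
{
    // 'encoder' is any IFormatProvider that hands back an ICustomFormatter
    // (the HtmlEncodingFormatter sketched above, or an EncodingInfo-style encoder).
    public static string Encode(this string value, IFormatProvider encoder)
    {
        // String.Format routes the argument through the custom formatter -
        // which is exactly what String.ToString(IFormatProvider) never does.
        return string.Format(encoder, "{0}", value);
    }
}

// Usage:
// string safe = "unsafe <script> string".Encode(new HtmlEncodingFormatter());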

Sunday, 24 June 2007

More SQL

The OPTION (FAST n) query hint tells the optimiser to choose a plan that returns the first n rows as quickly as possible. Overall query time can suffer, but the trade-off is that the user can see some results while the rest of the query is still executing.

Create Calendar Table
In Visual Studio -> New Project -> Analysis Services Project -> New Data Source
then New Dimension -> Date Template and go from there

Always alias table and column outputs with AS so that the consuming developer is abstracted from the actual table and column names

SQLStressTest - good for testing performance and even debugging.
http://www.apress.com/book/supplementDownload.html?bID=10220&sID=4251

Date Time:
ISO Format is:
yyyy-MM-ddThh:mm:ss.mmm
SQL Server also accepts:
yyyyMMdd hh:mm:ss


Hashing:
SELECT HashBytes('MD5', 'Text') -- the algorithm can be 'MD2', 'MD4', 'MD5', 'SHA' or 'SHA1'

Dynamic SQL:
sp_executesql - can make use of parameters and hence a) reuse query plans and b) most importantly, prevent SQL injection. Note that the statement itself must be NVARCHAR.
Usage:
EXEC sp_executesql @sqlStatement, N'@String VARCHAR(255), @Int INT', @SomeString, @SomeNumber

Concurrency Control:
Blocking reads (readers take shared locks): READ COMMITTED, REPEATABLE READ, SERIALIZABLE
Non-blocking reads: READ UNCOMMITTED (dirty reads), SNAPSHOT (row versioning)
Usage: SET TRANSACTION ISOLATION LEVEL REPEATABLE READ


Wednesday, 6 June 2007

Google Source Code

If someone stole the source code to Microsoft Windows it might cause a few security issues, but they couldn't do much with it. I would imagine they would get sued pretty hard by the MS team of lawyers. But what about the source to Google Search and Indexing? You could tweak it and then create a rival service without a huge risk of getting busted...well maybe.

I would like to build a search engine one day; it's pretty hard though. I wrote some of the blinkBox search engine and it's given me a new appreciation for good search results and how difficult they are to achieve. Those blokes at Google are pretty sharp, I would say.

Lessons in Agile

We have been using an agile-based methodology at work for a few months now. Overall it has been a good experience for both the developers and upper management. From a technical management point of view certain areas have worked out very well, such as morning standup meetings and enforcing test and customer reviews. My biggest problem, however, is that we have used paper cards for most of our tracking. This has made it quite difficult to plan, view and log work. In retrospect, all this information probably needs to be backed by some kind of data store. One of our devs would spend quite a long time recording this information in Excel to produce burn-up charts, but historical information was not centrally stored.

We are now investigating Team Foundation Server, though the price has really surprised me, and not in a good way. Hopefully using TFS we can track requests, work items, bugs and even feedback from a central store. I also really like the source control features, such as shelving and advanced branching - although I am yet to see this in a work environment.

Anyway back to agile. I thought I would document some of the lessons that we learned along the way:
  • Morning Standups - otherwise known as scrums - are great. One of the common scenarios we have is that some devs will get caught up in the problem at hand. Anyone can just say "take it offline" (or similar) when they are not interested, ending that conversation and indicating that the relevant parties should continue after the meeting.

  • As I mentioned, make sure you have a central data store for all the work that is carried out. While a card based system works ok for the work in progress, previous or upcoming work can be hard to track.

  • Get a large board with columns divided into stages (Not Started, Development, Review, Testing, Customer Review, Complete), and on the other axis divide the rows up by developer. This should make each developer a lot more accountable for any work they have outstanding.

  • Maintain good visibility for the team regarding where they are and what the goals are. This can be achieved verbally in the standup and also by generating a burn up or burn down chart that should be displayed near the tracking board. One of our devs was very good at this and it helps a lot when trying to motivate the team.

  • Get cards printed and make sure they are small enough! We used colour-coded cards; they work very well and are easy for anyone from management or development to understand. White for functionality, red for bugs, blue for technical debt (when a job will need to be revisited) and green for infrastructure.

  • Enforce testing. We used stickers (shiny stars - like back in school if you did well) on each card indicating the type of test: red for manual testing, gold for unit tests and silver for Selenium tests. Work should not be accepted until the tests have been written.

  • Track work done! Record the time worked, the time remaining and the stages the work item has been through (e.g. Development -> Review -> Development -> Review -> Test). This can easily be achieved with colour-coded stickers on the top of the cards.

  • Issue a priority for each work item. This makes it easy for devs in a self-organised team to override each other without causing an upset. "Can you please work on this with me now? As you can see it is a priority one bug and I need your input" won't result in anyone getting upset.

  • Make sure you still do appropriate requirements analysis; agile does not mean that the development team can just do it their own way. Good requirements make for good acceptance criteria.

Tuesday, 5 June 2007

HTTP Compression

I often wonder how Google compresses their pages on every request (do they compress every request?). Some ideas I have come up with are:

All pages are generated and then compressed once, and subsequent requests are served from a cache - but it troubles me that they would do this. My main issue is that:
a) according to the original white paper from Google, all pages are stored compressed in the gzip format, hence requiring some decompression before processing; combined with
b) the search results are highlighted with the exact phrase searched.

If this is true then each result must be processed, requiring decompression, and then compressed again to send back to the client. That just seems like a waste of processing, and given the number of requests they need to serve it doesn't seem likely.

But can we improve on this? I think so. While I am somewhat familiar with Huffman coding, I don't know much more about LZ than what I have read on Wikipedia. These two techniques are combined to make up gzip (DEFLATE), the compression format Google most likely uses. Huffman coding, in particular, is based on generating a key table that maps a common series of bits (or letters) to a shortened binary version.
For example, <strong> is 8 * 8 = 64 bits in ASCII (and in UTF-8), but a table may map that sequence to a much smaller series of bits, say 12 bits, due to its relatively high frequency in HTML documents. The basis of this compression is making frequent patterns short while infrequent patterns are longer, and not wasting space on non-existent patterns, resulting in a shorter overall length. UTF-8 is itself a variable-length encoding, though not a frequency-based one - but that's for another post.

This key table generally changes for every block of data encoded, but let's think about what Google are compressing - HTML. They may have a predetermined table (or a set of tables, perhaps indicated by a header) that is used for all documents. I think that makes sense, and it fits with Google publishing statistics on common web tags and attributes, as they may use these statistics to generate new encoding tables.

Now let's assume they don't encode strings containing spaces as a single combination of characters. For example, " we " is four characters including the spaces, and is hence not treated as part of " welcome" - the two strings would have completely unrelated compressed codes. So if we accept the assumption that spaces are not folded into a keyed entry in the compression table, we can do some powerful text manipulation while the content is compressed - although only on full words.

If we compress the search text (or its tokens/words) then we can easily compare this to the compressed results - they should have the exact same bit pattern if they use the same key table. It may not be quite this simple - a series of other concatenated bit patterns may happen to match your search bit pattern, so some parsing logic may be involved. I can't think of a great way of achieving this yet, because in order to have a unique bit pattern we would need a long code. Otherwise we would have to perform some processing on each extended code when generating our key table to make sure that its bit pattern does not repeat. This would adversely affect the compression ratio that could be achieved, although it may be a small trade-off overall. Using codes aligned to fixed boundaries may help: for example, if every code were a multiple of 4 bits long, a search pattern could only start at a quarter of the bit positions, greatly reducing the number of possible false matches.

If this process does work then you can now inject some compressed bits into the stream - namely some compressed <strong> tags to highlight the search word(s). Voila, we have read and manipulated the text while it is still compressed, without fully decompressing or re-compressing it!
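
To make that concrete, here is a toy sketch of the idea in C#. It works on whole-word tokens against a fixed, shared code table (standing in for a pre-agreed Huffman table) and keeps the "compressed" codes as strings of 0s and 1s so they are easy to read - so it is nothing like real gzip/DEFLATE, just the principle of matching and injecting codes without decompressing:

using System;
using System.Collections.Generic;
using System.Linq;

class CompressedHighlightDemo
{
    // A pre-agreed, prefix-free code table shared by the compressor and the searcher
    // (standing in for a fixed Huffman table). Values are bit patterns written as text.
    static readonly Dictionary<string, string> Table = new Dictionary<string, string>
    {
        { "<strong>", "000" }, { "</strong>", "001" },
        { "welcome", "010" },  { "to", "011" },
        { "the", "100" },      { "results", "101" },
        { " ", "11" }
    };

    // Compress a token stream against the shared table, keeping code boundaries.
    static List<string> Compress(IEnumerable<string> tokens)
    {
        return tokens.Select(t => Table[t]).ToList();
    }

    static void Main()
    {
        var page = Compress(new[] { "welcome", " ", "to", " ", "the", " ", "results" });

        // Compress the search word once, then compare codes directly - no decompression.
        string queryCode = Table["welcome"];

        var highlighted = new List<string>();
        foreach (var code in page)
        {
            bool isMatch = code == queryCode;
            if (isMatch) highlighted.Add(Table["<strong>"]);   // inject a compressed <strong>
            highlighted.Add(code);
            if (isMatch) highlighted.Add(Table["</strong>"]);  // inject a compressed </strong>
        }

        // The concatenated bits decode with the same shared table at the other end.
        Console.WriteLine(string.Concat(highlighted));
    }
}

Real DEFLATE keeps code boundaries implicit in the bit stream, which is where the alignment problem above comes in, but the principle is the same.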

The other reason I like this idea is that by using a common, constant compression table they can construct each page response from pre-compressed parts of HTML. If the key table is the same for all requests, then generating the header is simple and blocks can simply be concatenated, which is not normally possible when they are based on different key tables. For example a single search result can be concatenated to the previous result, which, taking a step back, makes a good platform for distributed processing and caching. Once they have built this type of compression engine they can use it for pretty much any page, from news to mail. Maybe that is the Google advantage?

Maybe they don't do anything like this - I really have no idea - but it sounds like a possible solution and meets all the functional and performance requirements I can think of right now. I might go and investigate it some more.

Any thoughts?