The web works by clients requesting information from a web server and receiving that information in response. This request-response exchange takes place over HTTP, which is a stateless (or sessionless) protocol. As the name suggests, HTTP itself has no notion of a session. The web server simply records the individual "events" that occur during this exchange, and these events are what the lines in the web server's log files reveal.
For analysis, the requirement is more often than not to report data by grouping events into visits, or sessions. With the log files alone, sessions cannot be recovered perfectly, but they can be approximated closely. This article covers a few easy ways to solve this problem.
You can always create a unique id for each visitor to your website and store it on the visitor's machine in the form of a cookie, which can be read only by your website's domain. You can also track sessions with a unique session id that is stored in a cookie in the same way. With both ids in place, you can not only track visitors' sessions but also differentiate between return visitors and new visitors.
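As a rough illustration of the cookie approach, the sketch below uses Python's standard `http.cookies` module. The cookie names `visitor_id` and `session_id`, the one-year lifetime, and the function `identify` are assumptions made for this example and are not prescribed by the article. The visitor cookie is long-lived, while the session cookie has no expiry and so is dropped when the browser closes.

```python
import uuid
from http.cookies import SimpleCookie

VISITOR_COOKIE = "visitor_id"   # hypothetical cookie name
SESSION_COOKIE = "session_id"   # hypothetical cookie name
ONE_YEAR = 365 * 24 * 3600      # assumed lifetime of the visitor cookie, in seconds


def identify(cookie_header):
    """Read ids from a request's Cookie header, creating any that are missing.

    Returns (visitor_id, session_id, known_visitor, new_session, set_cookie_values),
    where set_cookie_values are the values to send back in Set-Cookie headers.
    """
    incoming = SimpleCookie()
    incoming.load(cookie_header or "")
    outgoing = SimpleCookie()

    known_visitor = VISITOR_COOKIE in incoming
    if known_visitor:
        visitor_id = incoming[VISITOR_COOKIE].value
    else:
        visitor_id = uuid.uuid4().hex
        outgoing[VISITOR_COOKIE] = visitor_id
        outgoing[VISITOR_COOKIE]["max-age"] = ONE_YEAR
        outgoing[VISITOR_COOKIE]["path"] = "/"

    new_session = SESSION_COOKIE not in incoming
    if new_session:
        session_id = uuid.uuid4().hex
        outgoing[SESSION_COOKIE] = session_id
        outgoing[SESSION_COOKIE]["path"] = "/"
    else:
        session_id = incoming[SESSION_COOKIE].value

    set_cookie_values = [morsel.OutputString() for morsel in outgoing.values()]
    return visitor_id, session_id, known_visitor, new_session, set_cookie_values
```

Under these assumptions, a request for which `known_visitor` and `new_session` are both true marks the start of a return visit, while a request with neither cookie comes from a visitor who is new, or who at least appears to be new.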
The other side of the story, however, is that cookies don't always work. Consider a scenario where the visitor deletes the cookies. Although the visitor is coming back to your website after a previous visit, you will still count them as a new visitor. Another scenario is where the visitor disables cookies for your domain in the browser. You will not get the session details at all, let alone tell new visitors from return visitors. Worse, you will count the events of a single session as separate sessions, because each request would be given a newly created session id.
Passing the session id in the URL is another workaround, intended for visitors who do not accept cookies. You can bundle the session id with the query parameters of the URL and pass it on to every page browsed during that session. This job is a bit tedious, as it involves carefully coding the links so that the session id is not lost at any stage. And if the website is frequently updated with new pages and changes to existing pages, the QA team has to be very careful and check that the chain is not broken at any link.
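A minimal sketch of this link rewriting, using Python's standard `urllib.parse`, might look like the following. The parameter name `sid` and the helper names `add_session_id` and `has_session_id` are hypothetical, chosen only for illustration. The second helper is the kind of check a QA pass could run over a page's links to find where the chain is broken.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

SESSION_PARAM = "sid"  # hypothetical query parameter name


def add_session_id(url, session_id):
    """Return url with the session id set as a query parameter.

    Any existing value of the parameter is replaced, and the rest of the
    query string and the fragment are preserved.
    """
    parts = urlsplit(url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != SESSION_PARAM
    ]
    query.append((SESSION_PARAM, session_id))
    return urlunsplit(parts._replace(query=urlencode(query)))


def has_session_id(url):
    """Return True if the link already carries a session id."""
    query = parse_qsl(urlsplit(url).query, keep_blank_values=True)
    return any(key == SESSION_PARAM for key, _ in query)
```

For example, `add_session_id("/products?page=2", "abc123")` yields `/products?page=2&sid=abc123`. Every internal link rendered on a page would need to pass through such a function for the session to survive from one page to the next.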
Combining the IP address, the User Agent, and the timestamp is the crudest method of sessionizing the log file. All the entries in the log file that have the same IP address and the same User Agent are assumed to belong to the same visitor. The timestamp comes in when you define the session: you have to decide on the maximum gap between two consecutive entries that can still be considered part of the same session. With this method, you need not depend on any information other than what is given in the web server log file.
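The grouping described above can be sketched in a few lines of Python. The 30-minute timeout is only an assumed value, since the article leaves the threshold for you to choose, and the function name `sessionize` and the `(ip, user_agent, timestamp)` tuple layout are likewise assumptions for this example. The log entries are taken to be already parsed.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # assumed maximum gap within one session


def sessionize(entries, timeout=SESSION_TIMEOUT):
    """Group log entries into sessions.

    entries: iterable of (ip, user_agent, timestamp) tuples, timestamp a datetime.
    Returns a list of ((ip, user_agent), [timestamps]) pairs, one per session.
    """
    # Entries with the same IP address and User Agent are treated as one visitor.
    by_visitor = defaultdict(list)
    for ip, user_agent, timestamp in entries:
        by_visitor[(ip, user_agent)].append(timestamp)

    sessions = []
    for visitor, stamps in by_visitor.items():
        stamps.sort()
        current = [stamps[0]]
        for timestamp in stamps[1:]:
            if timestamp - current[-1] > timeout:
                # The gap is too long: close this session and start a new one.
                sessions.append((visitor, current))
                current = [timestamp]
            else:
                current.append(timestamp)
        sessions.append((visitor, current))
    return sessions
```

With this sketch, three hits from one IP address and User Agent at 10:00, 10:10, and 11:00 form two sessions, because the 50-minute gap before the last hit exceeds the assumed timeout.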
Again, the drawback of this method is that your website could have visitors behind an ISP who share the same IP address and, to make matters worse, have the same User Agent in their browsers. In such a scenario, you will undercount the visitors to your website. The flip side is when the ISP changes a visitor's IP address during a visit. Then you will count more sessions than actually exist.
These are the kinds of problems that the technology team of the web analytics department faces regularly. And we continue to solve such problems in an innovative fashion.