2009-05-28

Who are my readers?

I want to get in touch with my loyal readers, so please share some information about your interests and your suggestions for this blog.

I hope it will help me improve this blog and connect with interesting people who have wonderful ideas and build cool projects.

Post your answers in the comments or use my blog email, itprolife@gmail.com, if you don't want to make them public.

Here is a short survey template:

Nick/name:

Country:

Your website/blog/other web realm:

Favourite websites:

Interests:

What do you like/hate the most in my blog?

Other suggestions and comments:

Contact (optional)* :


* All fields are optional, so feel free to write anything you want.

2009-05-14

Pragmatic printable document format

What open file format comes to mind when you have to produce masses of personalized electronic documents in an enterprise system?

I have been reviewing pragmatic solutions for the given requirements:
  • ready for batch printing and producing good-quality prints on customers' printers
  • keeping documents in electronic form for evidence (content cannot be changed) - the business process is paper based
  • customizable templates (text, formatting, tables)
  • optimized total cost of the process (including customer-side printing issues)

PDF seems a good choice: it is a widely supported format and produces nice printouts. I tried it and it worked quite well. The cons of that solution popped out during batch processing: generating the final PDF was slow, and the files were bigger because they embedded additional font data (national character issues - no problems with standard fonts).

Another approach was using XHTML as the final format, printed from the web browser. That needs more attention to formatting and automatic text flow, but it also makes good prints.
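To get more page-like output from the browser, a print-specific stylesheet can hide screen-only elements and control page breaks. A minimal sketch, assuming hypothetical class names (not from any real system):

```css
/* Hypothetical print stylesheet for batch-printed XHTML documents */
@media print {
    .navigation, .toolbar { display: none; }  /* hide screen-only chrome */
    .document { page-break-after: always; }   /* one document per sheet */
    body { font-family: serif; font-size: 12pt; }
}
```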

Users can print documents in both formats without problems. XHTML was faster to produce and worked with standard system fonts, so the files were in most cases smaller than their PDF counterparts.

That was the scenario for a paper-dominated document flow. Electronic document processing also requires easy, automatic access to the important data in the document content. The best standard for that seems to be XML containing structured data. That, plus an XSLT transformation describing the form view, may be used to produce printable XHTML or PDF (with the help of FOP, the Formatting Objects Processor).
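As a sketch of that idea, a structured document record and a fragment of an XSLT stylesheet rendering it as XHTML might look like this (all element names are hypothetical examples):

```xml
<!-- invoice.xml: hypothetical structured document data -->
<invoice>
  <customer>John Doe</customer>
  <total currency="EUR">120.00</total>
</invoice>

<!-- invoice.xsl: fragment of an XSLT template producing XHTML -->
<xsl:template match="invoice"
              xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <html xmlns="http://www.w3.org/1999/xhtml">
    <body>
      <p>Customer: <xsl:value-of select="customer"/></p>
      <p>Total: <xsl:value-of select="total"/>
         <xsl:value-of select="total/@currency"/></p>
    </body>
  </html>
</xsl:template>
```

The same XML data could go through a second stylesheet emitting XSL-FO instead of XHTML, which FOP then turns into PDF.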

New document formats like ODF are based on similar ideas, so the future lies somewhere here; the new solutions just need wider adoption.

2009-05-04

Removing page from Google index and cache

Everybody wants search engines' attention for their public content. But there are situations when something improper leaks to the public by mistake, and then we try to reduce the harm caused by the incident.

That happened to one of our clients, so I got a request to remove a particular page from the Google index and cache. Of course, it wouldn't help much for content that had been hosted for about 1.5 years - if people found it interesting, there is practically no way to remove that content from the Internet. But most people see only Google results, so the decision was to target Google.

It seemed simple and obvious. The possible methods are:
  • filtering the page with robots.txt
  • meta tags for bots in the page HTML
  • a 404 HTTP response
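For reference, the first two methods look roughly like this (the blocked path is a made-up example):

```
# robots.txt - keep crawlers away from the leaked page
User-agent: *
Disallow: /private/leaked-page.html
```

```html
<!-- in the page's <head>: ask bots not to index or cache it -->
<meta name="robots" content="noindex, noarchive" />
```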

Google's support pages have a detailed procedure for removing individual pages from the index and from the cache.
There are obvious assumptions that are easily forgotten: your site needs to have a valid robots.txt file, or be hosted without one at all. When the Google crawler assumes something is wrong with your robots.txt, it doesn't even move any further to update your site's index or cache. After some tweaking it seemed the case was closed. I registered an urgent removal request in Webmaster Tools. And waited for the crawlers.
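One way to sanity-check that your robots.txt actually blocks the page before asking Google to act is Python's standard urllib.robotparser. A small sketch (the domain and path are made-up examples):

```python
from urllib.robotparser import RobotFileParser

# The same rules that would be served at /robots.txt
rules = [
    "User-agent: *",
    "Disallow: /private/leaked-page.html",
]

parser = RobotFileParser()
parser.parse(rules)

# Googlebot should be denied the leaked page but allowed everywhere else
print(parser.can_fetch("Googlebot", "https://example.com/private/leaked-page.html"))  # False
print(parser.can_fetch("Googlebot", "https://example.com/index.html"))                # True
```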

Everything should work as planned? Wrong!

After a few days I got my "urgent URL removal request" denied. The reason?

"The content you submitted for cache removal appears on the live third-party page. ....
As you may know, information in our search results is actually located on third-party, publicly available webpages. Even if we removed this page from our index, the content in question would still be available on the web"


No clues about what that third-party page is. Probably the crawler engine can't resolve issues caused by multihosting on my server plus a couple of DNS names for the requested site. I'm still trying to get more information about the problem.

UPDATE

Adding the second domain to Google Webmaster Tools helped. This time my request to remove the page from the cache went through. The indexed URL is now up to date as well.