Deep Web Crawling

Simulating human browsing behaviour, websites and forums are scanned in order to extract content matching specific search parameters.
Link structures are followed in depth, including automatic login, bypassing of IP and country blocks, and handling of captchas, link redirects and camouflage mechanisms that hide content.

Internet Forensics

Based on technical investigations we answer questions about who is responsible for online content, who is providing the information technically and where the content is being stored.
Special needs may be met by additional social engineering.

📈 Big Data Analysis

Unstructured data from heterogeneous sources is normalized and put into pre-defined structures for categorization, evaluation and approval.
A combination of centrally defined rules, decentralized editing clients and expert-system learning routines optimizes human-machine interaction.

🔍 Automated Online Research

Search engines are automatically fed with optimized query variations; the provider's ranking of results as well as restrictions on the number of results are circumvented.
Regular research tasks are performed by parallel batch processing, and different perspectives depending on the querying country can be made visible.

Automated IT Workflows

Our specialized scripting environment, featuring pre-defined in- and output formats along with a specialized framework of .NET classes, enables rapid development of multithreaded command line tools which can be deployed and scheduled on the fly for parallel computing in the cloud.

Deep Web Crawling

Just because a Web search engine can't find something doesn't mean it isn't there. You may be looking for information in all the wrong places. The Deep Web is a vast information repository not always indexed by automated search engines. The Shallow Web, also known as the Surface Web or Static Web, is the collection of Web sites indexed by automated search engines. The Deep Web consists of Web pages that search engines cannot or will not index. The popular term "Invisible Web" is actually a misnomer, because the information is not invisible, it is just not indexed by bots. Depending on whom you ask, the Deep Web is five to 500 times as vast as the Shallow Web, making it an immense and extraordinary online resource. Do the math: if major search engines together index only 20% of the Web, they miss 80% of the content.

Search engines typically do not index the following types of Web sites:

  • Proprietary sites
  • Sites requiring registration, e.g. forums
  • Sites with scripts
  • Dynamic sites
  • Ephemeral sites
  • Sites blocked by local webmasters
  • Searchable databases

Proprietary sites require a fee. Registration sites require a login or password. A bot can index script code (e.g., Flash, JavaScript), but it can't always ascertain what the script actually does. Some nasty script junkies have been known to trap bots within infinite loops.

Dynamic Web sites are created on demand and have no existence prior to the query and limited existence afterward (e.g., airline schedules).

If you ever noticed an interesting link on a news site, but were unable to find it later in the day, then you have encountered an ephemeral Web site.

Webmasters can request that their sites not be indexed (Robot Exclusion Protocol), and some search engines skip sites based on their own inscrutable corporate policies.

  • Following link structures in depth until relevant content is found (see the sketch after this list)

  • 🔑 Automatic login for websites requiring registration (forums)

  • Bypassing captchas, link referrers and other kinds of content cloaking

  • Using internal search functions of websites

  • 🌎 Circumventing IP and country blocks

  • 🎓 Automatic filtering and rating of content based on user-specific rules
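
The approach outlined above can be illustrated with a deliberately simplified, depth-limited crawl behind a login wall. This is a minimal sketch using off-the-shelf Python libraries rather than our crawling platform: the URL, form field names, keywords and depth limit are hypothetical placeholders, and real targets additionally require handling of captchas, referrer checks and IP blocks as described above.

```python
# Minimal sketch of a depth-limited crawl behind a login wall.
# All URLs, form field names and keywords are hypothetical placeholders.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE = "https://forum.example.com"          # hypothetical target
KEYWORDS = {"invoice", "fraud"}             # hypothetical search parameters
MAX_DEPTH = 3

session = requests.Session()
session.headers["User-Agent"] = "Mozilla/5.0"   # mimic a normal browser

# Automatic login (field names depend on the actual forum software)
session.post(f"{BASE}/login", data={"username": "user", "password": "secret"})

seen, hits = set(), []

def crawl(url, depth):
    if depth > MAX_DEPTH or url in seen:
        return
    seen.add(url)
    resp = session.get(url, timeout=10)
    soup = BeautifulSoup(resp.text, "html.parser")
    text = soup.get_text(" ", strip=True).lower()
    if any(k in text for k in KEYWORDS):
        hits.append(url)                     # content matching the search parameters
    for a in soup.select("a[href]"):
        link = urljoin(url, a["href"])
        if link.startswith(BASE):            # follow link structures inside the target site
            crawl(link, depth + 1)

crawl(BASE, 0)
print(f"{len(hits)} relevant pages found")
```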

Internet Forensics

Internet Forensics uses the combination of advanced computing techniques and human intuition to uncover clues about people and computers involved in Internet crime, most notably fraud and identity theft.

Anyone who owns websites, stores vital information online or transacts over the internet is under constant threat of falling victim to internet attacks. Internet forensics is therefore very important in making the internet a safe platform for transactions.

We distinguish different aspects of internet forensics:

  • Email Forensics

    Studying the source and content of electronic mail as evidence: identifying the actual sender and recipient of a message and the physical location from which it was sent by analyzing e-mail routing, as well as establishing the date and time it was sent (see the header-parsing sketch after this list). Another part of email forensics is the investigation of lost emails, i.e. determining at what point an email was interrupted on its route (blacklisting, spam filters etc.).

  • Web Forensics

    Answering questions about who is responsible for content published online, who is providing information from a technical point of view and where information available online is actually stored. Web forensics also refers to the monitoring of traffic on a webpage (e.g. how many people have visited and how long they stayed) to help judge how effective a web presence is and what impact it may have.

    Other aspects of web forensics, e.g. relevant for companies' internal security, are analyzing things like the browsing history and web activity of computers to check for suspicious usage or content that has been accessed.

  • Network Forensics

    Network forensics is concerned with the monitoring and analysis of computer network traffic, both local and WAN/internet, for the purpose of information gathering, e.g. for the prevention or detection of unauthorized access to a network.
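
As a minimal illustration of the email forensics aspect, the following sketch reads the "Received" headers of a stored message to reconstruct its delivery path. The input file name is hypothetical, and a real investigation would also cross-check the listed hosts and timestamps against DNS and log data.

```python
# Minimal sketch of e-mail header analysis: reconstructing the delivery path
# from the "Received" headers of a stored message. The file name is hypothetical.
from email import policy
from email.parser import BytesParser

with open("suspicious_message.eml", "rb") as f:          # hypothetical input file
    msg = BytesParser(policy=policy.default).parse(f)

print("Claimed sender :", msg["From"])
print("Recipient      :", msg["To"])
print("Date           :", msg["Date"])

# "Received" headers are prepended by each relaying server, so reading them
# bottom-up follows the message from its origin towards the recipient.
for i, hop in enumerate(reversed(msg.get_all("Received", [])), start=1):
    print(f"Hop {i}: {' '.join(hop.split())}")
```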

Sometimes technical methods are not sufficient to gather all relevant forensic information. In these cases the work of human online detectives is needed: we apply the same methods cyber criminals use and turn them against them. Such non-technical methods, relying heavily on human interaction, are usually referred to as Social Engineering.

  • Determining responsibility for published content

  • Determining technical providers of services and actions

  • 🌎 Determining locations of services and information storage

  • 🔦 Detecting attacks and fraud

  • 🔄 Helping to keep your systems running undisturbed

📈 Big Data Analysis

Big Data is a broad term for data sets that are so large or complex that traditional methods of data processing are inadequate. The challenges in this context include the acquisition, curation, analysis, search, sharing, storage, transmission and visualization of Big Data. Often the term Big Data mainly refers to the use of statistical methods by which application-related statements are extracted from existing data. In contrast, when dealing with Big Data we focus on the reduction and normalization of vast amounts of unstructured data from heterogeneous sources to make them available for effective and targeted evaluation. By reducing the amount of data in a way adapted to the respective application, Big Data analysis becomes fast and inexpensive.

Main aspects of our concept for dealing with big data:

  • Defining the questions to be answered and the data sources to be used beforehand, in collaboration with our client
  • Collecting and normalizing the data
  • Integrating the data into the pre-defined structure
  • Automatically evaluating the data according to rule sets deduced from the client's original questions
  • Categorizing the data
  • Approving the data by human editors
  • Calculating trends, key figures and other statistics
  • Producing reports that help to answer the client's questions

Following these steps, unmanageable and confusing masses of data can be condensed incrementally so that they can be worked with efficiently (a minimal pipeline sketch follows at the end of this section). We call this approach to big data analysis the

Funnel-Approach to Big Data

  • Cost-efficient collection and evaluation of big data without unnecessary expense

  • Centralized storage of rules for filtering and evaluation

  • Decentralized editing environment with optimized human-machine interaction

  • 🎓 Intelligent data assessment according to the principle of expert systems

  • 📄 Extensive logging of data processing and human editing

  • Open interfaces for additional / supplementary data
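
The steps listed at the beginning of this section can be sketched as a small funnel: heterogeneous records are normalized into one pre-defined structure, evaluated against centrally stored rules, categorized and queued for human approval. All field names, rules and categories below are hypothetical examples, not the actual rule sets used in client projects.

```python
# Minimal sketch of the funnel approach: heterogeneous records are normalized,
# evaluated against centrally defined rules, categorized and queued for human
# approval. Field names, rules and categories are hypothetical.
from dataclasses import dataclass

@dataclass
class Record:
    source: str
    text: str
    category: str = "uncategorized"
    relevant: bool = False
    approved: bool = False

def normalize(raw: dict, source: str) -> Record:
    # reduce differing source formats to one pre-defined structure
    text = (raw.get("body") or raw.get("content") or "").strip().lower()
    return Record(source=source, text=text)

RULES = {                       # centrally stored filtering / evaluation rules
    "pricing":   ["price", "offer", "discount"],
    "complaint": ["refund", "broken", "late"],
}

def evaluate(rec: Record) -> Record:
    for category, keywords in RULES.items():
        if any(k in rec.text for k in keywords):
            rec.category, rec.relevant = category, True
            break
    return rec

raw_inputs = [
    ({"body": "Special discount on all items"}, "webshop"),
    ({"content": "My order arrived broken"}, "forum"),
    ({"body": "Unrelated chatter"}, "forum"),
]

funnel = [evaluate(normalize(raw, src)) for raw, src in raw_inputs]
for_review = [r for r in funnel if r.relevant]      # handed to human editors for approval
print(f"{len(for_review)} of {len(funnel)} records passed the funnel")
```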

🔍 Automated Online Research

Online research usually starts with collecting data, very often by querying search engines. Combining search results from different sources (e.g. results coming from search engines and specific online databases) leads to considerable research expense in time and money. By combining automated search engine queries with deep web crawling technologies we offer a significant increase in efficiency and convenience, performing online research in an automated and integrated way.

Problems and inconveniences of search engine usage

Web search engines work by storing information about many web pages, which they retrieve from the HTML markup of the pages. The search engines then analyze the content of each page to determine how it should be indexed. For example, terms can be extracted from titles, page content, headings, or special fields called meta tags. The data about web pages analyzed this way is stored in an index database for use in later queries. Queries by users are performed against this index, helping to return query results as quickly as possible. On the other hand, this indexing mechanism harbours severe problems and inconveniences for the user:

  • Weighting of results based on user behavior

    For many private users it is convenient that search engines nowadays weight the results of a query according to the user's previous behaviour. For business users this can be very annoying, since neutral results, which could then be ranked according to the specific research context, would often be preferred.

  • Weighting of results based on popularity

    Similarly, any ranking of results according to popularity pushes back interesting results that are less popular with the majority of online users, making it very time consuming, and for manual research de facto impossible, to evaluate them.

  • Filtering results according to the geographic origin of requests

    Since for most users it is convenient to receive query results that best fit their own geographic and cultural context, search engines filter results accordingly. Therefore the results for the same query will generally differ if it is issued from IP addresses originating in different countries. In many business settings, though, it can be very interesting to get results as seen from a different country, making it possible to compare the different regional perspectives on query results.

  • Limitations on the number and speed of successive queries

    In order to prevent performance bottlenecks the query rate is strictly limited by search engines.

  • Limitations by virtue of the sheer amount of data available

    The sheer amount of data can prevent a complete collection of the data which is in principle available through search engines.

Our search engine automation puts an end to all the limitations listed above. Optimized search queries are transmitted to search engines in a neutral manner, without any link to previous user behaviour. Parallel batch processing allows for complete collection of search results, which can then be weighted individually according to the needs of our customer. A simulation of different user contexts is possible on demand as well, e.g. simulating requests from different geographical regions or using various software environments.
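
As a simplified illustration of this batch approach, the sketch below generates a few query variations and runs them in parallel before merging and deduplicating the results. The search endpoint, parameters and API key are hypothetical placeholders; real providers require their own APIs, quotas and terms of use.

```python
# Minimal sketch of parallel batch processing of query variations. The search
# endpoint, parameters and API key are hypothetical placeholders.
from concurrent.futures import ThreadPoolExecutor
import requests

BASE_QUERY = "industrial pumps supplier"
VARIATIONS = [BASE_QUERY,
              f'"{BASE_QUERY}"',
              f"{BASE_QUERY} site:.de",
              f"{BASE_QUERY} -used"]

def run_query(query: str) -> tuple[str, list]:
    # hypothetical JSON search API; swap in the provider actually used
    resp = requests.get("https://api.search.example.com/v1/query",
                        params={"q": query, "num": 50, "key": "API_KEY"},
                        timeout=15)
    resp.raise_for_status()
    return query, resp.json().get("results", [])

with ThreadPoolExecutor(max_workers=4) as pool:     # multithreaded batch run
    batches = list(pool.map(run_query, VARIATIONS))

# merge and deduplicate results so they can be re-ranked by customer-specific rules
merged = {item["url"]: item for _, results in batches for item in results}
print(f"{len(merged)} unique results collected from {len(VARIATIONS)} queries")
```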

  • Integrating the use of search engines with deep web crawling

  • Individual query optimization using advanced query configuration options

  • Multithreaded batch processing of query requests

  • Unlimited collection of results

  • Individual filtering and ranking of results based on pre-defined evaluation rules

  • 🔀 Combination with results from different search engines and other sources is possible

  • 🌎 Comparing different user contexts, e.g. simulating geographical and language settings

Automated IT Workflows

By means of our intelligent scripting environment, chains of specialized programs (software agents) are created which reproduce complete data processing workflows. These processing chains can be deployed to Windows or Linux machines and run on individual schedules without human intervention. If appropriate, processing may be deployed to cloud computing environments, enabling any desired scaling.
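
A minimal sketch of such an agent chain is shown below, using Python as a stand-in for the .NET-based scripting environment described here: each agent is a small function that consumes and produces records in a normalized format, and a workflow is simply the ordered list of agents. The field names and processing rules are hypothetical.

```python
# Minimal sketch of a processing chain: each "agent" is a small function with a
# normalized record format as input and output, and a workflow is just the
# ordered list of agents. Field names and cleaning rules are hypothetical.
import json
from typing import Callable, Iterable

Record = dict
Agent = Callable[[Iterable[Record]], Iterable[Record]]

def validate(records):                      # drop records missing required fields
    return [r for r in records if r.get("id") and r.get("amount") is not None]

def transform(records):                     # normalize into the internal format
    return [{"id": str(r["id"]), "amount": float(r["amount"])} for r in records]

def summarize(records):                     # reduce detail data to key figures
    total = sum(r["amount"] for r in records)
    return [{"records": len(records), "total_amount": round(total, 2)}]

def run_workflow(agents: list[Agent], records: Iterable[Record]):
    for agent in agents:                    # agents run in sequence, each feeding the next
        records = agent(records)
    return list(records)

raw = [{"id": 1, "amount": "12.50"}, {"id": None, "amount": "3"}, {"id": 2, "amount": 7}]
print(json.dumps(run_workflow([validate, transform, summarize], raw), indent=2))
```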

In general software agents may perform any kind of data processing tasks such as:

  • Validation
    Ensuring that supplied data is "clean, correct and useful"
  • Sorting / Ranking
    Arranging items in some sequence and/or in different sets according to the data's qualities
  • Transformation
    Changing the data's format and/or storing it in a different place and/or environment
  • Summarization
    Reducing detail data to its main points
  • Aggregation
    Combining multiple pieces of data into new qualities which are more meaningful
  • Combination
    Combining data from different sources and/or with different/complementing qualities
  • Analysis
    Getting meta information on the structure and qualities of the given data
  • Reporting
    Getting overview information on existing data, e.g. trends or key figures
  • Classification
    Indexing by meaningful categories

In principle these tasks are just basic IT applications that are already part of everyday life. But if one looks at typical business situations, some irritating obstacles can be identified which deserve closer attention:

  • Data is available in incompatible formats
  • Data is stored in unconnected places
  • Different data sets are not sufficiently linked
  • Processing requirements are often very similar, but very seldom identical
  • Data can be incomplete and "dirty"

Therefore, tools used for data processing need constant adjustment, and new tools have to be created for tasks that appear at short notice.

Our scripting environment with integrated framework for data processing facilitates and accelerates these tasks and enables

  • Rapid programming of new solutions and interfaces
  • Quick and efficient scaling
  • Easy adaption and reuse of existing solutions
  • Convenient integration of diverse IT landscapes
  • Flexible compilation of workflow scenarios
  • 💻 Entire .NET development environment with code completion

  • Specialized framework for data manipulations (data extraction, crawling, fuzzy matching and more)

  • Various pre-defined in- and output channels (databases, different file formats, queueing systems and more)

  • Multithreaded batch processing

  • Deployment for Windows and Linux (Mono)

  • 🎯 Normalized internal JSON format

  • 💿 Complete workflows can be mapped by chains of successive, interdependent software agents

  • Instance management for running tools with scheduling

  • Cloud deployment (e.g. using the Amazon cloud) and management

Our references include

GEMA

GEMA represents the copyrights of more than 65,000 members in Germany.

Arvato

Development and management of outsourcing processes based on the latest technology.

Freudenberg

Internationally oriented company with technically leading products, solutions and services.

University of Heidelberg, Department of Research

Support for research scientists. A subdivision offers convention management, supporting the design, costing and organization of conventions and other events.

KM matches & lighters

Sales of matches and promotional materials.

Stamm Showers

Spray systems and motion equipment for the pulp industry and drainage technology.

IHK Rhein-Neckar

The Chambers of Commerce offer services for the economy and fulfill state duties delegated to them.

Bavarian Commodities Exchange

The Bavarian Commodities Exchange in Munich is the place where agricultural products for southern Bavaria are traded and quoted.

Contact Us / Request Callback

We will get back to you as soon as possible.