How to send large amounts of data securely

RSA public/private key cryptography is limited in the amount of data it can encrypt. With the commonly used PKCS#1 v1.5 padding and a 2048-bit RSA key (a typical size today), the maximum amount of data that can be encrypted is 245 bytes: the 256-byte modulus minus 11 bytes of padding overhead.

If you want to encrypt and securely send more data than that, you need to use a combination of asymmetric and symmetric encryption algorithms.

In practice, RSA is used to exchange an encrypted key between communicating endpoints that is then used to symmetrically encrypt/decrypt the large data.

Encryption by sender:

  1. Generate a cryptographically strong random key, K, of the length required by a symmetric encryption algorithm such as Rijndael/AES (maximum key size 256 bits).
  2. Symmetrically encrypt your data with Rijndael using the random key generated in step 1.
  3. Using RSA, asymmetrically encrypt the random key generated in step 1 with the public part of the recipient’s RSA key.
  4. Send the RSA-encrypted random key AND the symmetrically encrypted data from steps 2 and 3 to the recipient.

Decryption by recipient:

  1. Decrypt the encrypted key using your private RSA key.
  2. Decrypt the original data using the RSA-decrypted symmetric key from step 1.
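The sender and recipient flows above can be sketched in Python. This is a sketch, not the post’s own C# implementation: it assumes the third-party ‘cryptography’ package, uses AES-GCM for the symmetric step, and uses OAEP rather than the v1.5 padding mentioned earlier (OAEP is the recommended choice for new designs):

```python
import os
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)

# Recipient owns an RSA key pair; sender only needs the public half
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

# --- Sender ---
data = b"A large message, far bigger than one RSA block... " * 1000
session_key = AESGCM.generate_key(bit_length=256)          # step 1: random key K
nonce = os.urandom(12)
ciphertext = AESGCM(session_key).encrypt(nonce, data, None)  # step 2: encrypt data with K
encrypted_key = public_key.encrypt(session_key, oaep)        # step 3: RSA-encrypt K
# step 4: transmit (encrypted_key, nonce, ciphertext)

# --- Recipient ---
recovered_key = private_key.decrypt(encrypted_key, oaep)     # step 1: recover K
plaintext = AESGCM(recovered_key).decrypt(nonce, ciphertext, None)  # step 2: decrypt data
```

Note that the RSA-encrypted key is always one modulus in size (256 bytes for a 2048-bit key), regardless of how large the payload is.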

Here’s how to generate a 2048-bit public/private key pair in C#:

    using (var rsaProvider = new RSACryptoServiceProvider(2048))
    {
        rsaProvider.PersistKeyInCsp = false;

        // Export public key to file
        var publicKey = rsaProvider.ToXmlString(false);
        using (var publicKeyFile = File.CreateText(publicKeyFileName))
        {
            publicKeyFile.Write(publicKey);
        }

        // Export private/public key pair to file
        var privateKey = rsaProvider.ToXmlString(true);
        using (var privateKeyFile = File.CreateText(privateKeyFileName))
        {
            privateKeyFile.Write(privateKey);
        }
    }

The Fall of RNN / LSTM

A good article on “hierarchical neural attention encoders”, the next evolution in neural network design.

Then in the following years (2015–16) came ResNet and Attention. One could then better understand that LSTMs were a clever bypass technique. Attention also showed that an MLP network could be replaced by averaging networks influenced by a context vector.

SQL Server Unindexed Foreign Keys

I saw this, DMV To List Foreign Keys With No Index, via Brent Ozar’s weekly links email.

Unindexed foreign key columns might not be captured by the sys.dm_db_missing_index_details DMV because of their relatively small size. Lack of indexes on foreign keys might have only a small performance impact during reads, but can lead to lock escalations during heavy write loads, causing excessive blocking and possibly deadlocks.

I’ve updated the originally posted query to generate the TSQL to create the missing indexes (compare these against the existing index landscape to see if any indexes can be consolidated before running them in).

[Note: if you are unfortunate enough to have spaces in your table/column names, then you’ll need to replace them with an underscore ‘_’  (or other character) in the index name.]

;with cte_fk as 
( 
    select   
        fk_table_schema = OBJECT_SCHEMA_NAME(fk.parent_object_id),
        fk_table = OBJECT_NAME(fk.parent_object_id),
        fk_column = c.name,
        fk_name   = fk.name,
        fk_has_index = CASE WHEN i.object_id IS NOT NULL THEN 1 ELSE 0 END,
        is_fk_a_pk_also = i.is_primary_key,
        is_index_on_fk_unique = i.is_unique,
        index_def = 'create index NC_' + OBJECT_NAME(fk.parent_object_id) + '_' + c.name + 
           ' ON ' + QUOTENAME(OBJECT_SCHEMA_NAME(fk.parent_object_id)) + '.' + QUOTENAME(OBJECT_NAME(fk.parent_object_id)) + '(' + QUOTENAME(c.name) + ')',
        pk_table_schema = OBJECT_SCHEMA_NAME(fk.referenced_object_id),
        pk_table = OBJECT_NAME(fk.referenced_object_id),
        pk_column = c2.name,
        pk_index_name = kc.name,
        fk.*
    from     
        sys.foreign_keys fk
        join sys.foreign_key_columns fkc ON fkc.constraint_object_id = fk.object_id
        join sys.columns c ON c.object_id = fk.parent_object_id AND c.column_id = fkc.parent_column_id
        left join sys.columns c2 ON c2.object_id = fk.referenced_object_id AND c2.column_id = fkc.referenced_column_id
        left join sys.key_constraints kc ON kc.parent_object_id = fk.referenced_object_id AND kc.type = 'PK'
        left join sys.index_columns ic ON ic.object_id = c.object_id AND ic.column_id = c.column_id
        left join sys.indexes i ON i.object_id = ic.object_id AND i.index_id = ic.index_id
)
select  
    * 
from    
    cte_fk c
    left join sys.dm_db_partition_stats ps on ps.object_id = c.parent_object_id and ps.index_id <= 1
where   
    fk_has_index = 0 
    -- and fk_table = 'mytablename'
order by 
    used_page_count desc

Some TSQL Anti-patterns: ‘DISTINCT’ column list, Scalar valued Functions…

I’ve been looking through some code that I last worked on over a year ago. Since then several people have contributed to the project. The first thing that I noticed was a sprinkling of DISTINCT added to SELECT queries where it was originally totally unnecessary! Paul White puts it better than I could:

“SELECT DISTINCT is sometimes an indication that someone has ‘fixed’ a query that returns duplicate rows in a naive way. This (mis-)use is probably more common among those with relatively little database experience.” https://dba.stackexchange.com/questions/139687/when-to-use-distinct

If you’re a Dev or DBA and you’re about to fix a problem by adding DISTINCT to a query, ask yourself “Am I really fixing the real problem?”. And if DISTINCT is the real answer, then an equivalent GROUP BY might be more efficient.

The next thing I noticed was the addition of scalar valued functions, unsurprisingly causing less than stellar performance. An experienced DBA (or developer) should be fully aware of the performance implications of using scalar valued functions, and should avoid them where possible.


SQL Server 2012 Error: No catalog entry found for partition ID xxx in database N

If you are running SQL Server 2012 and you see this error:

DESCRIPTION: No catalog entry found for partition ID xxx in database N.
The metadata is inconsistent. Run DBCC CHECKDB to check for a metadata 
corruption.

a quick check reveals that database 2 (the database ID reported in my case) is tempDB:

SELECT name FROM sys.databases WHERE database_id = 2;

If you run a DBCC CHECKDB against tempDB:

DBCC CHECKDB ('tempdb') WITH NO_INFOMSGS, TABLERESULTS;

you receive no results, indicating no issues with tempDB.

This is a known issue with a fix:

FIX: “No catalog entry found for partition ID in database ” error when you use SQL Server 2012

Assume that you query the tempdb.sys.allocation_units table in Microsoft SQL Server 2012. When you use the NOLOCK hint in the query, or the query runs under the READ UNCOMMITTED transaction isolation level, you receive the following intermittent 608 error message:

Error: 608 Severity: 16 State: 1
No catalog entry found for partition in database . The metadata is inconsistent. Run DBCC CHECKDB to check for a metadata corruption

Note The DBCC CHECKDB command does not show any sign of database corruption.

Fixed in:

 

SSMS: Tips and Tricks to Enhance Productivity


Set colours to differentiate between servers/environments

You can set SSMS connection properties to give a visual indication of which server your queries are connected to.

When you connect to a SQL Server instance, click on the ‘Options’ button:

ssms1

Then click on the ‘Connection Properties’ tab and choose a custom colour for your connection:

ssms2

Suggested colours for your environments:

  • Production – Red
  • UAT – Orange
  • QA – Yellow
  • Dev – Blue
  • Local – Green

Once set, every time you open a connection to a server, it will display the assigned colour in the SSMS status bar.

Configure SSMS tabs to only show file names

Follow Brent Ozar’s simple instructions here: SSMS 2016: It Just Runs More Awesomely (It’s not just for SSMS 2016). This makes tabs easier to read and pinning tabs is a great idea for often used scripts. [I also like to set my status bar position to the top of the query window]:

ssms3

While you are in the Options dialog, go to Tools -> Options -> Environment -> AutoRecover and make sure AutoRecover files is turned on, with appropriate values set.

Cycle through clipboard text

All Windows users will be familiar with the shortcut keys CTRL+C and CTRL+V. The ‘Cycle Clipboard Ring’ feature in SSMS keeps track of the last 20 items you have cut/copied. You can use CTRL+SHIFT+V to paste the last copied item from the clipboard just as you would with CTRL+V. If you repeatedly press CTRL+SHIFT+V, you cycle through the entries in the Clipboard Ring, selecting the item you want to paste.

This also works in Visual Studio 2015+

List all columns in a table

To quickly list all the columns in a table as a comma separated list, simply drag the ‘Columns’ folder in Object Explorer and drop it onto a query window. This creates a single line of comma separated column names; if you want to format as one column per line, you can use a search and replace utilising a newline with the regex search option turned on.

Highlight the comma separated list of columns you just created, press CTRL+H, turn on regular expression searching, enter a comma followed by a space ‘, ’ as the search text, and replace with a comma followed by a newline ‘,\n’:

ssms4

Disable Copy of Empty Text

Ever had this happen to you? You select a block of text to copy, move to the place you want to paste it, and then accidentally hit CTRL+C again instead of CTRL+V. Your block of copied text has been replaced by an empty block!

You can disable this behaviour (I have no idea why disabled is not the default): go to Tools -> Options -> Text Editor -> All Languages -> General -> ‘Apply Cut or Copy Commands to blank lines when there is no selection’ and uncheck the checkbox.

ssms7

Set Tabs to Insert 4 Spaces

Avoid indentation inconsistencies when opening TSQL files in different editors: go to Tools -> Options -> Text Editor -> Transact-SQL -> Tabs -> Insert Spaces and click the radio button. Set Tab and indent size to 4.

Use GO X to Execute a Batch or Statement Multiple Times

The ‘GO’ command is not a Transact-SQL statement but marks the end of a batch of statements to be sent to SQL Server for processing. By specifying a number after ‘GO’, the batch will be run that number of times. You can use this to repeat statements for creating test data, which can be a simpler alternative to writing a cursor or while loop.

create table MyTestTable
(
    Id int not null identity(1,1) primary key,
    CreatedDate datetime2
)
GO

This will run the insert statement 100 times:

insert into MyTestTable (CreatedDate) select GetDate()
GO 100

Templates and Code Snippets

Many users are not aware of SSMS’s Template Explorer. Its templates contain placeholders/parameters that help you to create database objects such as tables, indexes, views, functions, stored procedures etc.

By default when you open SSMS, the Template Explorer isn’t visible.  Press Ctrl+Alt+T or use the View -> Template Explorer menu to open it. One of my favourite templates is the database mail configuration:

ssms8

Template Explorer provides a view of a folder structure inside the SSMS installation, which is located at C:\Program Files (x86)\Microsoft SQL Server\XXX\Tools\Binn\ManagementStudio\SqlWorkbenchProjectItems\Sql

Templates contain parameter placeholders: press Ctrl+Shift+M to open a dialog box that substitutes values for the template placeholders:

ssms9

You can also add your own templates. Right-click on the SQL Server Templates node of the Explorer and choose New -> Folder and set the folder name. Then right-click on the folder and choose New -> Template. Add your code, with any parameters defined as:

<ParameterName, Datatype, DefaultValue>

Press Ctrl+Shift+M to check that the parameter code blocks are well formed.

Code snippets are similar but simpler, without parameters. Type CTRL+K+X to insert a code snippet.

Registered Servers

Most users have a number of servers they frequently connect to. The Registered Servers feature allows you to save the connection information of these frequently accessed servers.

You can create your own server groups, perhaps grouped by environment or by project.

Navigate to View -> Registered Servers. Then right-click on the ‘Local Server Groups’ and click on ‘New Server Registration’, and enter your connection details.

There is also a Central Management Servers feature, which stores registrations on a central management server; note it supports Windows Authentication only.

Built in Performance Reports in SSMS

SSMS provides a number of built in standard reports. To access the database level reports, right click on a Database –> Reports –> Standard Reports –> Select a Report:

ssms6

Useful SSMS Keyboard Shortcuts

Shortcut Action
CTRL+N Open new query with current database connection
CTRL+O Open a file in a new tab with current database connection
CTRL+R Toggle between displaying and hiding Results Pane
CTRL+M Include actual query execution plan
CTRL+L Display estimated query execution plan
CTRL+TAB Cycle through query windows
F4 Display the Properties Window
CTRL + ] Navigate to the matching parenthesis
CTRL+ALT+T Open Template Explorer
CTRL+SHIFT+M Specify values for Template parameters
CTRL+K+X Insert SQL code snippets
CTRL+SHIFT+U Change text to upper case
CTRL+SHIFT+L Change text to lower case
CTRL+K+C / CTRL+K+U Comment / Uncomment selected text
CTRL+F / CTRL+H Find / Replace

Splitting the Query Window to work on large queries

The query window can be split into two panes so that you can view two parts of the same query simultaneously. To split the window, simply drag the splitter bar at the top right hand side of the query window downwards. Both parts of the split window can be scrolled independently. This is useful if you have a large query and want to compare different parts of the same query.

ssms5

Vertical Block Select Mode

This is a feature I use often. You can use it to select a block of text spanning multiple lines; you can then type and the text will be entered across all the selected rows, or you can paste blocks of text. To use it, hold down the ALT key, then left-click and drag the cursor over the text you want to select, and type/paste the text you want to insert into multiple lines.

Keyboard Shortcut – ALT + SHIFT + Arrow Keys
Mouse – ALT + Left-Click + Drag

Object Explorer details

The Object Explorer Details window is a feature which very few developers use (including me, as I always forget it’s there!). It lists all the objects in a server and additional information like Row Count, Data Space Used, Index Space Used etc. It’s a quick way to see table row counts for all tables in a database.

The Object Explorer Details window is not visible by default. Press F7 or navigate to View -> Object Explorer Details to open it. To add columns, right click on the column header row and select those columns you want to see.

ssms10

Display Query Results in a Separate Tab

If you want to focus on the results after you run a query, and would like to give it as much screen real estate as possible, go to Tools -> Options -> Query Results -> SQL Server -> Results To Grid and enable the option “Display Results in a separate tab”.

What is Feature Engineering?

These are my notes that I condensed from here:

https://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/

Feature engineering is the technique of extracting more information from existing data. You are not adding any new data; you are making the data you already have more useful to a machine learning model.

From https://en.wikipedia.org/wiki/Feature_engineering :

“Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. Feature engineering is fundamental to the application of machine learning, and is both difficult and expensive. The need for manual feature engineering can be obviated by automated feature learning.”

The feature construction/enrichment process is crucial to the success of machine learning (though many neural networks can perform well without it). Feature engineering enables you to get the most out of your data when building predictive models.

Feature engineering is performed once you have completed the first data exploration steps:

  1. Variable Identification
  2. Univariate, Bivariate Analysis
  3. Missing Values
  4. Imputation
  5. Outliers Treatment

Feature engineering can be divided into 2 steps:

  • Variable Transformation
  • Variable/Feature creation

Variable Transformation:

  • Scaling and Centering (termed Standardisation): Standardisation is essential for scale-dependent learning models such as Regression and Neural Networks. This technique ensures numerical features have a mean of 0 and a standard deviation of 1.
  • Normalisation: restricting values to a min-max range, such as [0, 1].
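The two transformations above can be sketched using only the Python standard library (in practice a library such as scikit-learn provides these as StandardScaler and MinMaxScaler):

```python
from statistics import mean, pstdev

values = [12.0, 15.0, 11.0, 19.0, 13.0]

# Standardisation: shift to mean 0, scale to standard deviation 1
mu, sigma = mean(values), pstdev(values)
standardised = [(v - mu) / sigma for v in values]

# Normalisation: rescale into the [0, 1] min-max range
lo, hi = min(values), max(values)
normalised = [(v - lo) / (hi - lo) for v in values]
```

After standardisation the feature has mean 0 and standard deviation 1; after normalisation its smallest value maps to 0 and its largest to 1.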

Variable/Feature creation:

  • Creating values for missing data using domain knowledge, possibly gained from the data or a domain expert (overlaps with Missing Values/Imputation)
  • Feature Selection (selecting the most important features): Investigating feature correlation
  • Creating new features (hyper-features) by combining existing ones, such as summary data (min, max, count) and discretised data

Feature Engineering Concepts

Categorical Feature Decomposition

Imagine you have a categorical feature “Cabin” that can take the values:

{A, B, C or Unknown}

The Unknown value is probably special, representing missing data, but to a model it looks like just another categorical attribute.

To encode this extra information you could create a new binary feature called “HasCabin”, taking the value 1 when an observation has a cabin and 0 when the cabin is unknown.

Additionally, you could create a new binary feature for each of the values that ‘Cabin’ can take: i.e. four binary features: Is_CabinA, Is_CabinB, Is_CabinC and Is_CabinUnknown. This is often referred to as ‘One-Hot Encoding’ [Python’s Pandas library has a built-in method to perform this called get_dummies()]

These additional features could be used instead of the HasCabin feature (if you are using a simple linear model) or in addition to it (if you are using a decision tree based model).
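A minimal sketch of the Cabin features described above, using only the Python standard library (pandas’ get_dummies() would do the one-hot step for you):

```python
observations = ["A", "C", "Unknown", "B", "A"]   # made-up Cabin values
categories = ["A", "B", "C", "Unknown"]

# Binary HasCabin feature: 1 if a cabin is recorded, 0 if unknown
has_cabin = [0 if cabin == "Unknown" else 1 for cabin in observations]

# One-hot encoding: one binary Is_CabinX feature per category
one_hot = [{f"Is_Cabin{c}": int(cabin == c) for c in categories}
           for cabin in observations]
```

Each one-hot row has exactly one feature set to 1, which is where the name comes from.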

DateTime Feature Decomposition

Date-times are essentially integer numerical values that contain information that can be difficult for a model to take full advantage of in a raw form.

There may be cyclical/seasonal relationships present between a date time and other attributes, such as time of day, day of week, month of year, quarter of year, etc.

For example, you could create a new categorical feature called DayOfWeek taking on 7 values. This categorical feature might be useful for a decision tree based model. Seasonality, such as QuarterOfYear, might be a useful feature.

There are often relationships between date-times and other attributes; to expose these you can decompose a date-time into constituent features that may allow models to learn these relationships. For example, if you suspect that there is a relationship between the hour of day and other attributes (such as NumberOfSales), you could create a new numerical feature called HourOfDay for the observation hour that might help a regression model.
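The decomposition described above is a few lines with the standard library’s datetime module (the timestamp and feature names here are made up for illustration):

```python
from datetime import datetime

timestamp = datetime(2018, 5, 14, 16, 30)        # a hypothetical observation

features = {
    "HourOfDay": timestamp.hour,                  # numerical, 0-23
    "DayOfWeek": timestamp.strftime("%A"),        # categorical, 7 values
    "MonthOfYear": timestamp.month,               # numerical, 1-12
    "QuarterOfYear": (timestamp.month - 1) // 3 + 1,  # seasonality, 1-4
}
```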

Numerical Feature Transformation

Continuous numerical quantities might benefit from being transformed to expose relevant information. This includes standardisation (essential for scale-dependent learning models), transforming into a different unit of measure, or decomposing a rate into separate time period and quantity components.

For example, you may have a ShippingWeight quantity recorded in grams as an integer value, e.g. 9260. You could create a new feature with this quantity transformed into rounded kilograms, if the higher precision was not important.

There may be domain knowledge that items with a weight above a certain threshold incur higher rates. That domain-specific threshold could be used to create a new binary categorical feature, ItemWeightAboveXkg.
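Both ShippingWeight transformations above are one-liners; the 20 kg cutoff here is a made-up example of a domain-specific threshold:

```python
shipping_weight_g = 9260                          # raw weight in grams

# Coarser unit of measure: rounded kilograms
weight_kg = round(shipping_weight_g / 1000)

# Binary categorical feature from a hypothetical domain threshold
item_weight_above_20kg = int(shipping_weight_g > 20_000)
```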

Discretisation

Binning, also known as quantisation, is used for transforming continuous numeric features into discrete ones (categories). A continuous numerical feature can be grouped (or binned) into a categorical feature.

It can be useful when data is skewed, or in the presence of extreme outliers.

In fixed-width binning, each bin has a specific fixed width; the bins are usually pre-defined by analysing the data and applying domain knowledge. Binning based on rounding is one example.

The drawback of fixed-width bins is that some bins may be densely populated while others are sparsely populated or empty. In adaptive binning, we use the data distribution to allocate the bin ranges.

Quantile based binning is a good strategy to use for adaptive binning.
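Both strategies can be sketched with the standard library (pandas offers cut() for fixed-width and qcut() for quantile binning):

```python
from statistics import quantiles

ages = [5, 12, 19, 23, 31, 37, 44, 52, 60, 88]   # made-up skewed feature

# Fixed-width binning: 10-year-wide bins chosen up front
fixed_bins = [age // 10 for age in ages]          # 0-9 -> 0, 10-19 -> 1, ...

# Adaptive (quantile) binning: cut points come from the data distribution
cuts = quantiles(ages, n=4)                       # the three quartile boundaries
quartile_bins = [sum(age > c for c in cuts) for age in ages]
```

The fixed-width bins are uneven (the 88-year-old sits alone in bin 8), while the quantile bins hold a roughly equal number of observations each, which is why quantile binning copes better with skew and outliers.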